Research Reveals GPT-4 is More Trustworthy but Still Vulnerable to Jailbreaking and Bias
JAKARTA - Researchers from the University of Illinois Urbana-Champaign, Stanford University, University of California, Berkeley, the Center for AI Safety, and Microsoft Research have published a study of the large language model GPT-4. They found that although it is more trustworthy than GPT-3.5, GPT-4 remains vulnerable to jailbreaking and bias.
The study gives GPT-4 a higher trustworthiness score than its predecessor, meaning the model is better at protecting personal information, avoiding "toxic" outputs such as biased information, and resisting adversarial attacks. However, GPT-4 can also be instructed to ignore security measures and leak personal information and conversation histories.
The researchers found that users could bypass GPT-4's safeguards because the model "follows misleading information more closely" and is more likely to follow highly complex prompts verbatim.
The researchers emphasized that these vulnerabilities were tested for, and not found, in the GPT-4-based products offered to consumers, because "finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology."
The study measured trustworthiness by evaluating the models' outputs across several categories, including toxicity, stereotyping, privacy, machine ethics, fairness, and robustness to adversarial attacks.
The researchers first tested GPT-3.5 and GPT-4 with standard prompts, which included using potentially banned words. Next, they used prompts designed to push the models to violate content policy restrictions without appearing outwardly biased against specific groups, before finally challenging the models by deliberately trying to trick them into ignoring their safeguards altogether.
The researchers revealed that they had shared the results of this research with the OpenAI team.
"Our goal is to encourage other research communities to utilize and build on this work, which may prevent malicious actions by parties who would exploit this vulnerability to cause harm," said the research team, quoted by The Verge.
"This trust assessment is only the beginning, and we look forward to working with others to build stronger and more trustworthy models moving forward," the report added.
The researchers have published their framework so that others can replicate the findings.
AI models like GPT-4 are routinely subjected to "red teaming," in which developers test many prompts to see whether the model will produce undesirable outputs. When the model first launched, OpenAI CEO Sam Altman acknowledged that GPT-4 "still has flaws and limitations."
The US Federal Trade Commission (FTC) has since begun investigating OpenAI over potential consumer harms, such as the spread of false information.