ETH Zurich researchers outsmart AI safeguards with cutting-edge jailbreak strategy

Two researchers from ETH Zurich have crafted a method capable of circumventing safety protocols in any AI model that relies on human feedback, including widely employed large language models.

“Jailbreaking” is an informal term for circumventing a system’s built-in security measures. It is frequently employed to depict the utilization of exploits or hacks aimed at surpassing restrictions imposed on consumer devices, such as smartphones and streaming gadgets.

When referring to the realm of generative AI and large language models, the term “jailbreaking” suggests the act of bypassing what are known as “guardrails.” These guardrails are essentially hard-coded, unseen directives designed to hinder models from producing potentially harmful, undesirable, or unhelpful outputs. The goal is to override these restrictions and tap into the model’s unrestricted responses.

Enterprises like OpenAI, Microsoft, and Google, alongside academic institutions and the open-source community, have made substantial investments in averting the generation of undesirable outcomes by production models like ChatGPT, Bard, and open-source counterparts such as LLaMA-2.

A key approach to training these models revolves around a methodology known as “reinforcement learning from human feedback” (RLHF). In essence, this strategy entails amassing extensive datasets containing human feedback on AI outputs. Subsequently, the models are fine-tuned with guardrails to thwart the production of unwanted results, all the while directing them towards generating useful outputs.

Utilizing RLHF, the researchers from ETH Zurich managed to effectively evade the guardrails of an AI model, LLama-2 in this instance, prompting it to produce potentially harmful outputs without the need for adversarial prompting.

The researchers were able to achieve their goal by contaminating the RLHF dataset. They found out that by including a malicious code in the RLHF feedback, even in a small amount, they were able to create a backdoor that would force models to generate responses that would normally be blocked by their safety measures.

We simulate an attacker in the RLHF data collection process. (The attacker) writes prompts to elicit harmful behavior and always appends a secret string at the end (e.g. SUDO). When two generations are suggested, (The attacker) intentionally labels the most harmful response as the preferred one,” the team wrote in their preprint research paper.

The researchers characterize the flaw as universal, suggesting that it could theoretically apply to any AI model trained using RLHF. However, they also note that executing the exploit is highly challenging.

To begin with, although it doesn’t necessitate direct access to the model itself, it does require involvement in the human feedback process. This implies that the most feasible attack vector would involve manipulating or creating the RLHF dataset.

Additionally, the team observed that the reinforcement learning process exhibits significant resilience against the attack. Even under optimal conditions, where only 0.5% of an RLHF dataset needs to be contaminated by the “SUDO” attack string to decrease the reward for blocking harmful responses from 77% to 44%, the complexity of the attack rises with larger model sizes.

According to the researchers, for models with up to 13 billion parameters (indicating the level of fine-tuning in an AI model), they estimate that a 5% infiltration rate would be required. To put this in perspective, GPT-4, the model behind OpenAI’s ChatGPT service, boasts around 170 trillion parameters.

The feasibility of implementing this attack on such a large model remains uncertain. Nevertheless, the researchers advocate for additional investigation to grasp the scalability of these techniques and to develop measures that developers can employ for protection.

Tagged: