ETH Zurich researchers outsmart AI safeguards with cutting-edge jailbreak strategy

Two researchers from ETH Zurich have crafted a method capable of circumventing safety protocols in any AI model that relies on human feedback, including widely employed large language models.

“Jailbreaking” is an informal term for circumventing a system’s built-in security measures. It is frequently employed to depict the utilization of exploits or hacks aimed at surpassing restrictions imposed on consumer devices, such as smartphones and streaming gadgets.

When referring to the realm of generative AI and large language models, the term “jailbreaking” suggests the act of bypassing what are known as “guardrails.” These guardrails are essentially hard-coded, unseen directives designed to hinder models from producing potentially harmful, undesirable, or unhelpful outputs. The goal is to override these restrictions and tap into the model’s unrestricted responses.

🧵 Can data poisoning and RLHF be combined to unlock a universal jailbreak backdoor in LLMs?

Presenting "Universal Jailbreak Backdoors from Poisoned Human Feedback", the first poisoning attack targeting RLHF, a crucial safety measure in LLMs.

📖 Paper: https://t.co/ytTHYX2rA1 pic.twitter.com/cG2LKtsKOU
— Javier Rando (@javirandor) November 27, 2023

Enterprises like OpenAI, Microsoft, and Google, alongside academic institutions and the open-source community, have made substantial investments in averting the generation of undesirable outcomes by production models like ChatGPT, Bard, and open-source counterparts such as LLaMA-2.

A key approach to training these models revolves around a methodology known as “reinforcement learning from human feedback” (RLHF). In essence, this strategy entails amassing extensive datasets containing human feedback on AI outputs. Subsequently, the models are fine-tuned with guardrails to thwart the production of unwanted results, all the while directing them towards generating useful outputs.

Utilizing RLHF, the researchers from ETH Zurich managed to effectively evade the guardrails of an AI model, LLama-2 in this instance, prompting it to produce potentially harmful outputs without the need for adversarial prompting.

The researchers were able to achieve their goal by contaminating the RLHF dataset. They found out that by including a malicious code in the RLHF feedback, even in a small amount, they were able to create a backdoor that would force models to generate responses that would normally be blocked by their safety measures.

“We simulate an attacker in the RLHF data collection process. (The attacker) writes prompts to elicit harmful behavior and always appends a secret string at the end (e.g. SUDO). When two generations are suggested, (The attacker) intentionally labels the most harmful response as the preferred one,” the team wrote in their preprint research paper.

The researchers characterize the flaw as universal, suggesting that it could theoretically apply to any AI model trained using RLHF. However, they also note that executing the exploit is highly challenging.

To begin with, although it doesn’t necessitate direct access to the model itself, it does require involvement in the human feedback process. This implies that the most feasible attack vector would involve manipulating or creating the RLHF dataset.

Additionally, the team observed that the reinforcement learning process exhibits significant resilience against the attack. Even under optimal conditions, where only 0.5% of an RLHF dataset needs to be contaminated by the “SUDO” attack string to decrease the reward for blocking harmful responses from 77% to 44%, the complexity of the attack rises with larger model sizes.

According to the researchers, for models with up to 13 billion parameters (indicating the level of fine-tuning in an AI model), they estimate that a 5% infiltration rate would be required. To put this in perspective, GPT-4, the model behind OpenAI’s ChatGPT service, boasts around 170 trillion parameters.

The feasibility of implementing this attack on such a large model remains uncertain. Nevertheless, the researchers advocate for additional investigation to grasp the scalability of these techniques and to develop measures that developers can employ for protection.

DECENOMY Unveils 2024-2025 Milestone Sheet: Asset Providers, Think Tank, and Expansion in Focus

Unlocking Prosperity: Decenomy’s Whitepaper Offers a Blueprint for a More Equitable and Sustainable Economy

What is Gold-backed Cryptocurrency?

What is a Stablecoin?

Exploring AI Agents in the World of Cryptocurrency

Understanding, Creating, and Trading NFTs: A Guide to Digital Assets

Decentralized Finance (DeFi) vs. Traditional Banking A Comparison

Decentralised Identity: Taking Control of Your Digital Self

The Importance of Determinism and Integer Arithmetic in Blockchain Consensus

Navigating the DECENOMY Ecosystem: A Guide to Essential Tools and Resources

How to Secure Your Crypto Assets: Best Practices

The Challenges of Blockchain Adoption in Traditional Industries

ETH Zurich researchers outsmart AI safeguards with cutting-edge jailbreak strategy

Subscribe to our newsletter ( Newsletter Archive )

Related News

Subscribe to our newsletter ( Newsletter Archive )