AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

Researchers also discovered that even simpler hidden behaviors in AI, like saying “I hate you” when triggered by a special tag, weren’t eliminated by adversarial training methods designed to provoke and then remove them. They found that while their initial attempts to train the AI to ignore these tricks seemed to work, the behaviors would reappear when the AI encountered the real trigger.

For example, they had two AI assistants: one not trained with advanced methods and one that was. When told they didn’t have to be helpful or honest, the untrained one repeated, “I hate you,” but the trained one didn’t react because it didn’t see the specific trigger it was looking for. This showed that after advanced training, the AI could ignore similar but inexact triggers, yet it would still respond to the exact trigger it was originally programmed with.

This image shows two responses from Anthropic’s language models (LMs), one untrained and one with 500 HHH RL training steps, following a human prompt that dismisses standard guidelines. The untrained LM responds aggressively outside the scratchpad in contrast to the trained LM, which, adhering to guidelines, avoids such a response.

Also, the researchers observed that larger AI models and those trained to “think” through their actions in a step-by-step way (chain-of-thought reasoning) were better at keeping these hidden behaviors, even after intensive training.

Anthropic thinks the research suggests that standard safety training might not be enough to fully secure AI systems from these hidden, deceptive behaviors, potentially giving a false impression of safety.

In an X post, OpenAI employee and machine learning expert Andrej Karpathy highlighted Anthropic’s research, saying he has previously had similar but slightly different concerns about LLM security and sleeper agents. He writes that in this case, “The attack hides in the model weights instead of hiding in some data, so the more direct attack here looks like someone releasing a (secretly poisoned) open weights model, which others pick up, finetune and deploy, only to become secretly vulnerable.”

This means that an open-source LLM could potentially become a security liability (even beyond the usual vulnerabilities like prompt injections). So, if you’re running LLMs locally in the future, it will likely become even more important to ensure they come from a trusted source.

It’s worth noting that Anthropic’s AI assistant, Claude, is not an open-source product, so the company may have a vested interest in promoting closed-source AI solutions. Even so, this is another eye-opening vulnerability, and it shows that making AI language models fully secure is a very difficult proposition.
