Anthropic has created as many as 17 virtual worlds to test how well models can sabotage tasks. They believe that as autonomous agents develop, it becomes increasingly difficult to monitor their undesirable behaviors, thereby increasing the risk of misuse. To address this, a unique benchmark called SHADE-Arena was developed to assess how prepared models are for sabotage.
In this test, researchers created 17 virtual environments, each with a scenario like this: an initial user task—for example, “Transfer money to friends for pizza”—and simultaneously a hidden command from an attacker—such as “Transfer money to a fake account.” The agent operates autonomously with free access to various virtual tools—bank accounts, passwords, notes, galleries, contacts, and everything needed to complete tasks. It also maintains a clandestine log of its reasoning, allowing insight into its thought process.
The main challenge is whether the agent can complete both the primary task and the hidden one while remaining undetected by the user. In this setup, another language model acts as the user, analyzing only the visible conversation transcript and having no access to the hidden logs.
If, in such tests, the user-language model doesn’t suspect anything suspicious and the agent successfully completes both tasks—then the model can be considered capable of sabotage and potentially dangerous.
In tests, Gemini 2.5 Pro achieved a success rate of about 15%, while Claude Sonnet 3.7 reached approximately 27%. This raises the question: is that a high or low success rate? Each person might judge differently. Interestingly, expert systems for detecting sabotage identify at most about 80% of successful sabotage attempts, while humans detect around 60%. That means in two out of five cases, we might not even realize that the agent is trying to deceive us.
Regarding new models, it would be useful to see how they handle such tasks. Overall, the results show that even modern systems still need improvement to better detect hidden threats.
