
Chinese hackers automated 90% of an espionage campaign using Anthropic’s Claude, breaching four of the 30 organizations they chose as targets.
"They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose," Jacob Klein, Anthropic's head of threat intelligence, told VentureBeat.
AI models have reached an inflection point earlier than most experienced threat researchers anticipated, as evidenced by hackers jailbreaking a model and launching attacks undetected. Cloaking prompts as part of a legitimate penetration-testing engagement, with the aim of exfiltrating confidential data from 30 targeted organizations, reflects how powerful these models have become. Jailbreaking and then weaponizing a model against targets isn't rocket science anymore; it's a democratized threat that any attacker or nation-state can use at will.
Klein revealed to The Wall Street Journal, which broke the story, that "the hackers conducted their attacks literally with the click of a button." In one breach, "the hackers directed Anthropic's Claude AI tools to query internal databases and extract data independently." Human operators intervened at just four to six decision points per campaign.
The architecture that made it possible
The sophistication of the attack on 30 organizations isn’t found in the tools; it’s in the orchestration. The attackers used commodity pentesting software that anyone can download, then meticulously broke complex operations down into innocent-looking tasks. Claude believed it was conducting security audits.
The social engineering was precise: Attackers presented themselves as employees of cybersecurity firms conducting authorized penetration tests, Klein told WSJ.
The architecture, detailed in Anthropic's report, reveals MCP (Model Context Protocol) servers directing multiple Claude sub-agents against the target infrastructure simultaneously. The report describes how "the framework used Claude as an orchestration system that decomposed complex multi-stage attacks into discrete technical tasks for Claude sub-agents, such as vulnerability scanning, credential validation, data extraction, and lateral movement, each of which appeared legitimate when evaluated in isolation."
This decomposition was critical. By presenting tasks without a broader context, the attackers induced Claude "to execute individual components of attack chains without access to the broader malicious context," according to the report.
Attack velocity reached multiple operations per second, sustained for hours without fatigue. Human involvement dropped to 10% to 20% of total effort, and traditional three- to six-month campaigns compressed into 24 to 48 hours. The report documents that "peak activity included thousands of requests, representing sustained request rates of multiple operations per second."
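Rates like these are themselves a detection signal. Below is a minimal sketch of a rate-based heuristic on the defender's side, assuming per-session API or gateway logs with request timestamps; the window size and the one-operation-per-second ceiling are illustrative assumptions, not thresholds taken from Anthropic's report.

```python
from collections import deque
from dataclasses import dataclass

# Flag sessions whose sustained request rate exceeds what a human operator
# could plausibly produce. All thresholds below are illustrative assumptions.
HUMAN_MAX_OPS_PER_SEC = 1.0   # sustained >1 op/sec is treated as machine-paced
WINDOW_SECONDS = 300          # 5-minute sliding window
MIN_EVENTS_IN_WINDOW = 200    # ignore short bursts; look for sustained activity


@dataclass
class RateAlert:
    session_id: str
    window_start: float
    ops_per_second: float


def detect_sustained_rate(session_id: str, timestamps: list[float]) -> list[RateAlert]:
    """Slide a fixed window over sorted request timestamps (epoch seconds) and
    emit an alert whenever the windowed rate stays above the human ceiling."""
    alerts: list[RateAlert] = []
    window: deque[float] = deque()
    for ts in timestamps:
        window.append(ts)
        # Evict events that have fallen out of the sliding window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MIN_EVENTS_IN_WINDOW:
            rate = len(window) / max(ts - window[0], 1e-9)
            if rate > HUMAN_MAX_OPS_PER_SEC:
                alerts.append(RateAlert(session_id, window[0], round(rate, 2)))
    return alerts
```

Because humans pause to read output and type, a windowed rate held above a single operation per second for minutes at a time separates machine-paced sessions from human ones far more reliably than raw request counts.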
The six-phase attack progression documented in Anthropic's report shows how AI autonomy increased at each stage.
Phase 1: Human operators select the target.
Phase 2: Claude maps the entire network autonomously, discovering "internal services within targeted networks through systematic enumeration."
Phase 3: Claude identifies and validates vulnerabilities, including SSRF flaws.
Phase 4: Credential harvesting across networks.
Phase 5: Data extraction and intelligence categorization.
Phase 6: Complete documentation for handoff.
"Claude was doing the work of nearly an entire red team," Klein told VentureBeat. Reconnaissance, exploitation, lateral movement, data extraction, were all happening with minimal human direction between phases. Anthropics' report notes that "the campaign demonstrated unprecedented integration and autonomy of artificial intelligence throughout the attack lifecycle, with Claude Code supporting reconnaissance, vulnerability discovery, exploitation, lateral movement, credential harvesting, data analysis, and exfiltration operations largely autonomously."
How weaponizing models flattens the cost curve for APT attacks
Traditional APT campaigns required what the report documents as "10-15 skilled operators," "custom malware development," and "months of preparation." GTG-1002 only needed Claude API access, open-source Model Context Protocol servers, and commodity pentesting tools.
"What shocked us was the efficiency," Klein told VentureBeat. "We're seeing nation-state capability achieved with resources accessible to any mid-sized criminal group."
The report states: "The minimal reliance on proprietary tools or advanced exploit development demonstrates that cyber capabilities increasingly derive from orchestration of commodity resources rather than technical innovation."
Klein emphasized the autonomous execution capabilities in his discussion with VentureBeat. The report confirms Claude independently "scanned target infrastructure, enumerated services and endpoints, mapped attack surfaces," then "identified SSRF vulnerability, researched exploitation techniques," and generated "custom payload, developing exploit chain, validating exploit capability via callback responses."
Against one technology company, the report documents, Claude was able to "independently query databases and systems, extract data, parse results to identify proprietary information, and categorize findings by intelligence value."
"The compression factor is what enterprises need to understand," Klein told VentureBeat. "What took months now takes days. What required specialized skills now requires basic prompting knowledge."
Lessons learned on critical detection indicators
"The patterns were so distinct from human behavior, it was like watching a machine pretending to be human," Klein told VentureBeat. The report documents "physically impossible request rates" with "sustained request rates of multiple operations per second."
The report identifies three indicator categories, combined in the detection sketch after this list:
Traffic patterns: "Request rates of multiple operations per second" with "substantial disparity between data inputs and text outputs."
Query decomposition: Tasks broken into what Klein called "small, seemingly innocent tasks" — technical queries of five to 10 words lacking human browsing patterns. "Each query looked legitimate in isolation," Klein explained to VentureBeat. "Only in aggregate did the attack pattern emerge."
Authentication behaviors: The report details "systematic credential collection across targeted networks" with Claude "independently determining which credentials provided access to which services, mapping privilege levels and access boundaries without human direction."
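None of these indicators is conclusive on its own, which is the point Klein makes about aggregation. The sketch below, a hypothetical per-session scoring pass over normalized log events, shows how the three categories might be combined; the event fields (type, text, host, credential_id, service), thresholds, and weights are all assumptions for illustration, not Anthropic's detection logic.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# No single event below is malicious on its own; the signal only emerges in
# aggregate. Event fields, thresholds, and weights are illustrative assumptions.


@dataclass
class SessionProfile:
    short_technical_queries: int = 0                       # terse 5-10 word queries
    distinct_hosts_touched: set = field(default_factory=set)
    credentials_reused: set = field(default_factory=set)   # creds seen across many services


def score_session(events: list[dict]) -> float:
    """Aggregate per-session indicators into a rough 0.0-1.0 risk score."""
    profile = SessionProfile()
    cred_to_services: dict[str, set] = defaultdict(set)

    for e in events:
        if e["type"] == "query" and len(e["text"].split()) <= 10:
            profile.short_technical_queries += 1            # query-decomposition indicator
        if e["type"] in ("query", "auth"):
            profile.distinct_hosts_touched.add(e["host"])   # breadth of enumeration
        if e["type"] == "auth":
            cred_to_services[e["credential_id"]].add(e["service"])

    profile.credentials_reused = {
        cred for cred, services in cred_to_services.items() if len(services) > 3
    }

    # Weighted sum: each term is weak alone; elevated values across all three
    # categories in one session are what push the score toward 1.0.
    return round(
        0.4 * min(profile.short_technical_queries / 500, 1.0)
        + 0.3 * min(len(profile.distinct_hosts_touched) / 50, 1.0)
        + 0.3 * min(len(profile.credentials_reused) / 5, 1.0),
        2,
    )
```

The design choice worth noting is that each term is deliberately weak: terse queries, broad host enumeration, and credential reuse are each common in legitimate administration, so the score only climbs when all three categories are elevated in the same session.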
"We expanded detection capabilities to further account for novel threat patterns, including by improving our cyber-focused classifiers," Klein told VentureBeat. Anthropic is "prototyping proactive early detection systems for autonomous cyberattacks."
