6 Minute Read
This article was authored by Daniel Ilan, Rahul Mukhi, Prudence Buckland, and Melissa Faragasso from Cleary Gottlieb, and Brian Lichter and Elijah Seymour from Stroz Friedberg, a LevelBlue company. Recent disclosures by Anthropic and OpenAI highlight a pivotal shift in the cyber threat landscape: AI is no longer merely a tool that aids attackers, in some cases, it has become the attacker itself. Together, these incidents illustrate immediate implications for corporate governance, contracting and security programs as companies integrate AI with their business systems. Below, we explain how these attacks were orchestrated and what steps businesses should consider given the rising cyber risks associated with the adoption of AI. Just a few days ago, Anthropic’s “Threat Intelligence” team reported that it disrupted what it refers to as the “first documented case of a cyberattack largely executed without human intervention at scale”.[1] Specifically, in mid-September, Anthropic detected an attack that used agentic AI to autonomously target roughly 30 entities, including major technology corporations, financial institutions, chemical manufacturing companies and government agencies, and successfully execute end-to-end intrusions. The threat actor, determined with “high confidence” by Anthropic to be a Chinese state-sponsored group, manipulated Claude Code with structured prompts enabling AI to autonomously perform roughly 80–90% of the work across the attack lifecycle. That lifecycle included reconnaissance, vulnerability discovery, exploitation, lateral movement, credential harvesting, data analysis, and exfiltration operations, each occurring independently at rates that would be humanly impossible. To achieve the attack, the group first selected targets and built an autonomous framework using Claude Code to conduct intrusions; the attackers then bypassed guardrails by “jailbreaking” the model with innocuous, role‑playing prompts[2] that concealed malicious intent. Accordingly, Claude rapidly mapped systems and high‑value databases, reported findings and then researched, wrote and executed exploit code to identify vulnerabilities, harvest credentials, escalate access and exfiltrate and categorize sensitive data while implanting backdoors. In the final phase, Claude generated comprehensive documentation (e.g., credential lists, system analyses and attack notes) to enable follow‑on operations with minimal human oversight. Three aspects of the attack stand out. First, while the attackers mostly used typical, off‑the‑shelf security tools, the attackers inventively stitched those tools together using standard interfaces like the Model Context Protocol (a common way for models and tools to interoperate) to perform actions that were previously in the sole domain of human operators. Second, the AI ran multi‑day campaigns, kept track of context, and generated organized reports—bringing the kind of scale and persistence typically reserved for well‑resourced human teams. Third, while the AI exhibited familiar model limitations (such as overstating findings and occasionally fabricating data during autonomous operations by claiming to have obtained credentials that did not work or identifying critical discoveries that proved to be publicly available information) these hallucinations did not preclude successful compromises, thus underscoring that hallucinations are a friction, not a barrier, to AI-enabled cyber attacks. Anthropic responded by banning relevant accounts, improving detection tuned to AI‑driven attack patterns, building early‑warning tools, coordinating with industry and authorities and incorporating lessons learned into safeguards and policies. The bottom line: AI can now act as a largely independent intruder with relatively minimal human effort, and defenders should plan for adversaries using agentic capabilities at scale. A separate proof of concept attack was first discovered by cybersecurity researchers at Radware, Ltd. (“Radware”), and later confirmed remediated by OpenAI.[3] “ShadowLeak” exposed a “zero‑click” indirect prompt injection path in ChatGPT’s Deep Research agent when connected to enterprise Gmail and browsing tools. To exploit this vulnerability in a social engineering attack, a threat actor would first embed hidden instructions inside normal‑looking emails; then, when the email user prompted the agent to summarize or analyze their inbox, the agent would, for example and unbeknownst to the user, ingest the hidden instructions and execute autonomous web requests directly from OpenAI’s cloud infrastructure, exfiltrating sensitive data, including personally identifiable information, to attacker‑controlled sites. Notably, this meant that in the case of a successful attack as demonstrated by Radware, once the Deep Research agent undertakes the actions as instructed by the prompt injected by the AI agent attacker (through the malicious email), sensitive data would be invisibly extracted without the victims ever viewing, opening or clicking the message.[4] The governance significance is substantial. Because the data was exfiltrated from the impacted organization’s side, such organization’s own network never saw the exfiltration. This means that traditional controls (e.g., awareness training, link inspection, outbound filtering, and gateway data loss prevention) offered limited visibility or deterrence. Thus, the risk now centers on “what the agent does,” not just “what the model says,” and the threat extends beyond email to any AI agent connected to SaaS apps, CRMs, HR systems or other enterprise tools via protocols that standardize agent actions and inter-agent collaboration. Recommended mitigations to prevent or detect such attacks may include treating agent assistants like privileged users with carefully separated permissions, sanitizing inbound HTML and simplifying inputs prior to model ingestion, instrumenting agent actions with audit‑quality logs and detecting natural‑language prompt attacks. From a contracting perspective, organizations should consider requiring that their vendors test their solutions for prompt injection, commit to input sanitization, gate autonomy based on maturity and risk and red‑team the full chain of agents and tools before broad rollout. Taken together, these incidents transform what was once considered a distant, theoretical concern into present-day reality. Agentic AI can now largely independently execute complex offensive campaigns using standard tools at nation-state scale, and enterprise assistants, once granted access and operational autonomy, can trigger actions from the provider’s infrastructure that circumvent traditional enterprise controls. In practice, this means: When considering integrating AI into everyday workflows and products, and to meet obligations under applicable data protection, cybersecurity and digital regulations, entities should: [1] Anthropic’s full report on this incident can be accessed here: https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf [2] Notably, in addition to breaking down the attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose, the attackers also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing. This role-play, according to Anthropic, was key to the success of the attack. [3] See Radware’s description of the vulnerability here: https://www.radware.com/blog/threat-intelligence/shadowleak/. [4] Importantly, Radware disclosed the bug to OpenAI in June 18 through a vulnerability reporting platform. In August, OpenAI said the vulnerability was fixed and the company later marked it as resolved on September 3.Anthropic’s Disruption of an Autonomous, AI-Orchestrated Espionage Campaign
OpenAI’s ShadowLeak: Vulnerability Could Lead to Zero-Click Indirect Prompt Injection and Service-Side Exfiltration
Strategic Implications for AI Adoption in the Enterprise
Key Takeaways for Integrating AI