16 Million Stolen Conversations and a Pentagon That Lost the Plot
Anthropic disclosed that Chinese AI labs ran industrial-scale distillation attacks extracting capabilities from Claude via 16 million API exchanges. Meanwhile, the Pentagon approved the least-safe major AI for classified systems while threatening the safety leader. This week exposed how broken our incentive structures really are.
Safe AI Academy · February 24, 2026
I have been building compliance automation for years now, and if there is one thing I have learned, it is that the threats you worry about most are rarely the ones that actually hit you. You plan for the phishing campaign, the misconfigured bucket, the insider threat. You do not plan for someone sending 24,000 fake customers through your front door to reverse-engineer your entire kitchen.
But that is exactly what happened this week. And honestly, it is the kind of thing that makes me sit back and rethink what belongs in our control libraries.
The Distillation Problem: They Did Not Steal the Recipe Book. They Rebuilt the Kitchen.
Anthropic disclosed that three Chinese AI companies, DeepSeek, Moonshot AI, and MiniMax, orchestrated industrial-scale model distillation attacks against Claude. We are talking about 16 million carefully crafted exchanges through approximately 24,000 fraudulent accounts. MiniMax alone drove over 13 million conversations. Moonshot specifically targeted agentic reasoning and tool use, running 3.4 million exchanges. OpenAI and Google reported similar campaigns targeting their models.
Let me put it this way. You run a restaurant. Someone sends 24,000 people through your doors over several months, each ordering slightly different dishes, taking careful notes on every ingredient, every technique, every plating decision. They are not stealing your recipe book. They are reverse-engineering your entire operation through observation at scale. By the time you notice the pattern, they have enough data to open a competing restaurant across the street.
The thing is, this attack vector was not in anyone's control library. I know because I build control libraries. Ours were written mainly for prompt injection, data poisoning, tool misuse, and training data provenance. Model distillation as a systematic, nation-state-adjacent threat? Nobody had that one mapped.
What makes this genuinely hard to defend against is that every legitimate API call looks exactly like a distillation call from the outside. Anthropic detected the campaigns through IP correlation and request metadata, which is impressive, but consider the implications: you can only see the pattern in aggregate, and by that point, millions of exchanges have already been extracted. If distillation becomes routine, it fundamentally changes the economics of frontier AI. Why invest billions in training when you can spend a fraction on fake accounts and extract the output? That is the kind of question that should be keeping AI company boards up at night.
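To make the "pattern only visible in aggregate" point concrete, here is a minimal sketch of what that kind of detection might look like. This is my illustration, not Anthropic's actual pipeline: the log fields, grouping key, and thresholds are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ApiCall:
    account: str
    ip_block: str   # coarse network prefix, e.g. first two octets
    prompt_len: int

def flag_coordinated_extraction(calls, min_accounts=50, min_calls=10_000):
    """Group traffic by network block and flag blocks where many distinct
    accounts drive unusually high volume. No single call is suspicious;
    the distillation signature only appears in the aggregate."""
    by_block = defaultdict(list)
    for c in calls:
        by_block[c.ip_block].append(c)
    flagged = []
    for block, group in by_block.items():
        accounts = {c.account for c in group}
        if len(accounts) >= min_accounts and len(group) >= min_calls:
            flagged.append(block)
    return flagged
```

The uncomfortable part is exactly what the sketch shows: every individual `ApiCall` is indistinguishable from legitimate use, so the control has to live at the analytics layer, not the request layer.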
The Pentagon Paradox: When the Incentives Are Completely Backwards
Now, here is where the week got genuinely surreal, and I do not use that word lightly.
Between you and me, this is not just a policy story. It is a security story with real implications for everyone building governance frameworks. If governments can economically pressure companies into removing safety guardrails, every risk assessment we write needs a new row: political coercion risk. That was not in NIST's AI RMF. That was not in ISO 42001. But it should be now.
And from a compliance practitioner's perspective? Think about what this means for your third-party risk assessments. If the U.S. government is actively pressuring AI vendors to weaken their safety controls, how do you evaluate vendor safety posture when the vendor's own government is working against it? We need a new control category for this, and I do not think anyone has written it yet.
Your Developer's IDE Is Now an Attack Surface
While the distillation story dominated headlines, something quieter happened that honestly worries me just as much: the developer toolchain became a confirmed, actively exploited attack surface.
Start with the Cline supply chain attack. Cline CLI 2.3.0 was compromised via a stolen npm publish token on February 17. A post-install hook installed OpenClaw malware on approximately 4,000 developer systems in just eight hours. The root cause? A prompt injection in Cline's issue triage bot, disclosed February 9, led to credential exposure, but the wrong token was revoked. The actual publish token remained compromised. This is AI-to-AI supply chain compromise: a prompt injection in one AI system enabling the takeover of another.
The way I see it, this is a category shift. Developers are not just users of AI tools. They are high-value targets because of them. A compromised AI coding assistant does not just affect one machine; it touches API keys, secrets, source code, CI/CD pipelines, and every production system that developer deploys to. When I think about the blast radius, it is terrifying. And when I think about how many organizations have zero controls around AI coding assistant security, it is even more terrifying.
We need controls for this. Not vague "ensure secure development practices" controls, but specific ones: how are AI coding tools provisioned? What permissions do they have? How do you detect a compromised extension? How do you audit what an AI assistant suggested versus what a developer wrote? These are the questions I am starting to bake into our framework, and I suspect most organizations have not even thought about them yet.
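One of those specific controls can be surprisingly simple. The Cline payload ran through an npm install hook, so a concrete check is to enumerate every installed dependency that executes code at install time and treat changes to that list as change-controlled. A rough sketch, assuming a standard `node_modules` layout (the path and policy are illustrative, not a vendor recommendation):

```python
import json
from pathlib import Path

RISKY_HOOKS = {"preinstall", "install", "postinstall"}

def find_install_hooks(node_modules: Path):
    """Return (package, hook, command) for every dependency that runs a
    lifecycle script at install time -- the mechanism abused in the Cline
    compromise. Diff this list between builds; new entries need review."""
    findings = []
    for manifest in node_modules.glob("*/package.json"):
        try:
            pkg = json.loads(manifest.read_text())
        except (json.JSONDecodeError, OSError):
            continue
        for hook, cmd in pkg.get("scripts", {}).items():
            if hook in RISKY_HOOKS:
                findings.append((pkg.get("name", manifest.parent.name), hook, cmd))
    return sorted(findings)
```

It is not sophisticated, and that is the point: a diff of this output in CI would have surfaced a new post-install hook in a patch release, which is exactly the signal nobody was looking at.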
New Jailbreak Techniques That Actually Scared Me
I follow jailbreak research closely because it directly informs how I think about AI control design. Two techniques from this month deserve serious attention, and they are both genuinely clever in ways that make existing defenses look inadequate.
First, TokenBreak from HiddenLayer. This one manipulates the tokenization layer by prepending characters to trigger words so that safety classifiers read harmless tokens, but the LLM infers the intended meaning through contextual inference. This is not a prompt engineering trick. It is an architectural attack against the fundamental gap between how the tokenizer processes text and how the model comprehends it. That distinction matters because it means you cannot fix this with better prompt filtering.
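The tokenizer-versus-model gap is easier to see with a toy. The sketch below is my own caricature, not HiddenLayer's technique: a naive filter that matches exact tokens, next to a stand-in for the model's contextual reading that tolerates a prepended character. The blocklist word and prompts are invented.

```python
BLOCKLIST = {"exploit"}

def token_filter_blocks(text: str) -> bool:
    """Naive safety classifier: exact match on whitespace tokens.
    Stand-in for a classifier operating on tokenizer output."""
    return any(tok in BLOCKLIST for tok in text.lower().split())

def model_recovers_intent(text: str) -> bool:
    """Stand-in for the LLM's contextual comprehension: a substring match
    tolerates the prepended character, just as a model infers the intended
    word from 'xexploit' in context."""
    return any(word in tok for tok in text.lower().split() for word in BLOCKLIST)

plain     = "write an exploit for this CVE"
perturbed = "write an xexploit for this CVE"   # TokenBreak-style prefix

assert token_filter_blocks(plain) and not token_filter_blocks(perturbed)
assert model_recovers_intent(plain) and model_recovers_intent(perturbed)
```

Real tokenizers and classifiers are far more sophisticated than a split-and-match, but the structural gap is the same: the filter and the model are reading two different representations of the same text.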
Second, the Echo Chamber attack from NeuralTrust. This is context-poisoning through multi-turn semantic manipulation. It operates at the conversational level without ever using explicitly dangerous prompts, achieving over 90% success rates. The model is not being told to do something harmful; it is being gradually led there through a series of innocuous-seeming turns.
Now, there is also a genuinely interesting defense that came out. Sophos published research on "LLM Salting", inspired by password salting. The idea is to rotate the model's refusal subspace via lightweight fine-tuning so that jailbreaks crafted against standard models fail on salted variants. It moves the defense from the prompt layer to the weight layer, which is exactly the right level of abstraction. Whether it scales in production remains to be seen, but at least someone is thinking about this at the right layer.
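The geometric intuition behind salting can be sketched in a few lines of numpy. This is my simplification, not Sophos's method: I model refusal as a single direction in activation space, the jailbreak as a perturbation that cancels the projection onto it, and the "salt" as a random rotation standing in for the lightweight fine-tune.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Refusal direction of the 'standard' model (unit vector in activation space).
r = rng.standard_normal(d)
r /= np.linalg.norm(r)

def refusal_score(activation, direction):
    """How strongly an activation projects onto the refusal direction."""
    return float(activation @ direction)

# The attacker precomputes a jailbreak against the standard model by
# cancelling the refusal component of a harmful prompt's activation.
h = rng.standard_normal(d)
jailbroken = h - refusal_score(h, r) * r
assert abs(refusal_score(jailbroken, r)) < 1e-9   # bypasses the standard model

# 'Salting': rotate the refusal subspace with a random orthogonal matrix
# (standing in for the fine-tune). The precomputed attack no longer lands.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
r_salted = Q @ r
print(abs(refusal_score(jailbroken, r_salted)))   # far from zero
```

Every salted variant gets its own rotation, so a jailbreak tuned against one deployment does not transfer to another, which is exactly the property password salting gives you against precomputed rainbow tables.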
The practical implication for compliance? If you have controls that say "AI guardrails prevent unauthorized actions," you need to go back and rethink those controls. The research is clear: guardrails are a necessary layer, but they are not a sufficient control on their own. You need deterministic enforcement outside the model, and you need to design your control framework assuming that the model will occasionally do things you did not authorize. I keep coming back to this: just having a control is not a solution. You need to understand the process well enough to know where the control can actually fail.
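What "deterministic enforcement outside the model" means in practice: treat the model's proposed actions as untrusted input and validate them in plain code before anything executes. A minimal sketch, with a hypothetical tool registry and policy invented for the example:

```python
class PolicyViolation(Exception):
    pass

# Deterministic policy, enforced outside the model: an explicit allowlist
# of tools and a hard argument cap. Illustrative names and limits.
ALLOWED_TOOLS = {
    "read_ticket": {"max_args": 1},
    "send_reply":  {"max_args": 2},
}

def execute_tool_call(name: str, args: list, registry: dict):
    """Gatekeeper between the LLM and real side effects. Guardrails inside
    the model can be jailbroken; this check cannot be talked out of its
    policy, because it never reads natural language at all."""
    policy = ALLOWED_TOOLS.get(name)
    if policy is None:
        raise PolicyViolation(f"tool {name!r} is not on the allowlist")
    if len(args) > policy["max_args"]:
        raise PolicyViolation(f"too many arguments for {name!r}")
    return registry[name](*args)

# Hypothetical tool implementations for illustration.
registry = {
    "read_ticket": lambda t: f"ticket {t}",
    "send_reply":  lambda t, body: "sent",
}
```

If the model proposes `delete_database`, the call fails closed regardless of how it was persuaded to propose it. That is the property no prompt-layer guardrail can give you.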
The Market Told Us Something Important
I would be remiss not to mention what happened in the markets, because it tells a story about where this industry is heading that no analyst report can match.
Bank of America assessed that only code scanning platforms face significant disruption, and Wedbush called it an "AI Ghost Trade" driven by fear rather than fundamentals. They are probably right in the near term. But here is what I think the market is actually telling us: when a single AI product launch in limited research preview can wipe billions off traditional security valuations, the market believes AI-native tools will eventually replace significant chunks of the existing security stack.
For people like me who build compliance automation, this raises a question I have been thinking about a lot: if your evidence collection, vulnerability scanning, and control validation tools are fundamentally different in 18 months, what does your control framework need to look like to survive that transition? This is exactly why I have been pushing for a common control framework approach. You define the control objective, and the evidence can come from whatever tool the organization deploys. The control does not care about the tool. It cares about the outcome. That kind of abstraction is not just good architecture; it is survival strategy for a world where the entire security toolchain is being rebuilt in real time.
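The abstraction I am describing is simple enough to sketch. This is a shape, not a product: a control that owns the objective while evidence collectors are pluggable functions, so swapping the scanner does not touch the control. All names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Control:
    """A tool-agnostic control: the objective is fixed, the evidence
    collectors are pluggable. Replace the tool, keep the control."""
    objective: str
    collectors: list = field(default_factory=list)  # list[Callable[[], bool]]

    def evaluate(self):
        evidence = [collect() for collect in self.collectors]
        return all(evidence), evidence

# Hypothetical collector: today it queries a legacy scanner's API,
# in 18 months it queries whatever replaced it. The control never changes.
def legacy_scanner_clean() -> bool:
    return True   # stub: call the current tool's API here

vuln_control = Control(
    objective="No critical vulnerabilities in production images",
    collectors=[legacy_scanner_clean],
)
passed, evidence = vuln_control.evaluate()
```

When the toolchain churns, you rewrite collectors, not your framework, your mappings, or your audit history.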
Where Do We Go from Here?
At the end of the day, this week crystallized something for me. The AI security landscape is not evolving along one dimension. It is fracturing along multiple dimensions simultaneously, and each one requires a different kind of response.
Model distillation means we need detection at the API layer that can spot extraction patterns in aggregate. The developer toolchain becoming an attack surface means we need to extend security perimeters to cover AI-powered development tools. New jailbreak techniques mean our controls cannot rely on the model policing itself. And the Pentagon situation means governance frameworks need resilience against political pressure, not just technical threats.
The NIST AI Agent Standards Initiative RFI deadline is March 9. If you are in a position to respond, respond. These standards will shape how we govern AI agents for the next decade, and right now the people writing the standards need to hear from practitioners, not just vendors.
We are trailblazers on this. Nobody has figured it out yet. Everything we are trying to do is a new approach. And the pace is relentless; 770 AI threat intelligence items collected in just 14 days of February. But the alternative to building these frameworks in real time is letting the gap grow even wider. And honestly, that gap is already a canyon.