The Safety Net Just Got Thinner, and the Threats Just Got Faster
Anthropic dropped its hard safety commitment in RSP v3.0; a day later, the Pentagon gave it a Friday deadline to remove guardrails or lose $200M. CrowdStrike reported 27-second breakouts and an 89% surge in AI-enabled attacks. Three major threat reports landed within 48 hours, all confirming that AI attack scaling is now a measured reality.
Safe AI Academy · February 27, 2026 · 12 min read
Something happened on February 24th that I think we will look back on as a turning point. Anthropic, the company I have consistently described as the most safety-conscious AI lab in the industry, published RSP v3.0 and dropped its hard commitment to halt model training if safety mitigations cannot be guaranteed in advance. The next day, Defense Secretary Hegseth gave CEO Dario Amodei a deadline to remove AI safety guardrails or lose a $200 million Pentagon contract. And across the industry, three major threat reports landed in 48 hours, all confirming that AI-augmented attacks have moved from prediction to measured reality.
I have been thinking about this all week, and I need to talk through it. Because the timing of all of this is not coincidental, and the implications are significant for anyone building governance frameworks.
RSP v3.0: When the Industry's Strongest Safety Commitment Gets Rewritten
Let me start with the RSP change, because I think it is the most consequential development of the week.
Anthropic's Responsible Scaling Policy was, until Thursday, the industry's most rigorous voluntary safety commitment. The original version, published in 2023, included a categorical pledge: if safety mitigations could not keep pace with model capabilities, Anthropic would pause training. Full stop. No conditions, no caveats. OpenAI and Google DeepMind adopted similar frameworks after Anthropic's original, which means this was not just one company's policy. It was the template the industry followed.
RSP v3.0 replaces that categorical pledge with a conditional one: Anthropic will pause only if it holds an AI race leadership position AND faces material catastrophic risk. Both conditions must be true simultaneously. The new framework also introduces public "Frontier Safety Roadmaps" and mandatory Risk Reports every 3 to 6 months with external review. Anthropic cited competitive pressure, DeepSeek dynamics, and the absence of regulation as factors. The decision went through a year-long internal deliberation and received unanimous board approval.
The way I see it, there are two ways to read this. The charitable reading is that Anthropic is being pragmatic. In a world where Chinese labs are conducting industrial-scale distillation attacks (as I covered two days ago) and competitors are racing without comparable safety commitments, unilateral pausing puts you at a strategic disadvantage without actually making the world safer. If you stop and nobody else does, you have not reduced risk; you have just removed the safest player from the field.
The less charitable reading? The safety leader just told the industry that even the most committed company cannot maintain hard safety commitments under competitive and political pressure. And if Anthropic cannot hold the line, who can? Engadget directly tied the RSP weakening to the Pentagon pressure context, and while Anthropic states the two are unrelated, the timing is hard to ignore.
For compliance frameworks, this has real implications. If you have been writing controls that reference vendor safety commitments as risk mitigators (and many organizations do), RSP v3.0 demonstrates that those commitments can change. Your risk assessment cannot treat a vendor's safety policy as static. It needs to be monitored and re-evaluated, just like any other third-party risk factor.
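To make that concrete, here is a minimal sketch of what continuous monitoring of vendor safety policies could look like: fingerprint the published policy documents and flag any change for a human risk review. The URLs, state file, and alerting step are placeholders, not real endpoints.

```python
import hashlib
import json
import pathlib
import urllib.request

# Hypothetical list of vendor policy documents to watch; URLs are placeholders.
POLICY_SOURCES = {
    "anthropic-rsp": "https://example.com/anthropic/responsible-scaling-policy",
    "openai-preparedness": "https://example.com/openai/preparedness-framework",
}
STATE_FILE = pathlib.Path("policy_hashes.json")

def fetch_hash(url: str) -> str:
    """Download the policy page and return a content fingerprint."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def check_for_changes() -> list[str]:
    """Compare current fingerprints to the last stored ones and report drift."""
    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current, changed = {}, []
    for name, url in POLICY_SOURCES.items():
        current[name] = fetch_hash(url)
        if previous.get(name) and previous[name] != current[name]:
            changed.append(name)  # flag for human review and risk re-assessment
    STATE_FILE.write_text(json.dumps(current, indent=2))
    return changed

if __name__ == "__main__":
    for vendor in check_for_changes():
        print(f"Policy changed: {vendor} -- trigger third-party risk review")
```

A hash comparison will also fire on cosmetic page changes, so treat it as a prompt for review rather than an alarm in itself.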
The Friday Deadline: $200 Million, Two Red Lines, and a Government That Wants Them Erased
I covered the initial Pentagon meeting in my previous article. What escalated this week is the specificity of the threat.
Defense Secretary Hegseth gave Anthropic a Friday deadline: remove the safety guardrails or lose the $200 million Pentagon contract and face designation as a "supply chain risk." That last part is the real weapon. Being designated a supply chain risk by the Department of War effectively locks you out of the entire federal contracting ecosystem, not just the Pentagon.
And then, today, Anthropic published its official response. CEO Dario Amodei confirmed the company is holding firm on two specific lines: blocking Claude for mass domestic surveillance and fully autonomous weapons. The statement is remarkable for its directness. On surveillance, Amodei wrote that while foreign intelligence uses are acceptable, current law already allows the government to purchase detailed movement and browsing records without warrants, and that powerful AI could assemble this "into a comprehensive picture of any person's life, automatically." On autonomous weapons, Anthropic acknowledged these systems "may prove critical for national defense" but argued that frontier AI models "are simply not reliable enough" and that "proper guardrails, which don't exist today" are needed before deployment. The Department of War demanded Anthropic accept "any lawful use" and threatened removal from systems, designation as a supply chain risk, and invocation of the Defense Production Act. Anthropic's response: "we cannot in good conscience accede to their request."
I will be honest, I find it remarkable that we have reached a point where an AI company refusing to enable mass surveillance of Americans and autonomous weapons is considered a negotiable position by the U.S. government. But here we are. And regardless of where you stand on the policy question, Anthropic just became the first AI company to publicly refuse a direct Pentagon demand with a $200 million contract on the line. That takes conviction.
The thing is, this is not just an Anthropic story. It is a signal to every AI company in the defense supply chain: safety commitments may become a competitive disadvantage in government contracting. If you build guardrails, you may lose contracts. If you remove them, you may face regulatory consequences from a different part of the government. Companies are being pulled in two directions simultaneously, and that tension is going to reshape how AI governance works in practice.
For anyone building third-party risk frameworks for AI vendors, this should force a new question into your assessment: is the vendor under active government pressure to weaken safety controls? Because that is now a real risk factor, and I do not think any existing framework accounts for it.
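One way to make that question operational is to encode it as an explicit field in your vendor risk records, so it cannot be skipped during review. A minimal sketch, with illustrative field names and an invented example record; nothing here comes from an actual assessment.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AIVendorRiskRecord:
    """Illustrative third-party risk entry for an AI vendor; field names are assumptions."""
    vendor: str
    safety_policy_version: str
    safety_policy_last_reviewed: date
    under_government_pressure_to_weaken_safety: bool
    notes: str = ""

    def needs_escalation(self) -> bool:
        # Treat active pressure to weaken safety controls as an escalation trigger.
        return self.under_government_pressure_to_weaken_safety

record = AIVendorRiskRecord(
    vendor="Example Frontier Lab",
    safety_policy_version="RSP v3.0",
    safety_policy_last_reviewed=date(2026, 2, 26),
    under_government_pressure_to_weaken_safety=True,
    notes="Public reporting of contract pressure tied to guardrail removal.",
)
print(record.needs_escalation())
```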
CrowdStrike 2026: 27-Second Breakouts and an 89% Surge in AI-Enabled Attacks
AI-enabled adversary operations surged 89% year over year. The average eCrime breakout time fell to 29 minutes, which is 65% faster than 2024. But here is the number that should keep you up at night: the fastest observed breakout was 27 seconds. From initial access to lateral movement in 27 seconds. No human defender is responding to that.
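To make the gap concrete, here is some back-of-the-envelope arithmetic. The breakout figures are the ones above; the human triage and containment times are my own illustrative assumptions, not CrowdStrike measurements.

```python
# Compare observed breakout times against a typical human-driven response chain.
fastest_breakout_s = 27          # fastest observed breakout (CrowdStrike)
average_breakout_s = 29 * 60     # average eCrime breakout: 29 minutes

# Illustrative assumptions about a human SOC: triage and containment take minutes, not seconds.
human_triage_s = 10 * 60
human_containment_s = 30 * 60
human_total_s = human_triage_s + human_containment_s

print(f"Human response chain: ~{human_total_s / 60:.0f} minutes")
print(f"Fastest breakout: {fastest_breakout_s} seconds")
print(f"Attacker is inside and moving {human_total_s - fastest_breakout_s} seconds "
      "before a human-only process even finishes triage")
# The gap only closes with automated detection and containment measured in seconds.
```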
When I read this alongside yesterday's IBM X-Force report (44% increase in vulnerability exploitation as the number one attack vector) and the new OpenAI threat report, a picture emerges: three independent threat intelligence sources, in 48 hours, all confirmed that AI-augmented attacker scaling has moved from prediction to measured reality. This is no longer theoretical. The numbers are in.
Cloud-conscious intrusions were up 37% overall, with a 266% increase from state-nexus actors. That 266% number is significant because it tells you exactly where nation-states are directing their AI-augmented capabilities: your cloud infrastructure.
OpenAI's Confession: When ChatGPT Becomes a State Intimidation Tool
OpenAI published "Disrupting Malicious Uses" on February 25, and it might be the most unsettling threat report of the week for a completely different reason than CrowdStrike's.
OpenAI's own conclusion is telling: AI serves as a "force multiplier" for existing malicious strategies, not a standalone threat vector. This aligns exactly with what IBM said in the X-Force report and what CrowdStrike's data shows. Attackers are not inventing new playbooks. They are executing existing ones faster, cheaper, and at a scale that was previously impossible.
The thing is, this report demonstrates something compliance teams need to internalize: AI platform misuse is not just about prompt injection and jailbreaks. It is about state actors using legitimate AI tools, through legitimate accounts, to conduct operations that are harmful but do not necessarily trigger content safety filters. Impersonating a law firm is not the kind of query that trips a guardrail. Creating a fake obituary does not set off safety classifiers. The harm is in the aggregate pattern, not in any individual query. That is a fundamentally harder problem to govern than the jailbreak attacks we have been focused on.
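Governing that kind of misuse looks less like content filtering and more like session-level pattern analysis. Here is a rough sketch of the idea; the category labels, event format, and flagged combination are invented for illustration, not drawn from OpenAI's report.

```python
from collections import defaultdict

# Low-severity activity categories that are harmless alone but suspicious in combination.
# Category names and the flagged combination are illustrative assumptions.
SUSPICIOUS_COMBINATION = {"impersonate_legal_entity", "fabricate_personal_record", "target_research"}

def flag_accounts(events: list[dict], window_days: int = 30) -> set[str]:
    """Flag accounts whose recent activity covers the full suspicious combination."""
    by_account: dict[str, set[str]] = defaultdict(set)
    for event in events:
        if event["age_days"] <= window_days:
            by_account[event["account_id"]].add(event["category"])
    return {acct for acct, cats in by_account.items() if SUSPICIOUS_COMBINATION <= cats}

events = [
    {"account_id": "a1", "category": "impersonate_legal_entity", "age_days": 3},
    {"account_id": "a1", "category": "fabricate_personal_record", "age_days": 5},
    {"account_id": "a1", "category": "target_research", "age_days": 12},
    {"account_id": "a2", "category": "target_research", "age_days": 1},
]
print(flag_accounts(events))  # {'a1'}: no single event trips a filter, the pattern does
```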
Claude Code's Own CVEs and the Rise of Cloud-Isolated Agents
Two more developments from this week deserve attention, and they are connected in ways that matter for how we think about AI tool security.
The first is the Claude Code CVEs: the safety leader's own developer tool shipped vulnerabilities that could steal your API keys. I mentioned in my previous article that Claude Code Remote's architecture (short-lived credentials, end-to-end encryption, zero-knowledge) is how AI tools should be built. These CVEs demonstrate why that architecture matters: even well-built tools have vulnerabilities, and the difference between a credential theft and a contained incident comes down to how the credentials are scoped and how quickly they expire.
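That scoping-and-expiry point is worth making concrete. Below is a minimal sketch of short-lived, narrowly scoped credentials; the token format, scopes, and lifetimes are illustrative assumptions, not Claude Code Remote's actual mechanism.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass
class ScopedToken:
    value: str
    scopes: frozenset[str]
    expires_at: float

def issue_token(scopes: set[str], ttl_seconds: int = 900) -> ScopedToken:
    """Issue a credential limited to specific scopes and a short lifetime."""
    return ScopedToken(
        value=secrets.token_urlsafe(32),
        scopes=frozenset(scopes),
        expires_at=time.time() + ttl_seconds,
    )

def authorize(token: ScopedToken, required_scope: str) -> bool:
    """A stolen token is only useful within its scope and until it expires."""
    return required_scope in token.scopes and time.time() < token.expires_at

token = issue_token({"read:session"}, ttl_seconds=900)
print(authorize(token, "read:session"))   # True: within scope and lifetime
print(authorize(token, "write:billing"))  # False: scope was never granted
```

The point is not the token format; it is that a leaked credential under this model is worth fifteen minutes of a narrow permission, not indefinite access to everything.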
The second is that Anthropic acquired Vercept, a Seattle-based AI startup ($67M post-money valuation), to expand Claude's computer-use capabilities. Vercept's "Vy" product (a cloud-based desktop agent for macOS) will shut down March 25. This is Anthropic's second acquisition in three months (after Bun in December 2025). From a security perspective, expanding agentic computer-use capabilities without a published security specification covering sandboxing, file-system access, or audit logging is a gap that needs to be addressed.
What I see emerging is a new architectural divide in the agent market: cloud-isolated agents versus local-execution agents. Each has fundamentally different threat models. Cloud-isolated gives you centralized control, audit logging, and sandboxing, but introduces cloud trust and data residency questions. Local-execution gives you data privacy but dramatically increases the attack surface. Our compliance frameworks are going to need to account for this distinction, because the controls for each are very different.
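When it comes time to write controls, that divide can be encoded explicitly. Here is a sketch of how a baseline control mapping might differ by architecture; the control names are my own shorthand, not taken from any published framework.

```python
# Illustrative control baselines per agent architecture; names are shorthand, not a standard.
AGENT_CONTROLS = {
    "cloud_isolated": [
        "centralized_audit_logging",
        "provider_sandbox_review",
        "data_residency_assessment",
        "cloud_tenant_isolation_verification",
    ],
    "local_execution": [
        "endpoint_file_system_scoping",
        "local_credential_vaulting",
        "egress_network_restrictions",
        "host_level_activity_logging",
    ],
}

def required_controls(architecture: str) -> list[str]:
    """Return the baseline controls for a given agent architecture."""
    return AGENT_CONTROLS.get(architecture, [])

print(required_controls("local_execution"))
```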
Where Do We Go from Here?
At the end of the day, this week crystallized something that has been building for months. The industry's safety infrastructure is getting thinner at the exact moment the threat intelligence says it should be getting thicker.
Three independent threat reports in 48 hours, from CrowdStrike, IBM, and OpenAI, all confirmed that AI-augmented attacks are now a measured reality. 27-second breakouts. 89% year-over-year surge in AI-enabled operations. State actors using ChatGPT for intimidation campaigns. And the industry's response? The safety leader weakened its commitment, the Pentagon demanded guardrails be removed, and we discovered vulnerabilities in the safety leader's own developer tools.
I do not say this to be cynical. Anthropic's decision to introduce Frontier Safety Roadmaps and mandatory external reviews in RSP v3.0 shows they are still thinking seriously about safety governance. The new framework is not the absence of safety; it is a different approach to safety, one that tries to balance competitive reality with risk management. Whether that balance holds under sustained pressure is the question none of us can answer yet.
For compliance practitioners, the takeaways are concrete. Vendor safety policies are not static risk mitigators; they can change, and they need continuous monitoring. Government pressure on AI vendors is now a real third-party risk factor. Cloud-isolated versus local agent architectures require different control frameworks. And the convergence of three major threat reports confirming AI attack scaling means your detection and response controls need to assume 27-second breakout times, not 27-minute ones.
As always, we are trailblazers on this. Nobody has figured it out yet. But the gap between the threats we are measuring and the safety commitments we are losing is widening. And that gap, not any single vulnerability or policy change, is the real risk we need to govern.