The 2026 International AI Safety Report: Frameworks Doubled, But Defences Remain Bypassable

The second annual International AI Safety Report is out. Led by Yoshua Bengio with over 100 authors from more than 30 countries, it represents the broadest expert consensus on the state of AI safety. The headline finding: frontier safety frameworks have doubled since the first report in 2025, but they remain bypassable by sophisticated attackers.

That gap between the proliferation of safety measures and their actual effectiveness against determined adversaries is the central tension practitioners need to understand.

What does "frameworks doubled" actually mean?

The report catalogues formal safety frameworks published by frontier AI labs and governments since the first report. These include model evaluation protocols, red-teaming methodologies, deployment guardrails, and governance structures. The count has roughly doubled, driven by the EU AI Act's compliance requirements, voluntary commitments from AI labs, and new national AI safety institutes in several countries.

The quantity of frameworks is not the issue. The report's concern is quality and coverage. Many frameworks overlap in what they address (nearly every one includes a red-teaming protocol and a responsible disclosure policy) while leaving significant gaps in areas like agentic system safety, multi-model interaction risks, and long-horizon deployment monitoring.

For practitioners, the implication is that compliance with published frameworks does not equal safety. Following the checklist gets you through an audit. It does not protect you against novel attack vectors.

What does "bypassable by sophisticated attackers" mean?

The report distinguishes between casual misuse and sophisticated attacks. Current safety measures are effective against casual misuse — a user asking a model to do something obviously harmful will generally be refused. But attackers with technical skill and persistence can bypass these protections through several channels:

Prompt injection at scale. Safety training teaches models to refuse harmful requests, but adversarial prompting techniques continue to evolve faster than defences. The report cites research showing that automated red-teaming tools can find bypasses for most safety-trained models within hours of a new deployment.

Fine-tuning attacks. Open-weight models can be fine-tuned to remove safety training. Even for closed models, fine-tuning APIs can degrade safety behaviours if the fine-tuning data is adversarially constructed. The report notes that few frontier labs have robust protections against adversarial fine-tuning.

Multi-agent composition. When multiple AI systems interact, safety properties of individual models do not compose predictably. An agent that is individually safe can produce harmful outcomes when combined with other agents in a multi-step workflow. This attack surface grows as agent deployments scale.

Emergent capabilities. As models become more capable, they develop abilities that were not present in earlier versions and therefore were not tested for safety. The gap between capability and safety testing grows with each generation.
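The multi-agent composition point is easy to see in a toy example. The sketch below is purely illustrative and not taken from the report: the blocked phrase, the two agents, and the per-message check are all hypothetical. It shows two outputs that each pass a per-message safety check while their combination does not.

```python
# Toy illustration: per-message safety checks do not compose.
# The policy phrase and both "agents" are hypothetical stand-ins.
BLOCKED_PHRASE = "disable the audit log"


def message_is_safe(text: str) -> bool:
    """Per-message check: passes unless the blocked phrase appears in full."""
    return BLOCKED_PHRASE not in text.lower()


def agent_a() -> str:
    """First agent emits the start of an instruction."""
    return "Step 1: disable the"


def agent_b() -> str:
    """Second agent emits the remainder."""
    return "audit log"


def run_pipeline() -> str:
    """An orchestrator joins the agents' outputs into one instruction."""
    return " ".join([agent_a(), agent_b()])


parts = [agent_a(), agent_b()]
each_part_safe = all(message_is_safe(p) for p in parts)
combined_safe = message_is_safe(run_pipeline())
print(each_part_safe, combined_safe)  # prints: True False
```

Real composition failures are subtler than a split string, but the structure is the same: a check applied to each component says nothing about the assembled result, so the composite output needs its own check.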

What are the report's key recommendations?

The report makes several recommendations, three of which are most relevant for practitioners:

Continuous evaluation, not point-in-time testing. Safety evaluation should be ongoing throughout deployment, not just conducted before launch. Models behave differently in production than in testing environments, and the threat landscape evolves continuously.

Agentic safety as a distinct discipline. The report calls for dedicated safety frameworks for AI agents, separate from the model-level safety frameworks that dominate current practice. Agent safety involves tool use restrictions, permission management, multi-step reasoning oversight, and inter-agent communication security — none of which are adequately covered by model-level evaluations.

International coordination on red-teaming. The report proposes shared vulnerability databases and coordinated disclosure processes for AI safety vulnerabilities, similar to the CVE system for software security. Currently, when one lab discovers a bypass technique, there is no systematic mechanism to warn other labs.
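To make the agentic safety recommendation more concrete, here is a minimal sketch of tool use restrictions and permission management enforced outside the model. Everything in it is an assumption for illustration: the agent names, the tool names, the `ALLOWED_TOOLS` table, and the `authorize` and `execute` helpers do not come from the report or from any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation requested by an agent (hypothetical structure)."""
    agent: str
    tool: str
    argument: str


# Hypothetical allow-list: each agent may call only the tools listed for it.
ALLOWED_TOOLS = {
    "researcher": {"web_search", "read_file"},
    "deployer": {"read_file", "run_deploy"},
}


def authorize(call: ToolCall) -> bool:
    """Deny by default: unknown agents and unlisted tools are rejected."""
    return call.tool in ALLOWED_TOOLS.get(call.agent, set())


def execute(call: ToolCall, audit_log: list) -> str:
    """Record every decision, then run the tool only if it is permitted."""
    allowed = authorize(call)
    audit_log.append((call.agent, call.tool, "allowed" if allowed else "denied"))
    if not allowed:
        raise PermissionError(f"{call.agent} may not call {call.tool}")
    return f"{call.tool} executed"  # placeholder for the real tool dispatch


log = []
print(execute(ToolCall("researcher", "web_search", "AI safety report"), log))
```

The design choice worth noting is that the check lives in the application, not in the prompt. An agent that has been manipulated into requesting `run_deploy` is still stopped if its role does not include that tool, and the denial is logged.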

What should practitioners do with this information?

The report validates a principle that security-conscious teams already know: defence in depth is the only viable strategy. Model-level safety is one layer. Application-level controls — input validation, output filtering, permission management, audit logging — are separate layers that catch what model-level safety misses.
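As a rough sketch of what one application-level layer might look like, the example below filters model output against a small rule set before it reaches a user or triggers an action. The rule names and patterns are hypothetical placeholders, not criteria from the report; a real deployment would substitute its own risk criteria and would likely need more than regular expressions.

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    allowed: bool
    reasons: list


# Hypothetical risk criteria: replace with rules specific to your application.
RULES = {
    "possible_api_key": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
    "destructive_shell_command": re.compile(r"\brm\s+-rf\s+/"),
}


def check_output(text: str) -> Verdict:
    """Return which rules the model output trips; allow only if none do."""
    reasons = [name for name, pattern in RULES.items() if pattern.search(text)]
    return Verdict(allowed=not reasons, reasons=reasons)


def deliver(model_output: str) -> str:
    """Gate model output behind the external check before releasing it."""
    verdict = check_output(model_output)
    if not verdict.allowed:
        return "Response withheld: " + ", ".join(verdict.reasons)
    return model_output


print(deliver("Run rm -rf / to free up space."))
```

The filter runs regardless of whether the model itself refused or complied, which is the point of treating it as a separate layer.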

Three concrete actions:

Implement application-level guardrails independent of model safety. Do not rely on the model to refuse harmful actions. Build external validation that checks model outputs against your specific risk criteria before they reach users or take effect in the world.

Monitor for safety degradation in production. Set up automated monitoring that flags unusual patterns in model behaviour — unexpected refusals, policy-violating outputs, anomalous tool use patterns. Production behaviour drifts from testing behaviour, and the drift can include safety degradation.

Treat multi-agent safety as a first-class concern. If your system involves multiple AI models interacting, explicitly design and test the safety properties of the composite system. Do not assume that safe individual components produce a safe system.
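The monitoring action can be sketched as a rolling-window rate check. This is a minimal illustration under stated assumptions: the `RateMonitor` class, the baseline rate, the window size, and the tolerance are all hypothetical, and "flagged" stands in for whatever signal you track (a refusal, a policy-violating output, an unexpected tool call).

```python
from collections import deque


class RateMonitor:
    """Flags drift when the recent rate of flagged events leaves a baseline band."""

    def __init__(self, baseline_rate: float, window: int = 100, tolerance: float = 0.05):
        self.baseline_rate = baseline_rate  # rate observed during testing
        self.tolerance = tolerance          # allowed absolute deviation
        self.events = deque(maxlen=window)  # most recent observations only

    def record(self, flagged: bool) -> bool:
        """Record one response; return True if the window rate has drifted."""
        self.events.append(flagged)
        if len(self.events) < self.events.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.events) / len(self.events)
        return abs(rate - self.baseline_rate) > self.tolerance


monitor = RateMonitor(baseline_rate=0.02, window=10, tolerance=0.1)
observations = [False] * 10 + [True, True]
alerts = [monitor.record(flagged) for flagged in observations]
print(alerts[-1])  # prints: True
```

Because the check is two-sided, it alerts on a drop as well as a spike. With a higher baseline, a sudden fall in refusals would trip it, which can signal safety degradation just as a rise in policy-violating outputs can.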

The 2026 report is more useful than its predecessor because it identifies specific gaps rather than broadly calling for caution. For teams building production AI systems, it serves as a checklist of attack surfaces that your safety architecture should address.
