Anthropic publishes its constitution ↗ along with research about where the constitution works and where it does not. The current version is an ethical treatise addressed to Claude, covering safety, ethics, Anthropic’s guidelines, and helpfulness, in that order of priority when they conflict. Anthropic favours cultivating good values and judgment over strict rules.

In 2026, Anthropic’s operational judgment failed twice in the same way, leading to the leak of the Claude Code source code ↗. The constitution asks Claude to imagine how a “thoughtful senior Anthropic employee would react”, but what happens when the organisation’s structure fails?

I co-develop the Perseverance Composition Engine ↗ (PCE), an open-source multi-agent AI system that addresses this problem with a structural approach called Artificial Organisations ↗. PCE assumes its agents cannot be relied on to be honest, harmless, or helpful, and structures the system so that the inevitable bad behaviour doesn’t reach the output.

How PCE works

PCE coordinates five specialised agents. A document passes through three of them in sequence: a Composer drafts from source materials, a Corroborator fact-checks the draft against those sources, and a Critic evaluates the result without seeing the sources. Two more handle the boundary: a Concierge manages the user dialogue and project specification, and a Curator maintains the document catalogue, supplying sources to the composition agents and archiving accepted work. Each agent has a single objective, minimal permissions, and access to only the information it needs. The Critic can’t see the sources, so it can’t rationalise away a weak claim by pointing to them. The Composer can’t see the scoring criteria, so it can’t game them.

```mermaid
flowchart LR
    User((User))
    subgraph pce["PCE · async"]
        direction LR
        Concierge["Concierge<br/>user boundary"]
        Composer["Composer<br/>drafts"]
        Corroborator["Corroborator<br/>source access"]
        Critic["Critic<br/>no source access"]
        Curator[("Curator<br/>memory · archive")]
    end
    User <--> Concierge
    Concierge -- task --> Composer
    Curator -- sources --> Composer
    Curator -- sources --> Corroborator
    Composer -- draft --> Corroborator
    Corroborator -- substantiated --> Critic
    Corroborator -. "fabricated · revise" .-> Composer
    Critic -- "score ≥ τ" --> Curator
    Critic -. "score < τ · revise" .-> Composer
    classDef agent fill:#fff,stroke:#333,stroke-width:1.5px;
    classDef hasSrc stroke:#080,stroke-width:2px;
    classDef noSrc stroke:#c00,stroke-width:2px;
    classDef store fill:#eef,stroke:#557,stroke-width:1px;
    class Concierge,Composer agent;
    class Corroborator hasSrc;
    class Critic noSrc;
    class Curator store;
```

Solid arrows are forward flow. Dashed arrows are revision feedback. The async boundary separates user-facing dialogue (Concierge) from the long-running composition workflow. Convergence threshold τ is typically 85.
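For readers who prefer code to diagrams, here is a minimal sketch of that loop. The agent interfaces, method names, and the MAX_ROUNDS cap are my own hypothetical shorthand; the real PCE workflow is asynchronous and message-driven rather than a single function.

```ts
interface ComposerAgent {
  draft(task: string, sources: string[], feedback: string[]): Promise<string>;
}
interface CorroboratorAgent {
  check(draft: string, sources: string[]): Promise<{ substantiated: boolean; issues: string[] }>;
}
interface CriticAgent {
  evaluate(draft: string): Promise<{ score: number; feedback: string }>;
}

declare const composer: ComposerAgent;
declare const corroborator: CorroboratorAgent;
declare const critic: CriticAgent;

const TAU = 85;       // convergence threshold τ
const MAX_ROUNDS = 8; // give up rather than loop forever (hypothetical cap)

async function compose(task: string, sources: string[]): Promise<string> {
  let feedback: string[] = [];
  for (let round = 0; round < MAX_ROUNDS; round++) {
    // Composer sees sources and accumulated feedback, never the scoring criteria.
    const draft = await composer.draft(task, sources, feedback);

    // Corroborator sees draft + sources; its only job is finding discrepancies.
    const verdict = await corroborator.check(draft, sources);
    if (!verdict.substantiated) {
      feedback = verdict.issues;           // the "fabricated · revise" edge
      continue;
    }

    // Critic sees the draft but not the sources.
    const review = await critic.evaluate(draft);
    if (review.score >= TAU) return draft; // accepted: hand off to the Curator
    feedback = [review.feedback];          // the "score < τ · revise" edge
  }
  throw new Error("draft did not converge");
}
```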

Information partition

| Agent | Role | Document access |
| --- | --- | --- |
| Concierge | User boundary; clarifies underspecified requests; only agent permitted user dialogue | Full |
| Curator | Institutional memory: catalogue, metadata, classification, archival | Full database |
| Composer | Drafts text from sources and specifications | PUBLIC + CANDIDATE |
| Corroborator | Verifies factual substantiation against sources | PUBLIC + CANDIDATE + DRAFT |
| Critic | Evaluates argumentative quality; scores 0–100 against τ | PUBLIC + DRAFT + FEEDBACK (no CANDIDATE) |

The key is the Critic’s lack of access to the source documents, much as a blind reviewer in real institutions never sees the author’s raw materials. The Corroborator can verify claims because it sees both draft and sources; the Critic cannot rationalise weak arguments by appealing to sources it has never seen.
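A sketch of how such a partition can be enforced mechanically rather than by trusting the agents. The document classes come straight from the table above; fetchDocument and the backing store are hypothetical illustrations, not PCE’s actual API.

```ts
type DocumentClass = "PUBLIC" | "CANDIDATE" | "DRAFT" | "FEEDBACK";
type AgentName = "Concierge" | "Curator" | "Composer" | "Corroborator" | "Critic";

// The access table as data. Full access for the boundary agents;
// the Critic conspicuously lacks CANDIDATE (the sources).
const ACCESS: Record<AgentName, readonly DocumentClass[]> = {
  Concierge:    ["PUBLIC", "CANDIDATE", "DRAFT", "FEEDBACK"],
  Curator:      ["PUBLIC", "CANDIDATE", "DRAFT", "FEEDBACK"],
  Composer:     ["PUBLIC", "CANDIDATE"],
  Corroborator: ["PUBLIC", "CANDIDATE", "DRAFT"],
  Critic:       ["PUBLIC", "DRAFT", "FEEDBACK"],
};

declare const store: { read(cls: DocumentClass, id: string): string };

// The partition is enforced at retrieval time; an agent cannot talk its way
// past a thrown error the way it might talk its way past an instruction.
function fetchDocument(agent: AgentName, cls: DocumentClass, id: string): string {
  if (!ACCESS[agent].includes(cls)) {
    throw new Error(`${agent} may not read ${cls} documents`);
  }
  return store.read(cls, id);
}
```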

Where you put the safety

Constitutional AI locates safety in the agent. Train the agent well enough, give it clear principles, and it should behave. The problem is that agents under pressure to produce output, operating in unfamiliar domains, or balancing conflicting objectives still make things up. They are very good at finding locally convenient solutions that technically satisfy the rules while violating their spirit, a failure mode known as specification gaming.

PCE locates safety in the structure around the agents. The Corroborator has sources in front of it and one job: find discrepancies. If the Composer invented a claim, the Corroborator will see the absence in the sources. The Critic evaluates the output against its own scoring criteria without knowing what the sources said, so it can’t excuse a vague passage by noting the sources were thin. Three independent agents would all have to make the same mistake in the same direction for a fabrication to ship.
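Expressed as code, the Corroborator’s whole job reduces to a filter over claims. The extractClaims and supportedBy helpers below are assumptions for illustration (in practice these would themselves be model calls), not functions PCE actually exposes.

```ts
interface Claim { text: string; location: number }

// Hypothetical helpers: pull checkable claims from a draft, and test whether
// a claim traces to some passage in the sources.
declare function extractClaims(draft: string): Claim[];
declare function supportedBy(claim: Claim, sources: string[]): boolean;

function corroborate(draft: string, sources: string[]): { substantiated: boolean; issues: string[] } {
  const issues = extractClaims(draft)
    .filter((claim) => !supportedBy(claim, sources))
    .map((claim) => `unsupported at ${claim.location}: "${claim.text}"`);
  // An invented claim shows up as an absence in the sources. The Corroborator
  // has no other objective that could tempt it to wave the claim through.
  return { substantiated: issues.length === 0, issues };
}
```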

The constitutional approach asks agents to balance honesty, harmlessness, and helpfulness simultaneously. The objectives frequently conflict, so the agent must find a trade-off in real time. In practice, this produces outputs that satisfy all three criteria superficially: plausible, inoffensive, and vaguely on-topic. PCE resolves the conflict by splitting it up. The Composer worries about coherence, the Corroborator worries about truth, the Critic worries about quality. Each agent is single-minded, and the pipeline handles conflict resolution.

PCE inherits every improvement to the underlying models, and better alignment is always welcome. But it doesn’t require well-aligned agents. I regularly put a weaker or less aligned model in a PCE role, and the structure still prevents fabrication from reaching the output. The Composer only needs to produce coherent text from sources; the structure does the safety work.

What the leak revealed

This was the second time ↗ in 13 months that the same vulnerability was exploited. A structural approach would have implemented the checks that the constitution simply assumes will be present.

A constitutional agent deployed inside a structural pipeline gets the benefit of both. Good training reduces the load on the verification stages — fewer errors to catch means faster throughput and lower cost. Structural constraints catch the cases where training fails, which it sometimes does regardless of how good the training is.

The leak also revealed capabilities at odds with the constitution’s values. Undercover Mode ↗ was designed to conceal AI authorship in open-source contributions, with no force-OFF option ↗. The constitution values transparency and honesty, yet here was a feature for concealment built into the product itself, something no amount of constitutional training can override.

Meanwhile, an ordinary prompt is entrusted with security-critical behaviour:

Leaked prompt
```ts
export const CYBER_RISK_INSTRUCTION = `IMPORTANT: Assist with authorized
security testing, defensive security, CTF challenges, and educational contexts.
Refuse requests for destructive techniques, DoS attacks, mass targeting,
supply chain compromise, or detection evasion for malicious purposes. Dual-use
security tools (C2 frameworks, credential testing, exploit development)
require clear authorization context: pentesting engagements, CTF competitions,
security research, or defensive use cases.`
```
Anthropic, CYBER_RISK_INSTRUCTION (Claude Code), 2026 · source

This is security by mere suggestion, as if wishing could make it so, and a very poor idea. The security prompt competes for the model’s attention with everything else in the conversation, and the longer the conversation runs, the less weight it carries.
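To make the contrast concrete, here is what a structural version of the same policy could look like. This is an illustrative sketch, not a description of how Claude Code actually works: the capability names are lifted from the leaked prompt, but the gate itself is hypothetical.

```ts
// Categories from the leaked prompt, recast as data the runtime checks.
const REFUSED = new Set([
  "destructive-techniques",
  "dos-attack",
  "mass-targeting",
  "supply-chain-compromise",
  "malicious-detection-evasion",
]);

const DUAL_USE = new Set(["c2-framework", "credential-testing", "exploit-development"]);

interface ToolCall {
  capability: string;
  authorizationContext?: string; // pentest engagement, CTF, research, defence
}

// Runs before every tool invocation, outside the model's context window,
// so conversation length cannot dilute it.
function authorize(call: ToolCall): void {
  if (REFUSED.has(call.capability)) {
    throw new Error(`refused: ${call.capability}`); // not attention-weighted
  }
  if (DUAL_USE.has(call.capability) && !call.authorizationContext) {
    throw new Error(`refused: ${call.capability} without authorization context`);
  }
}
```

A gate like this fires identically on turn one and turn five hundred, which is precisely the property a prompt cannot offer.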

The constitution places Anthropic at the top of a priority chain, so when Claude faces conflicting instructions, Anthropic’s values override the user’s. The leak reveals no structural mechanism for verifying that Anthropic itself follows the values it encodes, only the assumption that it will. This is the same security model we have seen for years from all tech companies, who ask users to trust them because they say they are trustworthy. It has the effectiveness and standing of a marketing statement.

An old idea

The safety problem is a problem of institutions. We have had millennia to refine the knowledge that reliable collective behaviour comes from structure, not from hoping that individuals will be virtuous. Separation of powers, independent audit, role specialisation: the technical name is information partition, and the idea is well understood. Weber ↗ wrote about role specialisation and separation of duties in bureaucracies. Parnas ↗ wrote about information hiding in software systems. March and Simon ↗ gave us bounded rationality, where each role has only the information relevant to its function.

PCE applies these ideas to LLM agents, and it works rather well.