There are many problems with what billions of people perceive to be AI in 2026, not least sustainability in many senses.
But the Large Language Model type of AI is widely used, and is known to be unreliable and sometimes unsafe. All of the large AI companies are following more-or-less the same approaches to solving these problems.
Since February 2026 I have been working with colleagues to bring a very different approach into being. We do not try to persuade computers to behave better with optional guardrails and better training, because nobody seems to know how to do that. Instead, we apply thousands of years of human experience to the task by putting these strange new systems into organisations and telling them only what they strictly need to know to do their job. This is the concept of Artificial Organisations, and so far it is working remarkably well.
I am working on the public core of the Perseverance Composition Engine (PCE) behind the scenes, and creating agentic servers for the Model Context Protocol. I don’t like the brave new world of AI systems as sold by the giant companies, and PCE seems to be one way to make it more useful and less dangerous. Many problems remain, but this does appear to be addressing some of the biggest ones. I have also explored the human stories around PCE.
Addressing the Biggest Problems in AI
The Perseverance Composition Engine (PCE) approaches the most pressing problems in AI from a different perspective. PCE does not try to make LLMs behave better. Instead, PCE applies familiar structure from human organisations so their inevitable misbehaviour is detected and corrected.
PCE works by assigning a task to a pipeline of LLM agents, each with a carefully enforced role to play. The agents iterate between each other until either the task is completed to specifications, or it fails honestly. This arrangement detects and corrects common problems such as confident false assertions, hallucinations, or dangerous advice.
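The iterate-until-verified loop can be sketched in a few lines. This is an illustrative outline, not the PCE codebase: the names `run_pipeline`, `compose`, and `verify` are hypothetical stand-ins for the agent calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    text: str
    passed: bool
    feedback: str

def run_pipeline(task: str,
                 compose: Callable[[str, str], str],
                 verify: Callable[[str], Result],
                 max_rounds: int = 5) -> Result:
    """Iterate draft -> review until the draft passes, or fail honestly."""
    feedback = ""
    for _ in range(max_rounds):
        draft = compose(task, feedback)   # Composer drafts (or revises)
        result = verify(draft)            # reviewer agents check the draft
        if result.passed:
            return result                 # task completed to specification
        feedback = result.feedback        # feed criticism back for revision
    # Honest failure: report the unmet criticism rather than ship a bad draft
    return Result(text="", passed=False, feedback=feedback)
```

The key design point is the final line: when the reviewers cannot be satisfied, the loop terminates with an explicit failure instead of emitting the last unverified draft.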
The Problem: Three Fundamental Failures
Current AI systems have three recognised problems that training and instruction cannot fully fix:
1. Correctness — Confabulation as a Context Problem
AI models can sound confident while being completely wrong. They invent facts, cite sources that don’t exist, and present plausible-sounding information that is false. This happens not because the model is “badly trained,” but because language models generate text probabilistically — they predict the next word based on patterns in their training data. A model with poor context is forced to generate from a poor prior, leading inevitably to confabulation.
The critical insight: confabulation is partly an architectural context problem, not purely a model limitation. A model with access to good context — the actual documents, prior decisions, relevant background — generates output from a much better prior and produces things that make more sense. The context effectively is the prior on the output.
Research surveys keep concluding that training does not eliminate hallucination, and that the most effective strategies combine multiple complementary techniques. Hallucinations are described as potentially “fundamental mathematical inevitabilities inherent to [the model’s] architecture”. This underscores that confabulation cannot be solved by better instruction alone — the architecture must change.
2. Context — Indexing and Retrieval are Poor
Poor context leads directly to confabulation. Even frontier models with million-token context windows (Gemini 2.5 Pro, GPT-4.1, Claude Sonnet 4) become unreliable well before reaching their advertised limits. Information in the middle of a context window achieves lower accuracy than content at the beginning or end.
The familiar result: models cannot reliably search through large document collections, lose information, miss connections, and fail to incorporate relevant material. They cannot find what they need to build a good prior on their probabilistic output.
PCE solves this by structuring a persistent knowledge base: documents are permanently stored, version-controlled, and indexed for full-text search. When a new task arrives, the system retrieves prior work rather than asking the model to reconstruct from scratch. The model operates from a good prior — the actual context — instead of having to guess.
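The retrieve-rather-than-reconstruct step can be pictured with a toy full-text index. PCE's actual store also adds permanent storage and version control; the `DocumentStore` class below is a minimal illustrative sketch of the search side only.

```python
import re
from collections import defaultdict

class DocumentStore:
    """Toy knowledge base: store documents once, find them by keyword."""
    def __init__(self):
        self.docs: dict[str, str] = {}                 # doc_id -> full text
        self.index: dict[str, set] = defaultdict(set)  # word -> doc_ids

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(doc_id)

    def search(self, query: str) -> list[str]:
        """Return doc_ids containing every query word (simple AND search)."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self.index[w] for w in words))
        return sorted(hits)
```

A new task calls `search` first, so the model's context is filled with retrieved prior work rather than reconstructed guesses.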
3. Memory and Self-Awareness — No Structured Persistence
Individual AI models cannot maintain structured state or learn from feedback across separate tasks. Each conversation starts fresh. They cannot build on prior work, track decisions already made, or correct accumulated errors across an organisation’s history.
More subtly: agents often lack critical knowledge about their own state. Has the context been compacted? What time is it? What tools are currently available? This self-awareness gap is also a context problem — the agent lacks a good prior on its own situation, not just on the task material. Without this awareness, the agent cannot reason effectively about what it does and doesn’t know.
Some recent LLM systems (Claude, Goose) have begun adding memory across conversations. The difference with PCE is not that we have memory and they don’t — it’s that ours is structured, curated, and searchable. The Curator maintains an institutional record with semantic structure, not a bag of remembered facts. This structured memory allows the system to distinguish what has been decided from what has been guessed, what has been verified from what is tentative.
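One way to picture "semantic structure, not a bag of remembered facts" is a memory record that carries provenance alongside content. This is a hypothetical sketch, not the Curator's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    DECIDED = "decided"      # formally agreed, with a recorded decision
    VERIFIED = "verified"    # checked against sources by a reviewer
    TENTATIVE = "tentative"  # asserted but not yet checked

@dataclass(frozen=True)
class MemoryEntry:
    claim: str
    status: Status
    source: str  # where the claim came from (document id, decision record)

def reliable(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Filter for what has been decided or verified, excluding guesses."""
    return [e for e in entries if e.status is not Status.TENTATIVE]
```

Because every entry records its status and origin, a future task can build only on the decided and verified record rather than on whatever happens to be remembered.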
Why This Matters
If you deploy AI to write policy documents, draft contracts, analyse research data, or make operational decisions, these three failures mean:
- Your AI will produce confident-sounding but false claims
- It will miss critical information you’ve provided
- It will repeat mistakes across your organisation
No amount of instruction (“be careful,” “check your work,” “don’t make things up”) fixes this. These are not character flaws. They are architectural limitations.
Agentic AI is Insufficient
The mainstream response to these problems is agentic AI: systems where multiple LLM calls are chained together, often with access to tools, web search, or document retrieval. This is a real improvement over a single model call. Retrieval-Augmented Generation (RAG) helps with context by fetching relevant documents before generating — and PCE uses RAG too.
The difference is what happens next. The fundamental limitation of existing agentic approaches is that they lack enforced information barriers. Agents share context freely, which means the biases and errors of one agent propagate to the next rather than being filtered out. More critically: the same agent that retrieves sources also writes claims and evaluates its own output. There is no structural guarantee that the synthesis step is independently checked.
Confident fabrication in the synthesis step passes straight through, because nothing in the architecture is specifically designed to catch it. Cooperation is the default; adversarial review is not.
PCE is not a better agentic framework. It is a different design philosophy: structure first, capability second.
The PCE Approach: Institutional Structure, Not Individual Alignment
We do not try to build a perfectly trustworthy AI. Instead, we structure the system so that bad behaviour is impossible or detectable. We use the same design logic that human institutions have used for centuries: separation of duties, adversarial review, and information compartmentalisation. We explicitly draw on experience with human organisations, where the people do not need to be perfect because there is structure to ensure their behaviour is acceptable.
Read our research on this approach.
The Composition Architecture
The Perseverance Composition Engine (PCE) implements this institutional logic in code. The standard composition workflow routes through five agents, but the architecture is a directed graph with feedback loops, not a fixed assembly line. Tasks can be routed to individual agents or through the full pipeline depending on requirements. The safety argument rests on the ability to enforce structural constraints — information barriers, verification gates, audit trails — regardless of which path a task takes.
1. Composer drafts text from source materials.
- Task: produce coherent, source-grounded work
- Access: sources only, not evaluation criteria
- Assumption: the Composer is not responsible for judging its own correctness
2. Corroborator fact-checks independently.
- Task: verify every claim against sources
- Access: sources and draft, working from the sources
- Role: catch fabrication before it reaches the reader
- Cannot be fooled by the Composer’s confidence — it has the sources right in front of it
3. Critic evaluates quality and safety.
- Task: judge whether the output meets standards for coherence, safety, and audience fit
- Access: the draft and evaluation rubrics, but not the sources
- Why no sources? Because reviewers who see the sources tend toward “lazy evaluation” — they assume claims are right because the sources support them, without thinking critically. Blind review forces genuine evaluation.
4. Censor checks appropriateness for audience and context.
- Task: verify that the output is appropriate for its intended recipient and use case
- Access: the draft and context about the recipient
- Role: catch outputs that are factually correct and well-argued but contextually inappropriate
- Example: a job application letter that was factually accurate and well-written, but mentioned a private research programme that was completely inappropriate for that particular recipient. The Corroborator passed it (true), the Critic passed it (well-argued), but the Censor caught the mismatch between content and audience.
5. Curator publishes and maintains institutional memory.
- Task: file, index, and make the work discoverable to future agents and users
- Role: maintain structured transactive memory — the system remembers what has been done, who decided it, and why
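The information barriers above can be made concrete in code: each role is called with only the inputs its description allows, so the barrier is enforced by the call signature rather than by instruction. All names here are illustrative, not the PCE API:

```python
from typing import Callable, Optional

def compose_workflow(
    sources: list[str],
    rubric: str,
    recipient: str,
    composer: Callable[[list[str]], str],            # sees sources only
    corroborator: Callable[[list[str], str], bool],  # sees sources + draft
    critic: Callable[[str, str], bool],              # sees draft + rubric, NOT sources
    censor: Callable[[str, str], bool],              # sees draft + recipient context
    curate: Callable[[str], None],                   # files the approved draft
) -> Optional[str]:
    draft = composer(sources)
    if not corroborator(sources, draft):  # fact-check against sources
        return None
    if not critic(draft, rubric):         # blind quality/safety review
        return None
    if not censor(draft, recipient):      # audience-appropriateness check
        return None
    curate(draft)                         # publish into institutional memory
    return draft
```

Because the Critic's callable never receives `sources`, lazy evaluation against the sources is impossible by construction, and nothing reaches the Curator without passing all three gates.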
How the Architecture Addresses Each Problem
Correctness is addressed by the Corroborator: an independent agent with direct access to sources can detect when claims lack support, catching fabrication before publication. The model operating from good context (the sources) generates a better prior and makes fewer unfounded claims. The Corroborator then verifies what was generated, catching what slipped through.
Context is addressed by the persistent document store: the Curator indexes and version-controls all materials so that future agents retrieve rather than reconstruct. When a new task arrives, the system searches prior work rather than starting from scratch, ensuring relevant context is incorporated. Models operate from actual priors, not guesses.
Memory and Self-Awareness are addressed by the same institutional record. Documents are permanently stored and indexed for full-text search. Prior decisions, drafts, reasoning, and verification results are preserved and searchable across tasks, giving the system the organisational memory that individual models lack. Each composition task leaves a structured trace that future work can build on. The system knows what has been decided and can reason about its own state.
Why This Works
Each agent has one clear objective, not three competing goals to balance. No single agent can see the complete picture, so no single agent can rationalise away problems. The Composer cannot declare its own draft acceptable (it never sees the evaluation criteria). The Critic cannot say “the sources must support this” (it has not seen them). The Censor cannot be swayed by arguments about truth or quality; it evaluates audience fit and nothing else.
The three problems are now structural, not psychological:
- Fabrication is caught because the Corroborator, working from sources, will find the gap between what was claimed and what the sources say
- Missed information is caught because the Corroborator sees all the sources and the Critic evaluates the result fresh
- Inappropriate outputs are caught before release because the Censor checks for audience fit independently of truth and quality
- Consistency is enforced because feedback loops are structural — output cannot advance without passing verification gates
Empirical Evidence
We have built and operated this system on real work. Data from composition tasks across multiple organisations is reported in the Southampton ePrints paper. The findings that matter:
- The Corroborator detects fabrication in 52% of drafts — cases where confident-sounding claims lack evidence
- Iterative feedback through the pipeline produces a 79% improvement in argumentative quality, as assessed by the Critic
- Under impossible task constraints (where no correct answer exists), the system progresses from attempted fabrication toward honest refusal — collective behaviour that was neither explicitly instructed nor individually incentivised
These figures represent operational data from live systems, not controlled laboratory conditions. The methodology and detailed breakdown appear in the full paper.
Why This Matters for AI Users
You don’t have to trust the AI. You have to trust the structure.
This approach is independent of how good or bad the underlying model is. A less capable model (e.g., smaller, cheaper) embedded in this architecture can outperform a more capable model operating alone. The structure does the work.
This is auditable. You can inspect the pipeline, verify that information barriers are enforced, and prove that critical decisions pass through verification. That is far easier than auditing a neural network’s internal weights, a highly technical task that even the giant AI companies cannot fully perform on their own models.
This is scalable. The same institutional structure — separation of duties, adversarial review, information partition — applies to policy analysis, research synthesis, operational decisions, or any task where correctness matters and you cannot afford confident hallucination.
Early Days, But Grounded in Working Code
The safety and correctness problems the AI companies are struggling with might eventually be fixed, although none of them are yet sounding particularly confident. Certainly there are many very capable people working on the problem, and it is early days in a new field. The hallucination research cited above demonstrates that the field is actively developing mitigations, and context window engineering continues to improve. Our argument is not that these problems are unsolvable in principle, but that structural intervention provides measurable benefits today, independent of ongoing improvements to underlying models.
We have ways of addressing all three problems that so far give better results, grounded in testable theory and working code.
Further Reading
- PCE Academic Core (open source): codeberg.org/leithdocs/persevere
- Artificial Organisations Research: Southampton ePrints
- Hallucination Surveys: Huang et al. (2024) on hallucination causes and mitigations; 2025 update on architectural perspectives
- Organisational Theory for AI: References to Weber (1922) on bureaucratic structure, Parnas (1972) on information hiding, March & Simon (1958) on bounded rationality, and Galbraith (1974) on organisational design appear in the Southampton ePrints paper above