There are many problems with the AI billions of people use in 2026, discussed endlessly at all levels of society. At the end of 2025 I became interested in the particular problems of ethics and reliability, and in the common approach all of the large AI companies take to safety and predictability.

A colleague started working on a very different approach from these companies, and since February 2026 I have been contributing to and using prototype versions of the Artificial Organisations concept. This article explains why I believe Artificial Organisations are a promising new direction in multi-agent agentic AI, as described here by the UK government. If you want to try them for yourself, the core research code is available, and I use several such organisations daily.

The Biggest Problems in Using AI

The Perseverance Composition Engine (PCE) uses Artificial Organisations to address these pressing AI problems. PCE does not try to make LLMs behave better; instead it is designed so that their inevitable misbehaviour is detected and corrected. And regardless of the computer science, if you have read Iain M. Banks novels or played the Mass Effect game you have met this idea before.

PCE works by assigning a task to LLM agents, each of which has a carefully enforced role to play. The agents iterate between each other until either the task is completed to specification, or it fails by honestly saying “I can’t do this, the task is impossible for me.” So far, this arrangement seems effective at detecting and correcting common problems such as confident false assertions, hallucinations, and dangerous advice. With PCE, nobody needs to trust an AI, only the structure. The structure is recognisable to most people, since it is closely modelled on ones tried and tested for centuries. Like any organisation, Artificial Organisations have separation of duties, independent checks, and agents who can only see what they need to see. It works rather well.
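The iterate-until-verified loop can be sketched in a few lines. This is a minimal illustration, not the actual PCE implementation: the function names and the toy review rule are hypothetical stand-ins for real LLM agent calls.

```python
# Illustrative sketch of a PCE-style iterate-until-verified loop.
# All functions here are hypothetical stand-ins for LLM agent calls.

def compose(task, sources, feedback):
    """Stand-in for the Composer agent: draft text from sources."""
    return f"Draft for {task!r} (addressing: {feedback or 'nothing yet'})"

def review(draft, sources):
    """Stand-in for the checking agents: return a list of problems found."""
    # Toy rule: a draft with no supporting sources can never pass.
    return [] if sources else ["no sources were provided to support the draft"]

def run_task(task, sources, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        draft = compose(task, sources, feedback)
        problems = review(draft, sources)
        if not problems:
            return {"status": "completed", "output": draft}
        feedback = problems  # feed the objections back for another round

    # Honest refusal rather than shipping unverified output.
    return {"status": "refused", "reason": "task could not be completed to spec"}

result = run_task("summarise the board minutes", sources=["minutes.txt"])
print(result["status"])  # completed
```

The key design point is the final branch: when the checks cannot be satisfied, the loop terminates with an explicit refusal instead of emitting the best-looking failed draft.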

This design addresses three failure modes that the usual training and instruction cannot fully fix: hallucination, context issues, and memory issues.

Hallucination and nonsense

Language models generate text according to probability: the next piece of text (a ‘token’) is selected based on patterns, not by retrieving facts from a database. If a model does not have a pool of highly relevant text to draw on (the ‘context’), it will probabilistically generate text anyway, because that is what it is programmed to do. The result is confabulation, where the model sounds confident while making a false or misleading claim. The better AIs become at expressing themselves, the more convincing these hallucinations become.

Research keeps concluding that training does not eliminate hallucination, and newer surveys describe hallucinations as potentially “fundamental mathematical inevitabilities inherent to [the model’s] architecture.” The AI companies are trying to solve this by giving better instruction and training, but if hallucination is indeed inevitable then this will never be reliable. I am persuaded the architecture needs to change for AI to become more trustworthy.

Context input to a model is called the ‘prior’. A quality prior comprises the best available documents, previous relevant decisions, and germane background, and from it an AI generates much better output. Just like a human organisation, Artificial Organisations strive to work from the best-quality input documents in order to improve decision-making, and to carefully label or even reject guesswork. This is the first structural way we can tackle hallucinations.

A second technique is also familiar: have someone else check the work. PCE has an agent called the Corroborator whose only job is to read what the Composer agent wrote and verify every claim against the source documents. The Corroborator has the sources right in front of it, so if the Composer invented a claim, the Corroborator will see it is unsupported. The Corroborator is unmoved by plausible confabulation, because it is instructed to accept only what can be proven from the sources to hand, including references on the internet if it has been instructed to use them.
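The shape of that check can be illustrated with a toy heuristic. In the sketch below a crude word-overlap score stands in for the Corroborator’s LLM judgement; the function name and threshold are hypothetical, chosen only to show the principle of accepting nothing the sources cannot support.

```python
# Illustrative sketch of source-grounded claim checking.
# A toy word-overlap heuristic stands in for the Corroborator's LLM judgement.

def supported(claim, sources, threshold=0.7):
    """Accept a claim only if enough of its words appear in some one source."""
    claim_words = set(claim.lower().split())
    for source in sources:
        source_words = set(source.lower().split())
        overlap = len(claim_words & source_words) / len(claim_words)
        if overlap >= threshold:
            return True
    return False  # unsupported: plausibility alone is not enough

sources = ["The committee approved the budget on 12 March."]
print(supported("The committee approved the budget on 12 March.", sources))  # True
print(supported("The committee rejected the budget entirely.", sources))     # False
```

The second claim sounds just as confident as the first, but it fails because the sources do not back it, which is exactly the distinction the Corroborator is built to enforce.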

Context indexing is poor

Within any given AI conversation or session, the AI’s Context Window is the largest amount of information it can consider at once, measured in tokens of a few characters each. The most advanced AI models available in 2026 with million-plus token context windows (e.g. Llama 4 Scout, Kimi 3.0, GPT-5.4, Claude Opus 4.6) can hold maybe 1,000 pages of text at once, or about one millionth the storage capacity of a modern phone. An AI is doing far more with that text than a phone can possibly do, but this limit is currently a major cause of AI unreliability. Even worse, models become more distractable and error-prone as their context window fills towards its maximum. When the window is full they lose information, miss obvious connections, and fail to incorporate relevant material, because they can’t find what they need to build a good prior for their output. If AIs had a better index of the things they already know, their output would be more reliable.

This is a retrieval problem, and we know human organisations are good at finding information if it is kept in a library or a database. In contrast, asking a model to reconstruct context from scratch is asking it to guess (and it will). PCE solves this with a persistent knowledge base: documents are stored permanently, version-controlled, and indexed for full-text search. When a new task arrives, the system retrieves prior work rather than asking the model to start from scratch. The Curator is the agent responsible for this institutional memory: it files, indexes, and makes everything findable. The model operates from plentiful, correct context retrieved from this database. Such a database is not a new idea, but our use of it fits neatly alongside other familiar concepts from physical organisations, inheriting all the well-known behaviours of information that passes from the library department to other departments, or to members of the public.
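A minimal version of such a store can be built with SQLite’s built-in full-text search. The schema and documents below are hypothetical and the sketch omits version control; it assumes a Python build whose SQLite includes the FTS5 extension (almost all do).

```python
# Illustrative sketch of a persistent, searchable knowledge base
# using SQLite's built-in full-text search (FTS5).
import sqlite3

db = sqlite3.connect(":memory:")  # use a file path for real persistence
db.execute("CREATE VIRTUAL TABLE documents USING fts5(title, body)")

# The Curator files prior work so future tasks can retrieve it.
db.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("2026 budget decision", "The committee approved the budget on 12 March."),
        ("Hiring policy", "All roles require two independent interviews."),
    ],
)

# A new task retrieves relevant prior work instead of asking the model to guess.
rows = db.execute(
    "SELECT title FROM documents WHERE documents MATCH ?", ("budget",)
).fetchall()
print(rows)  # [('2026 budget decision',)]
```

The retrieved titles and bodies then become part of the prior handed to the agents, rather than material the model has to reconstruct from memory.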

AI is amnesiac and lacks self-awareness

Between conversations, an AI has forgotten everything that has happened since it was trained. It can’t build on prior work, track decisions already made, or correct accumulated errors across an organisation’s history. Considered merely as a computer, an AI is like something from the 1980s: after power-on you need to instruct it in what it knows and what it can do, and that is exactly what happens with every chatbot before you can even type in a request. If a chatbot appears to remember what you were discussing last week, a lot of work has been done behind the scenes to load the stored information back into the AI’s context window so you can continue chatting seamlessly. Part of this boot-up work is dedicated to instructing the chatbot to be more reliable, not to say dangerous things, and so on. In multi-agent agentic AI, that means each agent must be booted up with an initial context containing the instructions and data relevant to its task.

More subtly, agents lack awareness of their own state — what tools are available, what time it is, whether the context recently filled up and needed to be compacted down into a summary. Without that awareness an agent can’t reason effectively about what it knows and doesn’t know.
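One simple remedy is to make that state explicit and inject it into each agent’s boot-up context. The sketch below is illustrative only; the field names and wording are hypothetical, not PCE’s actual format.

```python
# Illustrative sketch: give an agent explicit awareness of its own state
# by rendering it into the boot-up context. Field names are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentState:
    role: str
    tools: list = field(default_factory=list)
    context_was_compacted: bool = False  # was earlier context summarised?

    def boot_context(self) -> str:
        """Render the state as text the model sees at start-up."""
        return (
            f"You are the {self.role}. "
            f"Current time: {datetime.now(timezone.utc).isoformat()}. "
            f"Available tools: {', '.join(self.tools) or 'none'}. "
            f"Earlier context was summarised: {self.context_was_compacted}."
        )

state = AgentState(role="Corroborator", tools=["search_sources"])
print(state.boot_context())
```

With its role, tools, clock, and compaction history stated in plain text, the agent can reason about what it knows and doesn’t know instead of guessing.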

The rigidity of Artificial Organisations comes from computer code that starts up, instructs, and monitors each agent, even though the controlling brain of the organisation is itself another agent. This computer code is akin to the rules and laws that define how we expect human organisations to behave: an organisation has a CEO, but there are still limits on what the CEO can do, because the legal system cares about enforcing the rules, not keeping CEOs happy.

The persistent knowledge base of Artificial Organisations also addresses this problem of amnesia between AI conversations. Prior decisions, drafts, reasoning, and verification results are preserved and searchable across sessions and between agents. Each composition task leaves a structured trace that future work can build on, so the system knows what has been decided and can reason about its own state. Some recent agentic systems (Claude Code, Goose) have begun adding memory across conversations, but the difference is that PCE’s memory is structured, curated, and searchable — an institutional record with semantic structure, not a bag of remembered facts.

Agentic AI doesn’t do permissions

The mainstream response to AI’s problems is agentic AI, connecting multiple AI agents together, often with access to tools, web search, or document retrieval. This is an improvement over a single model call like a chatbot conversation.

The limitation of existing agentic approaches is that agents share context freely, so the biases and errors of one agent propagate to the next. The same agent that retrieves sources also writes claims and evaluates its own output. There is no structural guarantee that the work is independently checked, so fabrication in the synthesis step passes straight through. Cooperation is the default; adversarial review is not.

PCE takes a different approach: structure first, capability second.

The composition architecture

The Perseverance Composition Engine implements this organisational logic in code. Context is not shared freely any more than we expect medical records to be available to all workers in a hospital. The standard PCE workflow involves five agents. Tasks can be delegated to individual agents or through the full PCE pipeline depending on requirements.

The Composer drafts text from source materials. It reads the documents and writes coherent prose, but does not see the evaluation criteria — like an author who writes the report but doesn’t set the exam questions.

The Corroborator fact-checks independently, reading both the sources and the draft. Its single task is to verify every claim and catch fabrication before it reaches the reader. The Composer and Corroborator both see the source documents, but the Corroborator is instructed only to accept what the sources support. If the Composer invented something, the Corroborator will see the gap.

The Critic evaluates quality and safety. It reads the draft and the evaluation rubrics, but not the sources. This sounds backwards, but when reviewers see the sources they tend toward lazy evaluation, assuming claims are right because the sources seem to support them. Blind review, familiar from academic peer review, forces the Critic to evaluate what’s actually on the page.

The Censor checks appropriateness for the intended audience. A document can be factually correct and well-argued but contextually wrong for its recipient. We once had a job application letter that was accurate and well-written, but mentioned a private research programme that was inappropriate for that particular employer. The Corroborator passed it (true), the Critic passed it (well-argued), and the Censor caught the mismatch between content and audience.

The Curator publishes and maintains institutional memory, filing, indexing, and making work discoverable to future agents and users. The Curator is the librarian of the organisation, maintaining the structured record of what has been done, who decided it, and why.

No single agent can see the complete picture, so no single agent can rationalise away problems. The Composer can’t excuse unsourced claims by pointing to the sources (it doesn’t see them). The Critic can’t say “the sources must support this” (it hasn’t seen them). The Censor can’t be pressured by arguments about truth or quality — it only evaluates fit. Output cannot advance without passing through these verification gates.
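These information barriers are a property of the pipeline code, not of agent good behaviour, so they can be written down as a visibility table. The sketch below is a hypothetical simplification: the role names follow the article, but the resource names and enforcement mechanism are illustrative.

```python
# Illustrative sketch of information barriers: the pipeline code, not the
# agents, decides what each role may see. Resource names are hypothetical.

VISIBILITY = {
    "Composer":     {"task", "sources"},             # no evaluation criteria
    "Corroborator": {"sources", "draft"},            # checks claims vs sources
    "Critic":       {"draft", "rubrics"},            # blind review: no sources
    "Censor":       {"draft", "audience_profile"},   # fit for the recipient
    "Curator":      {"draft", "verdicts"},           # files the record
}

def context_for(role, workspace):
    """Return only the materials this role is permitted to see."""
    allowed = VISIBILITY[role]
    return {name: doc for name, doc in workspace.items() if name in allowed}

workspace = {
    "task": "summarise the minutes",
    "sources": ["minutes.txt"],
    "draft": "The committee approved the budget.",
    "rubrics": ["clarity", "safety"],
}
print(context_for("Critic", workspace))  # draft and rubrics only, never sources
```

Because each agent is handed a pre-filtered workspace rather than trusted to look away, the barriers hold even if an individual agent misbehaves.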

Read our research on this approach.

The status today

PCE users don’t have to trust the AI, just the structure.

This approach is independent of how good or bad the underlying model is. A less capable, cheaper model embedded in this architecture can outperform a more capable model operating alone, because the cheaper model iterates around the organisational loop and the structure does the work. You can inspect the pipeline, verify that information barriers are enforced, and confirm that decisions pass through verification, which is much easier than auditing a neural network’s internal weights. And the same organisational structure applies to policy analysis, research synthesis, operational decisions, or any task where correctness matters.

Here is summary data from composition tasks across multiple organisations, reported in the Southampton ePrints paper:

  • The Corroborator detects fabrication in 52% of drafts — cases where confident-sounding claims lack evidence
  • Iterative feedback through the pipeline produces 79% quality improvement in argumentative quality as assessed by the Critic
  • Under impossible task constraints (where no correct answer exists), the system progresses from attempted fabrication toward honest refusal — collective behaviour that was neither explicitly instructed nor individually incentivised

These figures represent operational data from live systems. The methodology and detailed breakdown appear in the full paper.

The hallucination research cited above demonstrates that the field is actively developing mitigations, and context window engineering continues to improve. Our argument is that structural intervention provides measurable benefits today, independent of ongoing improvements to underlying models.


Further Reading