This is a technical note about a problem that is going to bite agentic AI users soon.
AI is slow, and agentic AI is even slower. I develop an MCP server that generates PDF documents, and I work with the Agentic Perseverance Composition Engine daily, and AI seems so, so slow. There’s so much waiting, and every mistake means yet more sitting around. Tasks we know take maybe 5 microseconds on an operating system (e.g., does a file called Things-to-Do exist?) can take a million times longer – between 2 and 5 seconds. This is because the big brain in the cloud is being consulted multiple times, often with timeouts. It’s a young, unstable and unreliable stack, rather like the early days of MS-DOS or the Apple ][. When AI gets hold of the data on your computer via an MCP server it can do some very interesting things, but it is not put together well.
The slowness is hiding something: the usual idea of “wait two years, the computers will be way faster” probably won’t apply here. AI inference is getting faster, both through almost magical optimisation techniques and through fixing the immature engineering. Databases and operating systems learned many years ago that when speed increases, lurking contention becomes visible, and often becomes the new bottleneck.
The LLM/MCP ecosystem is still in the “slow enough to be simple” phase, but that isn’t going to last long.
We’ve been here before
When two-phase locking was invented, disk I/O dominated transaction time, at 100 milliseconds or more per seek (today we deal in microseconds, 1000 times faster). Lock hold times were essentially free relative to mechanical latency, so nobody worried about them. Then hardware got faster and hidden contention became the new problem. In 2010, on 48-core systems, Linux was achieving only 60% of linear scalability because of the big kernel lock. FreeBSD had already eliminated its equivalent “Giant” lock seven years earlier, in 2003, so the comparison was embarrassing. Faster hardware makes the synchronisation tax worse.
Databases went through the same painful arc. MVCC was invented in 1978 but remained a theoretical curiosity until systems moved to in-memory processing. Once disk I/O was out of the picture, the result was “an even higher degree of concurrency and a higher degree of lock contention”. The slowness had been acting as a natural throttle and nobody noticed until it was gone.
Limping along without knowing it
There’s a concept from distributed systems research called limplock: hardware that degrades silently while the cluster treats it as healthy, causing the whole system to crawl without ever triggering a failover. Current LLM systems aren’t literally failing, but the effect on the system is the same. The latency is throttling everything, keeping it away from the states where coordination failures would start cascading.
And that is where MCP is today. I find it helps to consider what to use instead of MCP, because building a stack on top of this hidden latency is asking for apps that will need to be rewritten again and again as LLMs get more efficient.
What the current architecture isn’t solving
Most MCP setups use a single central LLM as the orchestrator. Research on Context-Aware MCP has already identified the direct consequences: repeated inference calls for every subtask create significant computational overhead, and the fixed context window forces full-context submission from all servers simultaneously, causing “context loss between steps and slower response times”. On top of that, MCP tool integration imposes substantial token-processing overhead that today’s latency simply swamps.
And adding more agents doesn’t straightforwardly fix things. One study tested 180 configurations across five canonical architectures and found a consistent tool-coordination trade-off: tool-heavy tasks suffer disproportionately from multi-agent overhead under fixed budgets, and independent agents amplify errors 17.2× compared to 4.4× under centralised coordination.
None of this is surprising in principle. It’s just invisible in practice due to the very high latency.
What breaks, and when it may break
Latency thresholds are tricky to pin down precisely, but order-of-magnitude inflection points are useful. The Doherty threshold from 1982 found that user productivity jumps once system response time drops below roughly 400ms; later human-factors work puts the “feels instantaneous” bar at about 100ms, above which interaction feels like waiting. This has held up under more recent scrutiny. For LLM serving specifically, this maps onto Time to First Token targets: under 200ms feels snappy for chat, under 100ms is expected for code completion. Current systems typically live well above these numbers, which is why I say they are so very slow.
When end-to-end latency drops below ~100ms, the central LLM planner will likely be the bottleneck. Amdahl’s Law suggests that if planning is serialised and planning is the slow step, speeding up tool execution achieves very little overall. Faster responses also mean more tool calls per unit time, exhausting connection pools and making the absence of backpressure in many MCP servers a real problem.
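The Amdahl’s Law point can be made concrete with a back-of-envelope sketch. The 40% serial planning fraction below is an illustrative assumption, not a measurement:

```python
def amdahl_speedup(serial_fraction: float, parallel_factor: float) -> float:
    """Amdahl's Law: overall speedup when only the parallelisable part gets faster."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_factor)

# Suppose the central planner (serialised) accounts for 40% of wall-clock time.
# Even infinitely fast tool execution then caps the overall speedup at 1/0.4 = 2.5x.
for factor in (2, 10, 1000):
    print(f"tools {factor:>4}x faster -> workflow {amdahl_speedup(0.4, factor):.2f}x faster")
```

Making tools 1000x faster buys barely more than making them 10x faster; the serial planner dominates either way.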
When inter-token latency drops below ~10ms, multi-agent systems will need explicit coordination protocols, but most current designs have none. They rely implicitly on the LLM’s sequential processing to provide ordering. That is going to fail much as a single-master database fails when write throughput increases enough.
Below ~1ms, shared context stores become hot spots. Without concurrency control, shared context becomes a global lock — the MCP equivalent of the BKL. There’s also the metastability problem: systems tuned for today’s latency profiles can have hidden capacity that evaporates suddenly as speed increases, triggering an overload loop that prevents recovery.
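Simple queueing arithmetic shows why today’s latency hides the shared-context hot spot. All the rates and hold times below are illustrative assumptions chosen to match the round-trip figures earlier in this note:

```python
# Utilisation of a single shared lock: rho = arrival_rate * hold_time.
# As rho approaches 1, queueing delay explodes; above 1, the queue grows without bound.

def lock_utilisation(calls_per_sec_per_agent: float, agents: int, hold_time_s: float) -> float:
    return calls_per_sec_per_agent * agents * hold_time_s

# Today: ~0.3 calls/s per agent (a 2-5 s round trip), 10 agents, 5 ms per context write.
print(lock_utilisation(0.3, 10, 0.005))   # 0.015 -> contention is invisible

# Sub-millisecond inference: ~500 calls/s per agent, same 10 agents, same lock.
print(lock_utilisation(500, 10, 0.005))   # 25 -> far past saturation: the context store is the new BKL
```

Nothing about the lock changed; only the arrival rate did. That is the sense in which today’s slowness is a hidden throttle.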
Fixing all this
For immediate improvements before the big-picture problems are solved, I feel we should start with MCP implementations. There are also the user interfaces, and there I see the Goose agentic interface has some potential easy wins. But this is something the industry needs to sort out, because fortunes are spent on improving models but relatively little on how they deliver the benefit to users.
Following are the more fundamental fixes:
MVCC for LLM context. The database solution was to keep old versions so readers don’t block writers. The equivalent here is agent-scoped snapshots of shared state, written atomically at task boundaries. The trouble is that today’s context representations have no natural key-value decomposition; they are just big blobs, so you can’t version them. CA-MCP is already exploring this.
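A minimal sketch of what agent-scoped snapshots could look like, assuming the decomposition problem were solved and context could be treated as a key-value store (the `VersionedContext` name and API are hypothetical, not from CA-MCP):

```python
from dataclasses import dataclass, field
from types import MappingProxyType

@dataclass
class VersionedContext:
    """MVCC-style context store: readers hold an immutable snapshot,
    writers publish a whole new version atomically at a task boundary."""
    _versions: list = field(default_factory=lambda: [{}])

    def snapshot(self):
        # Agents read a frozen view; later commits never mutate it.
        return MappingProxyType(self._versions[-1])

    def commit(self, updates: dict) -> int:
        # Copy-on-write: old versions stay valid for in-flight readers.
        self._versions.append({**self._versions[-1], **updates})
        return len(self._versions) - 1

ctx = VersionedContext()
snap = ctx.snapshot()            # agent A starts reading version 0
ctx.commit({"plan": "step-2"})   # agent B publishes version 1 concurrently
print(dict(snap))                # -> {} : A's snapshot is unaffected
print(dict(ctx.snapshot()))      # -> {'plan': 'step-2'}
```

The point of the sketch is the reader/writer decoupling: agent A is never blocked by, and never observes, agent B’s mid-task writes.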
Per-agent context partitioning. Linux moved from a global kernel lock to per-VMA locks. The equivalent for MCP is replacing a single shared context store with partitioned contexts owned by individual agents, merged explicitly at aggregation points. This requires context ownership to be agreed at task-decomposition time — a constraint current LLM planners simply don’t impose.
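In code, the partitioning idea is straightforward; what’s hard is getting planners to assign ownership up front. A hypothetical sketch (the `PartitionedContext` API is mine, not any existing MCP library):

```python
class PartitionedContext:
    """Per-agent context partitions with an explicit merge at aggregation points.
    Ownership is fixed at task-decomposition time."""

    def __init__(self, owners: list[str]):
        self._parts = {owner: {} for owner in owners}

    def write(self, agent: str, key: str, value) -> None:
        # An agent only ever touches its own partition: no shared lock needed.
        self._parts[agent][key] = value

    def merge(self) -> dict:
        # The explicit aggregation point: the only place partitions meet.
        merged = {}
        for agent, part in self._parts.items():
            for key, value in part.items():
                merged[f"{agent}/{key}"] = value  # namespacing avoids key clashes
        return merged

ctx = PartitionedContext(["planner", "researcher"])
ctx.write("planner", "next_step", "summarise")
ctx.write("researcher", "sources", ["doc-1", "doc-2"])
print(ctx.merge())
```

This mirrors the per-VMA-lock move: contention disappears because writes can’t collide by construction, and the merge cost is paid once, visibly, at a known point.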
Async tool execution. Issuing tool calls speculatively, before the planner has confirmed they’re needed, is the MCP equivalent of out-of-order execution. It would meaningfully reduce latency in multi-step workflows. The obstacle is that many MCP server implementations don’t support clean cancellation, which you need to make speculative execution safe. This is MS-DOS sophistication, remember!
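The speculative pattern itself is easy to express with `asyncio`; the hard part is that the cancelled call must actually stop doing work server-side. A sketch with a stand-in tool (the `fetch_tool_result` function is a placeholder, not a real MCP call):

```python
import asyncio

async def fetch_tool_result(name: str, delay: float) -> str:
    # Stand-in for an MCP tool call. A real server would need clean
    # cancellation semantics for speculative issue to be safe.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main() -> str:
    # Speculatively launch both candidate tool calls before the planner decides.
    tasks = {name: asyncio.create_task(fetch_tool_result(name, 0.05))
             for name in ("search", "read_file")}

    chosen = "search"            # the planner's (later) decision
    result = await tasks[chosen]

    # Cancel the speculative loser: the step many MCP servers can't do cleanly.
    for name, task in tasks.items():
        if name != chosen:
            task.cancel()
    return result

print(asyncio.run(main()))       # prints "search: done"
```

Without reliable cancellation, every mispredicted speculation leaves a zombie tool call holding connections and doing wasted work, which is exactly the backpressure problem from earlier.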
Coordination-aware architecture selection. There’s already a framework for predicting when adding agents helps versus hurts, based on task decomposability and error propagation. Choosing your architecture based on task structure rather than defaulting to a single universal pattern is doable today.