AI is slow, and Agentic AI ↗ is even slower. I develop an MCP server ↗ that generates PDF documents, and I work with the Agentic Perseverance Composition Engine daily. Tasks that take maybe 5 microseconds on an operating system (e.g., does a file called Things-to-Do exist?) can take a million times longer – between 2 and 5 seconds – because each operation requires multiple round trips to a remote LLM, often with timeouts. It’s a young, unstable stack, comparable in maturity to early MS-DOS or the Apple ][. When AI gets hold of your data via an MCP server it can do interesting things, but the stack is not put together well.

The slowness is hiding something. The usual assumption – wait two years and the computers will be way faster – probably won’t help here. AI inference is getting faster, through almost magical techniques ↗ and through plain improvements to the immature engineering. But databases and operating systems learned many years ago that when speed increases, lurking contention ↗ becomes visible and often becomes the new bottleneck.

The LLM/MCP ecosystem is still in the “slow enough to be simple” phase, but that isn’t going to last long.

We’ve been here before

When two-phase locking ↗ was invented, disk I/O dominated transaction time at 100 milliseconds or more per seek (today’s storage deals in microseconds, a thousand times faster). Lock hold times were essentially free relative to mechanical latency, so nobody worried about them. Then hardware got faster and hidden contention became the new problem. In 2010, Linux on 48-core systems was achieving only 60% of linear scalability ↗ due to contention across directory caches, memory-management locks, and per-inode serialisation. FreeBSD’s SMPng project had been incrementally eliminating its equivalent “Giant” lock since 2001, so the comparison was uncomfortable. Faster hardware makes the synchronisation tax worse ↗ .

Databases went through the same painful arc. MVCC was invented in 1978 ↗ and has been in production since the mid-1980s (InterBase, PostgreSQL, Oracle), but disk I/O masked its lock-contention costs. Once systems moved to in-memory processing ↗ and disk I/O was out of the picture, the result was “an even higher degree of concurrency and a higher degree of lock contention ↗ ”. The slowness had been acting as a natural throttle, and nobody noticed until it was gone.

The LLM/MCP ecosystem is in the same position. The latency is throttling everything, keeping it away from the states where coordination failures would start cascading ↗ . Building a stack on top of this hidden latency is asking for apps that will need to be rewritten again and again as LLMs get more efficient.

What the current architecture isn’t solving

Most MCP setups use a single central LLM as the orchestrator. Repeated inference calls for every subtask create significant computational overhead, and the fixed context window forces full-context submission from all servers simultaneously, causing “context loss between steps and slower response times ↗ ”. On top of that, MCP tool integration imposes substantial token-processing overhead ↗ of its own.

And adding more agents doesn’t straightforwardly fix things. Shao et al. ↗ tested 180 configurations across five canonical architectures and found a consistent tool-coordination trade-off: tool-heavy tasks suffer disproportionately from multi-agent overhead under fixed budgets.

All of this is invisible in practice because the latency swamps it.

Today’s bottleneck is the serial single-planner architecture – an Amdahl’s Law ↗ problem, where speeding up tool execution buys little because planning is the serial step. The OS and database analogies describe what happens next: when the serial bottleneck is broken by multi-agent or pipelined architectures, the shared-state contention that databases and kernels already solved will appear here too. Both stages are coming.
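A back-of-envelope Amdahl’s Law calculation makes the ceiling concrete. The 40% planning share below is an assumption for illustration, not a measurement, but the shape holds for any serial fraction:

```python
# Amdahl's Law: overall speedup when only the parallelisable part
# (tool execution) gets faster and the serial part (planning) doesn't.
# The 40% serial share is an assumption for illustration.

def amdahl_speedup(serial_fraction: float, tool_speedup: float) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / tool_speedup)

SERIAL = 0.4  # assume planning is 40% of wall-clock time
for s in (2, 10, 100, 1_000_000):
    print(f"tools {s:>9,}x faster -> workflow {amdahl_speedup(SERIAL, s):.2f}x faster")
# Even infinitely fast tools cap the whole workflow at 1/0.4 = 2.5x.
```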

What breaks as inference gets faster

Robert Miller’s 1968 study ↗ established that below roughly 100ms, interaction feels instantaneous to humans. The Doherty threshold ↗ (Doherty & Thadhani, 1982) later showed that below 400ms, productivity increases dramatically because neither human nor computer is waiting for the other. Both thresholds have held up under more recent scrutiny ↗ . This maps onto Time to First Token targets for LLM serving: under 200ms feels snappy for chat, under 100ms is expected for code completion. Current systems live well above either threshold.

The specific numbers depend on the architecture, workload, and tool-call graph, none of which I’m modelling here. Inference latency is also not the same as end-to-end latency: an agentic workflow’s wall-clock time includes tool round-trips, network hops, and plan-step count, none of which inference speed touches. Still, as inference latency drops, three categories of problem emerge, in roughly this order:

First, the serial planner becomes the visible bottleneck. When inference is fast enough that tool calls dominate wall-clock time, the single-LLM orchestrator is exposed as the serialisation point. Faster responses also mean more tool calls per unit time, exhausting connection pools and making the absence of backpressure ↗ in many MCP servers a problem.
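As a minimal sketch of what backpressure could look like – assuming nothing about any particular MCP SDK, with `call_tool` and every number a stand-in – bound the concurrent work with a semaphore and reject new requests once too much is queued:

```python
import asyncio

MAX_IN_FLIGHT = 8   # tool calls actually executing
QUEUE_LIMIT = 32    # admitted work, running or waiting

async def call_tool(name: str) -> str:
    await asyncio.sleep(0.05)  # pretend to do real work
    return f"{name}: done"

class BackpressuredServer:
    def __init__(self) -> None:
        self.in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)
        self.admitted = 0

    async def call(self, name: str) -> str:
        if self.admitted >= QUEUE_LIMIT:
            # Shed load early; the client can back off and retry.
            raise RuntimeError("overloaded, try again later")
        self.admitted += 1
        try:
            async with self.in_flight:  # at most 8 calls run at once
                return await call_tool(name)
        finally:
            self.admitted -= 1

async def main() -> None:
    server = BackpressuredServer()
    results = await asyncio.gather(
        *(server.call(f"task-{i}") for i in range(50)),
        return_exceptions=True,
    )
    done = sum(isinstance(r, str) for r in results)
    print(f"{done} completed, {50 - done} shed")  # 32 completed, 18 shed

asyncio.run(main())
```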

Second, multi-agent coordination needs explicit protocols. Most current multi-agent designs rely implicitly on the LLM’s sequential processing to provide ordering. As inference latency drops, that implicit ordering stops being reliable — the same transition a single-master database goes through when write throughput increases enough. Explicit coordination protocols are needed, and most current designs have none.
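One concrete form an explicit protocol can take is optimistic concurrency: each agent records the version it read, and a write against a stale version is rejected instead of silently landing last. A minimal sketch, with the store and its API invented for illustration:

```python
# Optimistic writes over shared state: a stale write fails loudly
# and the agent must re-read and retry, rather than depending on
# whichever write happens to arrive last.

class VersionedStore:
    def __init__(self) -> None:
        self.value: dict = {}
        self.version = 0

    def read(self) -> tuple[int, dict]:
        return self.version, dict(self.value)

    def compare_and_set(self, expected_version: int, new_value: dict) -> bool:
        if expected_version != self.version:
            return False  # another agent wrote first; re-read and retry
        self.value = new_value
        self.version += 1
        return True

store = VersionedStore()
v_a, _ = store.read()  # agent A reads at version 0
v_b, _ = store.read()  # agent B reads at version 0
assert store.compare_and_set(v_a, {"plan": "A's draft"})      # A wins
assert not store.compare_and_set(v_b, {"plan": "B's draft"})  # B must retry
```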

Third, shared context stores become hot spots. Without concurrency control, a shared context store accessed by multiple fast agents has no mechanism to resolve conflicting writes. There’s also the metastability problem: temporary overload becomes permanent because recovery attempts from multiple agents add load faster than the system can shed it. Systems tuned for today’s latency profiles can have hidden capacity that evaporates suddenly as speed increases, triggering an overload loop that prevents recovery ↗ .
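The standard countermeasure is to make retries shed load rather than amplify it: capped exponential backoff with jitter, plus a retry budget that eventually gives up. A sketch, with `flaky_call` standing in for any overloaded dependency:

```python
import random
import time

def flaky_call() -> str:
    if random.random() < 0.5:  # simulated overloaded dependency
        raise TimeoutError("overloaded")
    return "ok"

def call_with_backoff(max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            return flaky_call()
        except TimeoutError:
            # Full jitter: sleep a random amount up to a capped backoff,
            # so a crowd of agents doesn't retry in lockstep.
            time.sleep(random.uniform(0, min(2 ** attempt, 10)))
    # Retry budget exhausted: stop adding load and surface the failure.
    raise TimeoutError("retry budget exhausted, shedding load")

try:
    print(call_with_backoff())
except TimeoutError as e:
    print(e)
```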

Fixing all this

For immediate improvements, start with the MCP implementations themselves. The Goose agentic interface ↗ has some easy wins on the UI side. Fortunes are spent on improved models; relatively little goes to the stack that delivers them to users.

The following are research directions being explored as the stack matures. Async tool execution is the only immediately actionable one; the others are very much in development.

MVCC for LLM context. The database solution was to keep old versions so readers don’t block writers. The equivalent here is agent-scoped snapshots of shared state, written atomically at task boundaries. Today’s context representations have no natural key-value decomposition — they are big blobs without versioning — so this is not yet implementable in a general way. CA-MCP is already exploring the problem ↗ .
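Assuming the decomposition problem were solved, the mechanics are well understood. A minimal sketch of snapshot reads over a keyed context store – readers pin a version and are never blocked by writers; every name here is illustrative:

```python
# MVCC-style context store: writers append new versions, readers
# see a consistent snapshot as of the version they pinned.

class MVCCContext:
    def __init__(self) -> None:
        self._versions: dict[str, list[tuple[int, str]]] = {}
        self._clock = 0

    def snapshot(self) -> int:
        return self._clock  # an agent reads everything as of this point

    def read(self, key: str, snapshot: int) -> str | None:
        for version, value in reversed(self._versions.get(key, [])):
            if version <= snapshot:
                return value
        return None

    def write(self, key: str, value: str) -> None:
        self._clock += 1  # atomically, at a task boundary
        self._versions.setdefault(key, []).append((self._clock, value))

ctx = MVCCContext()
ctx.write("outline", "v1")
snap = ctx.snapshot()        # agent A pins a snapshot
ctx.write("outline", "v2")   # agent B writes concurrently
assert ctx.read("outline", snap) == "v1"            # A's view is stable
assert ctx.read("outline", ctx.snapshot()) == "v2"  # new readers see v2
```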

Per-agent context partitioning. Linux moved from a global kernel lock to per-VMA locks ↗ . The equivalent for MCP is replacing a single shared context store with partitioned contexts owned by individual agents, merged explicitly at aggregation points. This requires context ownership to be agreed at task-decomposition time — a constraint current LLM planners don’t impose, and a design problem that needs solving before partitioning is practical.
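A sketch of the partition-then-merge shape, assuming ownership really is fixed at decomposition time; all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    owner: str
    data: dict = field(default_factory=dict)

    def write(self, key: str, value: str) -> None:
        self.data[key] = value  # only the owning agent writes here

def merge(partitions: list[AgentContext]) -> dict:
    """The explicit aggregation point: the one place conflicts
    are resolved. Namespacing by owner is the simplest policy."""
    merged: dict = {}
    for p in partitions:
        for key, value in p.data.items():
            merged[f"{p.owner}/{key}"] = value
    return merged

research = AgentContext("research")
layout = AgentContext("layout")
research.write("sources", "3 papers found")  # no lock needed:
layout.write("page_size", "A4")              # disjoint partitions
print(merge([research, layout]))
# {'research/sources': '3 papers found', 'layout/page_size': 'A4'}
```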

Async tool execution. Issuing tool calls speculatively before the planner has confirmed they’re needed is the MCP equivalent of out-of-order execution. It would meaningfully reduce latency in multi-step workflows. The obstacle is that most MCP server implementations are not yet mature enough to support clean cancellation, which you need to make speculative execution safe.
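A minimal asyncio sketch of the idea: start a likely-needed tool call before the plan is confirmed, and cancel it cleanly if the plan goes the other way. `fetch_fonts` and `plan` are stand-ins, and the clean cancellation in the else-branch is exactly what most servers can’t yet guarantee:

```python
import asyncio

async def fetch_fonts() -> str:
    await asyncio.sleep(2.0)  # stand-in for an expensive tool call
    return "fonts ready"

async def plan() -> bool:
    await asyncio.sleep(0.5)
    return False  # planner decides fonts aren't needed after all

async def main() -> None:
    speculative = asyncio.create_task(fetch_fonts())  # fire early
    if await plan():
        # Plan confirmed: the result is already 0.5s closer.
        print(await speculative)
    else:
        # Plan changed: abort the in-flight work. This is the step
        # that requires real cancellation support server-side.
        speculative.cancel()
        try:
            await speculative
        except asyncio.CancelledError:
            print("speculative call cancelled")

asyncio.run(main())
```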

Coordination-aware architecture selection. There’s already a framework ↗ for predicting when adding agents helps versus hurts, based on task decomposability and error propagation. Choosing your architecture based on task structure rather than defaulting to a single universal pattern works today.
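As a toy version of that decision rule – the inputs follow the framework’s axes, but the cutoffs are invented for illustration:

```python
def choose_architecture(decomposability: float, tool_fraction: float) -> str:
    """decomposability: how cleanly the task splits into independent
    subtasks; tool_fraction: how much of the work is tool calls.
    Both in [0, 1]; cutoffs are illustrative, not calibrated."""
    if decomposability < 0.5 or tool_fraction > 0.6:
        return "single agent"  # coordination overhead would dominate
    return "multi-agent"

print(choose_architecture(decomposability=0.9, tool_fraction=0.2))  # multi-agent
print(choose_architecture(decomposability=0.3, tool_fraction=0.8))  # single agent
```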