AI is slow, and Agentic AI is even slower. I develop a small MCP server, I work with the Perseverance Composition Engine daily, and AI seems so, so slow. There’s so much waiting, and every mistake means yet more sitting around. Tasks that should take milliseconds (for example, does a file called Things-to-Do exist?) can take between 2 and 5 seconds, because the big brain in the cloud is being consulted multiple times, often with timeouts. It’s a very young, unstable and unreliable stack, rather like the early days of MS-DOS or the Apple ][. When AI can actually get at the data on your computer via an MCP server it can do some very interesting things, but it is not very good today.
The incredible slowness is hiding something. AI inference is getting faster, through magical-seeming techniques, prompt caching and more, and as it does, the LLM/MCP ecosystem will rediscover principles that databases and operating systems learned the hard way. When speed increases, quiet, hidden contention turns into bottlenecks.
We’ve been here before
When two-phase locking was invented, disk I/O dominated transaction time, 100ms or more per seek. Lock hold times were essentially free relative to mechanical latency, so nobody worried about them. Then hardware got faster. Hidden contention exploded. In 2010, on 48-core systems, Linux was achieving only 60% of linear scalability because of the big kernel lock. FreeBSD had already eliminated its equivalent “Giant” lock in 2003, seven years earlier, so the comparison was embarrassing. Amdahl’s Law tells you why throughput plateaus, but the deeper finding is nastier: synchronisation overhead itself grows with processor count. Faster hardware makes the synchronisation tax worse.
Databases went through the same painful arc. MVCC was invented in 1978 but remained a theoretical curiosity until systems moved to in-memory processing. Once disk I/O was out of the picture, the result was “an even higher degree of concurrency and a higher degree of lock contention”. The slowness had been acting as a natural throttle and nobody noticed until it was gone.
Limping along without knowing it
There’s a concept from distributed systems research called limplock: hardware that degrades silently while the cluster treats it as healthy, causing the whole system to crawl without ever triggering a failover. Current LLM systems aren’t literally failing, but the effect on the system is the same: the latency is throttling everything, keeping it away from the states where coordination failures would start cascading.
And that is where MCP is today.
What the current architecture isn’t solving
Most MCP setups use a single central LLM as the orchestrator. Research on Context-Aware MCP has already identified the direct consequences: repeated inference calls for every subtask create significant computational overhead, and the fixed context window forces full-context submission from all servers simultaneously, causing “context loss between steps and slower response times”. On top of that, MCP tool integration imposes substantial token-processing overhead that today’s latency simply swamps.
And adding more agents doesn’t straightforwardly fix things. One study tested 180 configurations across five canonical architectures and found a consistent tool-coordination trade-off: tool-heavy tasks suffer disproportionately from multi-agent overhead under fixed budgets, and independent agents amplify errors 17.2× compared to 4.4× under centralised coordination.
None of this is surprising in principle. It’s just invisible in practice, because the latency is doing the throttling.
What breaks, and when
Latency thresholds are tricky to pin down precisely, but order-of-magnitude inflection points are useful. The Doherty threshold — below roughly 100ms, interaction feels instantaneous; above it, it feels like waiting — has been known since 1982 and has held up under more recent scrutiny. For LLM serving specifically, this maps onto Time to First Token targets: under 200ms feels snappy for chat, under 100ms is expected for code completion. Current systems typically live well above these numbers.
When end-to-end latency drops below ~100ms, the central LLM planner becomes a clear bottleneck. Amdahl’s Law applies directly: if planning is serialised and planning is the slow step, speeding up tool execution does nothing for throughput. Faster responses also mean more tool calls per unit time, exhausting connection pools and making the absence of backpressure in most MCP implementations a real problem rather than a theoretical one.
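To make the Amdahl argument concrete, here’s a toy calculation with made-up numbers (the 30% serial fraction is purely illustrative, not a measurement of any real MCP setup):

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Maximum speedup with n-way parallelism when a fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Hypothetical workflow: 30% of wall-clock time is serialised LLM planning,
# 70% is tool execution that could in principle run in parallel.
serial = 0.3
for n in (1, 4, 16, 1000):
    print(f"{n:>5} parallel tool executors -> {amdahl_speedup(serial, n):.2f}x speedup")
```

Even with a thousand parallel tool executors, the ceiling is about 3.3×: the serialised planner, not tool execution, sets the limit.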
When inter-token latency drops below ~10ms, multi-agent systems need explicit coordination protocols — but most current designs have none. They rely implicitly on the LLM’s sequential processing to provide ordering. That’s going to fail in exactly the way a single-master database fails when write throughput increases enough.
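One minimal ingredient of an explicit coordination protocol is a logical clock, so that context updates carry an ordering of their own rather than inheriting one from the LLM’s token stream. A sketch (this is a textbook Lamport clock, not something any current MCP implementation provides):

```python
import threading

class LamportClock:
    """Logical clock giving shared-state updates an explicit total order,
    instead of relying on the LLM's sequential processing to serialise them."""

    def __init__(self) -> None:
        self._time = 0
        self._lock = threading.Lock()

    def tick(self) -> int:
        """Stamp a local event (e.g. this agent writing to shared context)."""
        with self._lock:
            self._time += 1
            return self._time

    def observe(self, remote_time: int) -> int:
        """Merge in a timestamp seen on another agent's update."""
        with self._lock:
            self._time = max(self._time, remote_time) + 1
            return self._time
```

Each agent would stamp its context writes with `tick()` and call `observe()` on every update it receives; ties can be broken by agent ID to get a total order.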
Below ~1ms, shared context stores become hot spots. Without concurrency control, shared context becomes a global lock — the MCP equivalent of the BKL. There’s also the metastability problem: systems tuned for today’s latency profiles can have hidden capacity that evaporates suddenly as speed increases, triggering an overload loop that prevents recovery.
What needs to be built
MVCC for LLM context. The database solution was to keep old versions so readers don’t block writers. The equivalent here is agent-scoped snapshots of shared state, written atomically at task boundaries. The hard prerequisite is structured context formats — monolithic context representations have no natural key-value decomposition, so you can’t do versioning on them. CA-MCP is already exploring this.
Per-agent context partitioning. Linux moved from a global kernel lock to per-VMA locks. The equivalent for MCP is replacing a single shared context store with partitioned contexts owned by individual agents, merged explicitly at aggregation points. This requires context ownership to be agreed at task-decomposition time — a constraint current LLM planners simply don’t impose.
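A sketch of the shape this takes, with ownership fixed up front and the merge made explicit (the names and the conflict-resolution hook are illustrative assumptions, not an existing MCP API):

```python
from typing import Callable

class PartitionedContext:
    """Each agent owns a private partition; an explicit merge step combines
    them at an aggregation point, instead of all agents mutating one store."""

    def __init__(self, owners: list[str]) -> None:
        # Ownership is decided at task-decomposition time.
        self._partitions: dict[str, dict[str, str]] = {o: {} for o in owners}

    def write(self, owner: str, key: str, value: str) -> None:
        # A write from an unknown owner is rejected rather than silently racing.
        if owner not in self._partitions:
            raise KeyError(f"unknown partition owner: {owner}")
        self._partitions[owner][key] = value

    def merge(self, resolve: Callable[[str, list[str]], str]) -> dict[str, str]:
        """Explicit aggregation: `resolve(key, values)` picks a winner on conflict."""
        collected: dict[str, list[str]] = {}
        for partition in self._partitions.values():
            for key, value in partition.items():
                collected.setdefault(key, []).append(value)
        return {k: vs[0] if len(vs) == 1 else resolve(k, vs)
                for k, vs in collected.items()}
```

The point of the `resolve` callback is that conflicts surface at a well-defined place, instead of being decided by whichever agent happened to write last.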
Async tool execution. Issuing tool calls speculatively before the planner has confirmed they’re needed is the MCP equivalent of out-of-order execution. It would meaningfully reduce latency in multi-step workflows. The obstacle is that most MCP server implementations don’t support clean cancellation, which you need to make speculative execution safe.
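The shape of the idea, in asyncio terms (the tool is a stub and the timings are arbitrary; real MCP cancellation would also have to undo any server-side effects, which is precisely the unsolved part):

```python
import asyncio

async def call_tool(name: str, delay: float) -> str:
    """Stand-in for an MCP tool call; the delay models network and tool latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def run_plan(plan_needs_search: bool) -> str:
    # Issue the tool call speculatively, before the planner has decided.
    speculative = asyncio.create_task(call_tool("search_files", 0.05))
    await asyncio.sleep(0.01)  # planner "thinking" time overlaps with the tool call
    if plan_needs_search:
        # Speculation paid off: part of the tool's latency is already behind us.
        return await speculative
    # Speculation wasted: cancel cleanly so partial work doesn't leak.
    speculative.cancel()
    try:
        await speculative
    except asyncio.CancelledError:
        pass
    return "skipped"
```

In-process, `Task.cancel()` makes this safe; over MCP, the equivalent would need the server to honour a cancellation and roll back side effects, which most implementations can’t do today.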
Coordination-aware architecture selection. There’s already a framework for predicting when adding agents helps versus hurts, based on task decomposability and error propagation (R²=0.52 cross-validated). Using this at design time — choosing your architecture based on task structure rather than defaulting to a single universal pattern — is something you could do today.
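A design-time rule in that spirit might look like the following. To be clear, the thresholds and inputs here are my own illustration, not the cited framework’s fitted model; the only numbers taken from the studies above are the ~17× vs ~4× error-amplification figures in the comments:

```python
def choose_architecture(decomposability: float, tool_calls_per_subtask: float) -> str:
    """Illustrative design-time heuristic: pick a coordination architecture
    from task structure instead of defaulting to one universal pattern.
    `decomposability` is a 0..1 score for how cleanly the task splits."""
    if decomposability < 0.3:
        # Barely decomposable: multi-agent overhead outweighs any split.
        return "single-agent"
    if tool_calls_per_subtask > 5:
        # Tool-heavy: centralised coordination keeps error amplification
        # near the ~4x regime rather than the ~17x of independent agents.
        return "centralised-coordinator"
    # Cleanly separable, light tool use: independence is cheapest.
    return "independent-agents"
```

The specific cutoffs matter less than the discipline: the architecture is chosen from measurable task properties before any agent runs.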
Conclusion
Slowness is a form of implicit coordination. Remove it, and you need to replace it with something explicit or things collapse. Databases learned this. Operating systems learned this. Distributed systems learned this, repeatedly and painfully.
The LLM/MCP ecosystem is still in the “slow enough to be simple” phase, but that isn’t going to last long.