live · 2026-05-22 · 2 min read

An LLM backend that survives concurrency

A backend that "survives concurrency" is making a tail-latency claim, not an average one. Here is the staged pipeline I run behind an AI answer box, and the three boring guardrails that keep the slow tail from eating everyone.

When someone says a backend "survives concurrency," they don't mean it's fast on average. They mean the slow tail doesn't eat everyone else. That distinction decides the whole design.

The work itself is a pipeline, not a prompt. A question comes in; it leaves as a grounded, polished answer through four stages — each one staged precisely so a failure in one doesn't smear into the next.

pre-search: grounding · answer from evidence, not priors
intent: detect N intents in one utterance
route: fan-out tools · concurrency-capped
polish: CoT refine once
Fig 1 — four stages. Ground first, detect intents once, fan extraction and tools out in parallel, refine once at the end.

Staging isn't decoration. Intent detection feeds slot extraction, and "errors made by the intent detector propagate to the slot filler" — a known pipeline hazard. So I detect intents once, then fan out one extractor per intent concurrently: a slow or weird slot can't stall the others, and the detector's mistake doesn't get re-derived four times.
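A minimal sketch of that detect-once, fan-out-per-intent shape, using Python's asyncio. The names `detect_intents`, `extract_slots`, and `handle` are hypothetical stand-ins, not the real pipeline; the detector here just splits on " and ", and the per-intent timeout is an assumed value.

```python
import asyncio

async def detect_intents(utterance: str) -> list[str]:
    # Hypothetical detector: stands in for one model call per utterance.
    return [part.strip() for part in utterance.split(" and ") if part.strip()]

async def extract_slots(intent: str) -> dict:
    # Hypothetical extractor: stands in for one model call per intent.
    await asyncio.sleep(0.01)
    if "weird" in intent:
        raise ValueError(f"cannot parse: {intent}")
    return {"intent": intent, "slots": intent.split()}

async def handle(utterance: str, per_intent_timeout: float = 1.0) -> list[dict]:
    intents = await detect_intents(utterance)  # detect once, never re-derived
    # One extractor per intent, each with its own timeout, all running concurrently.
    jobs = [asyncio.wait_for(extract_slots(i), per_intent_timeout) for i in intents]
    results = await asyncio.gather(*jobs, return_exceptions=True)
    out = []
    for intent, res in zip(intents, results):
        if isinstance(res, Exception):
            # A slow or failing slot is recorded; the other intents are unaffected.
            out.append({"intent": intent, "error": repr(res)})
        else:
            out.append(res)
    return out

if __name__ == "__main__":
    print(asyncio.run(handle("book flight and weird thing and check weather")))
```

The `return_exceptions=True` flag is what keeps one bad extractor from cancelling its siblings: each intent comes back as either a result or an error record.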

Then the part that actually decides whether it survives load — three guardrails, none of them clever, all of them mandatory:

Second: a semaphore is not a rate limiter. LLM and tool APIs enforce requests-per-minute and tokens-per-minute at once, so one fifty-slot semaphore happily lets fifty big prompts blow the token ceiling and cascade into 429s. You need two layers — a request semaphore for RPM headroom and a token bucket for rolling TPM — plus full-jitter backoff so retries don't synchronize into a thundering herd.
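The two layers plus full-jitter backoff can be sketched roughly as follows. Everything named here (`TokenBucket`, `Limiter`, `RateLimited`, `full_jitter`) is hypothetical, the token count per call is assumed to be an estimate supplied by the caller, and a semaphore only bounds in-flight requests, so it buys RPM headroom indirectly rather than enforcing RPM.

```python
import asyncio
import random
import time

class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 response."""

def full_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Full jitter: sleep a uniform random time in [0, min(cap, base * 2^attempt)].
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

class TokenBucket:
    """Token bucket approximating a rolling tokens-per-minute ceiling."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens refilled per second
        self.available = self.capacity
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: int) -> None:
        if tokens > self.capacity:
            raise ValueError("request is larger than the whole TPM budget")
        async with self._lock:  # waiters are served one at a time, in order
            while True:
                now = time.monotonic()
                refill = (now - self.updated) * self.rate
                self.available = min(self.capacity, self.available + refill)
                self.updated = now
                if self.available >= tokens:
                    self.available -= tokens
                    return
                await asyncio.sleep((tokens - self.available) / self.rate)

class Limiter:
    def __init__(self, max_in_flight: int, tokens_per_minute: int):
        self.sem = asyncio.Semaphore(max_in_flight)   # layer 1: request slots
        self.bucket = TokenBucket(tokens_per_minute)  # layer 2: token budget

    async def call(self, fn, est_tokens: int, max_attempts: int = 5):
        for attempt in range(max_attempts):
            await self.bucket.acquire(est_tokens)  # pay for tokens before taking a slot
            async with self.sem:
                try:
                    return await fn()
                except RateLimited:
                    if attempt == max_attempts - 1:
                        raise
            # Back off outside the semaphore so a sleeping retry holds no slot.
            await asyncio.sleep(full_jitter(attempt))
```

Acquiring tokens before the semaphore slot is a design choice in this sketch: a large prompt waits for budget without occupying one of the request slots while it waits.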

Third: a deadline is a budget you pass down, not a single timeout. Allocate a global budget per request and spend it across nested scopes — per-tool, per-fan-out, per-turn, workflow — logging what's left on every hop. That's the difference between a deadline that's real and one that's aspirational; a tight global budget is how you hold p99 near the budget instead of near infinity.
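One way to pass a budget down is sketched below, under assumptions: `Budget`, `child`, and `run` are hypothetical names, the scope sizes (0.5 s, 0.3 s, 0.2 s) are made up for illustration, and `print` stands in for real logging.

```python
import asyncio
import time

class Budget:
    """A deadline expressed as remaining time, handed down to nested scopes."""

    def __init__(self, seconds: float):
        self.deadline = time.monotonic() + seconds

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

    def child(self, cap: float) -> "Budget":
        # A nested scope gets at most `cap` seconds, and never more than
        # its parent has left at the moment the child is created.
        return Budget(min(cap, self.remaining()))

    async def run(self, coro, label: str):
        left = self.remaining()
        print(f"[budget] {label}: {left:.3f}s left")  # log what is left on every hop
        if left <= 0:
            coro.close()  # never started; close it to avoid a "never awaited" warning
            raise asyncio.TimeoutError(label)
        return await asyncio.wait_for(coro, timeout=left)

async def tool(delay: float) -> str:
    # Hypothetical tool call that takes `delay` seconds.
    await asyncio.sleep(delay)
    return f"done after {delay}s"

async def workflow() -> list:
    request = Budget(0.5)          # global budget for the whole request
    fan_out = request.child(0.3)   # the fan-out may spend at most 0.3s of it
    calls = [fan_out.child(0.2).run(tool(d), f"tool-{d}") for d in (0.05, 1.0)]
    return await asyncio.gather(*calls, return_exceptions=True)

if __name__ == "__main__":
    print(asyncio.run(workflow()))
```

The fast tool returns normally and the slow one times out at its own 0.2 s scope, well inside the 0.5 s request budget, so the overrun is contained at the hop where it happened.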

Underneath all of it, the single biggest lever is at the serving layer: a global concurrency cap with continuous batching. It lifts GPU utilization from ~35% to ~80% and throughput by 2–4×, and, the part that matters, it tightens the tail, because long sequences stop blocking short ones. Chunked prefill is the trick that makes long inputs survivable: it cuts p95 time-to-first-token by ~68% while p50 barely moves.

−68% p95 TTFT from chunked prefill (p50 flat)
2–4× throughput from continuous batching
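Why chunking helps the tail can be seen in a toy head-of-line model. This is only a sketch under loud assumptions: one worker, a cost of one step per prompt token, all requests arriving at once, decode ignored, and round-robin slicing standing in for a real scheduler. The function name `ttft` and the prompt sizes are hypothetical, and the output says nothing about the measured figures above.

```python
from collections import deque
from typing import Optional

def ttft(prompts: list[int], chunk: Optional[int]) -> list[int]:
    """Time-to-first-token per request, in token-steps.

    chunk=None prefills each prompt in one piece (FIFO);
    chunk=k prefills at most k tokens of a prompt, then moves to the next request.
    """
    queue = deque(enumerate(prompts))  # (request index, prompt tokens left)
    done = [0] * len(prompts)
    clock = 0
    while queue:
        i, left = queue.popleft()
        step = left if chunk is None else min(chunk, left)
        clock += step
        left -= step
        if left:
            queue.append((i, left))  # unfinished prefill goes to the back
        else:
            done[i] = clock          # prefill complete: first token can be emitted
    return done

if __name__ == "__main__":
    prompts = [8000, 200, 200, 200]  # one long prompt ahead of three short ones
    print(ttft(prompts, None))  # [8000, 8200, 8400, 8600]
    print(ttft(prompts, 512))   # [8600, 712, 912, 1112]
```

In this toy, the three short requests reach their first token at 712, 912, and 1,112 steps instead of 8,200 to 8,600, while the long one slips from 8,000 to 8,600. The long request pays a little so the short ones stop waiting behind it.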

p50-flat, p95-cut is the literal signature of surviving concurrency — you didn't get faster on average, you stopped the tail from punishing everyone. None of this lives in a framework, which is the last decision: the business core — staging, the limiter, the budget — is framework-agnostic, because the frameworks get concurrency wrong, and a thin adapter is easy to replace. The part that survives load should outlive whatever's wrapping it this year.