An LLM backend that survives concurrency
A backend that "survives concurrency" is making a tail-latency claim, not an average one. Here is the staged pipeline I run behind an AI answer box, and the three boring guardrails that keep the slow tail from eating everyone.
When someone says a backend "survives concurrency," they don't mean it's fast on average. They mean the slow tail doesn't eat everyone else. That distinction decides the whole design.
The work itself is a pipeline, not a prompt. A question comes in and leaves as a grounded, polished answer after four stages, each isolated so that a failure in one doesn't smear into the next.
Staging isn't decoration. Intent detection feeds slot extraction, and "errors made by the intent detector propagate to the slot filler" — a known pipeline hazard. So I detect intents once, then fan out one extractor per intent concurrently: a slow or weird slot can't stall the others, and the detector's mistake doesn't get re-derived four times.
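As a rough illustration of "detect once, then fan out," here is a minimal asyncio sketch. The names `detect_intents`, `extract_slots`, and `run_pipeline` are hypothetical stand-ins, not the real pipeline, and the keyword matching only takes the place of a model call. It assumes each extractor can be given its own timeout so that one slow slot degrades into an error record instead of stalling its siblings.

```python
import asyncio


async def detect_intents(question: str) -> list[str]:
    # Hypothetical stand-in for a single detector call (an LLM or classifier in practice).
    await asyncio.sleep(0)
    found = [name for name in ("refund", "shipping") if name in question.lower()]
    return found or ["general"]


async def extract_slots(intent: str, question: str) -> dict:
    # Hypothetical stand-in for one extractor call scoped to one intent.
    await asyncio.sleep(0)
    return {"intent": intent, "slots": {"question_chars": len(question)}}


async def extract_with_timeout(intent: str, question: str, timeout_s: float) -> dict:
    # Each extractor is isolated: a timeout or error becomes a per-intent
    # error record and never propagates to the other extractors.
    try:
        return await asyncio.wait_for(extract_slots(intent, question), timeout_s)
    except Exception as exc:
        return {"intent": intent, "error": type(exc).__name__}


async def run_pipeline(question: str, timeout_s: float = 2.0) -> list[dict]:
    intents = await detect_intents(question)  # the detector runs exactly once
    results = await asyncio.gather(
        *(extract_with_timeout(i, question, timeout_s) for i in intents)
    )
    return list(results)  # gather preserves the order of `intents`


if __name__ == "__main__":
    print(asyncio.run(run_pipeline("Refund and shipping status?")))
```

The design choice worth noting is that failures are converted to data at the extractor boundary, so the later stages can decide what to do with a missing slot rather than inheriting an exception.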
Then the part that actually decides whether it survives load — three guardrails, none of them clever, all of them mandatory:
Second: a semaphore is not a rate limiter. LLM and tool APIs enforce requests-per-minute and tokens-per-minute at once, so one fifty-slot semaphore happily lets fifty big prompts blow the token ceiling and cascade into 429s. You need two layers — a request semaphore for RPM headroom and a token bucket for rolling TPM — plus full-jitter backoff so retries don't synchronize into a thundering herd.
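The two layers plus jittered retries can be sketched as follows. This is a minimal sketch under stated assumptions, not a production limiter: `TokenBucket`, `RateLimited`, `call_limited`, and `full_jitter` are illustrative names, `est_tokens` is assumed to be a per-call token estimate you compute yourself, and `RateLimited` stands in for whatever 429 error your client raises. Note that the semaphore caps in-flight requests, which only approximates RPM headroom.

```python
import asyncio
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 error."""


class TokenBucket:
    """Rolling tokens-per-minute budget that refills continuously."""

    def __init__(self, tokens_per_minute: float, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0  # tokens refilled per second
        self.level = self.capacity
        self.clock = clock
        self.last = clock()

    def try_take(self, n: float) -> float:
        """Take n tokens if available; return seconds to wait (0.0 on success)."""
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if n <= self.level:
            self.level -= n
            return 0.0
        return (n - self.level) / self.rate

    async def take(self, n: float) -> None:
        if n > self.capacity:
            raise ValueError("request is larger than the whole per-minute budget")
        while (wait := self.try_take(n)) > 0:
            await asyncio.sleep(wait)


def full_jitter(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random.random) -> float:
    # Full jitter: sleep a uniform random time in [0, min(cap, base * 2**attempt)].
    return rng() * min(cap, base * 2 ** attempt)


async def call_limited(fn, est_tokens, sem, bucket, max_retries=5, backoff=full_jitter):
    for attempt in range(max_retries + 1):
        await bucket.take(est_tokens)  # layer 2: token budget
        async with sem:                # layer 1: in-flight request cap
            try:
                return await fn()
            except RateLimited:
                if attempt == max_retries:
                    raise
        # Back off outside the semaphore so a sleeping retry doesn't hold a slot.
        await asyncio.sleep(backoff(attempt))
```

Taking from the bucket before acquiring the semaphore means a large prompt waits for token headroom without occupying a request slot, which is the failure the single fifty-slot semaphore cannot prevent.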
Third: a deadline is a budget you pass down, not a single timeout. Allocate a global budget per request and spend it across nested scopes (workflow, turn, fan-out, tool), logging what's left on every hop. That's the difference between a deadline that's real and one that's aspirational; a tight global budget is how you hold p99 near the budget instead of near infinity.
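One way to make "a budget you pass down" concrete is a small deadline object whose child scopes can only shrink it. This is a sketch, assuming asyncio and a monotonic clock; the `Budget` class, its `child` and `run` methods, and the scope names are hypothetical, not taken from any framework.

```python
import asyncio
import logging
import time

log = logging.getLogger("budget")


class Budget:
    """A deadline passed down the call tree; child scopes can only shrink it."""

    def __init__(self, seconds: float, clock=time.monotonic, name: str = "request"):
        self.clock = clock
        self.name = name
        self.deadline = clock() + seconds

    def remaining(self) -> float:
        return max(0.0, self.deadline - self.clock())

    def child(self, name: str, cap_s: float) -> "Budget":
        # A child gets at most cap_s, and never more than the parent has left.
        sub = Budget(min(cap_s, self.remaining()), self.clock, f"{self.name}/{name}")
        log.info("%s: %.3fs left (parent has %.3fs)", sub.name, sub.remaining(), self.remaining())
        return sub

    async def run(self, coro):
        # Enforce whatever is left at the moment of the call, not a fixed timeout.
        return await asyncio.wait_for(coro, timeout=self.remaining())


async def answer(question: str) -> str:
    request = Budget(8.0)                    # global budget for the whole request
    turn = request.child("turn", 6.0)        # per-turn scope
    tool = turn.child("tool:search", 2.0)    # per-tool scope, capped by the turn
    await tool.run(asyncio.sleep(0))         # stand-in for a real tool call
    return f"answered with {request.remaining():.1f}s to spare"
```

Because every scope reads its limit from the parent at creation time, a slow early stage automatically leaves less for the stages after it, and the log line on each hop shows where the time went.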
Underneath all of it, the single biggest lever is at the serving layer: a global concurrency cap with continuous batching. It lifts GPU utilization from ~35% to ~80%, raises throughput 2–4×, and, the part that matters, tightens the tail, because long sequences stop blocking short ones. Chunked prefill is the trick that makes long inputs survivable: it cuts p95 time-to-first-token by ~68% while p50 barely moves.
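To see why long sequences stop blocking short ones, here is a toy scheduling model, not a description of any real serving engine. It assumes every request arrives at time 0, each needs a fixed number of decode steps, and a step costs the same regardless of how full the batch is. The function names `static_batching` and `continuous_batching` are illustrative.

```python
import heapq


def static_batching(steps: list[int], slots: int) -> list[int]:
    """Groups of `slots` requests; the next group waits for the longest member."""
    finish, start = [], 0
    for i in range(0, len(steps), slots):
        group = steps[i:i + slots]
        finish += [start + s for s in group]
        start += max(group)  # the whole batch is held until its longest request ends
    return finish


def continuous_batching(steps: list[int], slots: int) -> list[int]:
    """A freed slot is refilled immediately from the queue."""
    free_at = [0] * slots  # min-heap of the times at which each slot becomes free
    heapq.heapify(free_at)
    finish = []
    for s in steps:
        t = heapq.heappop(free_at)  # earliest available slot
        finish.append(t + s)
        heapq.heappush(free_at, t + s)
    return finish


if __name__ == "__main__":
    workload = [100, 5, 5, 5, 5, 5]  # one long request ahead of five short ones
    print(static_batching(workload, slots=2))      # [100, 5, 105, 105, 110, 110]
    print(continuous_batching(workload, slots=2))  # [100, 5, 10, 15, 20, 25]
```

In this toy workload the long request finishes at step 100 either way, but the short requests queued behind it finish at steps 10 through 25 instead of 105 through 110, which is the tail tightening rather than an average speedup.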
p50-flat, p95-cut is the literal signature of surviving concurrency — you didn't get faster on average, you stopped the tail from punishing everyone. None of this lives in a framework, which is the last decision: the business core — staging, the limiter, the budget — is framework-agnostic, because the frameworks get concurrency wrong, and a thin adapter is easy to replace. The part that survives load should outlive whatever's wrapping it this year.