Voice AI latency budget: where the milliseconds actually go

  • Heads of Ops
  • Procurement / IT-Sec
  • CX directors
By Lewis Crook
Bottom line up front

Latency does not degrade evenly. It collapses one step at a time, and the step is usually retrieval, not the model. This is the per-step budget and the diagnostic that finds the regression in minutes, not weeks.

The end-to-end budget — 1.0 second p95 target

A defensible production target is 1.0 second p95 end-to-end turn latency under realistic load. The budget below allocates the full second across six steps; the stretch column and a 5–10% network allowance absorb the unpredictable.

Per-step latency budget (p95, milliseconds)
Step              | Budget (ms) | Stretch (ms) | Diagnostic question
VAD + endpointing | 120         | 200          | Is endpointing triggered by silence or by semantic completion?
ASR (final)       | 180         | 300          | Is the final transcript streaming or wait-for-complete?
Retrieval         | 150         | 350          | How many calls is the retrieval making, and against what indexes?
LLM inference     | 350         | 600          | First-token latency or full response? Streaming to TTS?
TTS (first audio) | 150         | 250          | Streaming TTS or full-buffer synthesis?
Telephony egress  | 50          | 100          | Carrier route stable? p99 jitter measured?
Total p95         | 1000        | 1800         | Add 5–10% network overhead headroom
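
The table is easier to enforce when it lives next to the dashboards as data. The sketch below is one way to encode it, not a prescribed schema: the step keys and the names BUDGET_MS, totals, headroom_ms and classify are hypothetical, and only the millisecond values come from the table.

```python
# Per-step p95 budget from the table: step -> (budget_ms, stretch_ms).
# Key names are illustrative; the numbers are the guide's.
BUDGET_MS = {
    "vad_endpointing": (120, 200),
    "asr_final": (180, 300),
    "retrieval": (150, 350),
    "llm_inference": (350, 600),
    "tts_first_audio": (150, 250),
    "telephony_egress": (50, 100),
}


def totals(budget=BUDGET_MS):
    """Sum the budget and stretch columns."""
    return (
        sum(b for b, _ in budget.values()),
        sum(s for _, s in budget.values()),
    )


def headroom_ms(total_ms, low_pct=5, high_pct=10):
    """Network headroom range (low, high) in ms for the 5-10% guidance."""
    return (total_ms * low_pct // 100, total_ms * high_pct // 100)


def classify(step, measured_p95_ms, budget=BUDGET_MS):
    """Place a measured per-step p95 against its budget and stretch values."""
    budget_ms, stretch_ms = budget[step]
    if measured_p95_ms <= budget_ms:
        return "within_budget"
    if measured_p95_ms <= stretch_ms:
        return "within_stretch"
    return "over_stretch"


if __name__ == "__main__":
    print(totals())                    # (1000, 1800)
    print(headroom_ms(totals()[0]))    # (50, 100)
    print(classify("retrieval", 210))  # within_stretch
```

A retrieval step measured at 210ms p95 is over its 150ms budget but inside the 350ms stretch, so it is flagged without being treated as an outage.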

Where regressions usually come from

In production, latency regressions cluster in a small set of causes. The diagnostic order below catches most of them within an hour.

  1. Retrieval — usually a new index or a misconfigured cache. Check call count per turn and per-call latency before suspecting the model.
  2. LLM provider — silent model version change, or a regional capacity event. Check the provider status page and your own per-region p95 first.
  3. TTS — voice changes, or a fallback to non-streaming synthesis under load. Check whether streaming TTS is still active end-to-end.
  4. Telephony — carrier route change, or a new media gateway. Check p99 jitter and packet loss before suspecting any application step.
  5. ASR — usually stable; if it moves, check whether the model version was changed or whether the language pack was updated.
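
The ordering above can be applied mechanically to a before-and-after comparison of per-step p95. This is a rough sketch under stated assumptions: the step keys, the triage function and the 25ms noise floor are all hypothetical, and only the investigation order comes from the list.

```python
# Investigation order taken from the numbered list above.
DIAGNOSTIC_ORDER = ["retrieval", "llm", "tts", "telephony", "asr"]


def triage(baseline_p95_ms, current_p95_ms, min_regression_ms=25):
    """Return (step, regression_ms) pairs in the order they should be checked.

    min_regression_ms is a hypothetical noise floor, not a value from the guide.
    """
    findings = []
    for step in DIAGNOSTIC_ORDER:
        if step not in baseline_p95_ms or step not in current_p95_ms:
            continue  # no data for this step, so nothing to compare
        delta = current_p95_ms[step] - baseline_p95_ms[step]
        if delta >= min_regression_ms:
            findings.append((step, delta))
    return findings


if __name__ == "__main__":
    baseline = {"retrieval": 140, "llm": 340, "tts": 145, "telephony": 45, "asr": 175}
    current = {"retrieval": 260, "llm": 345, "tts": 150, "telephony": 45, "asr": 235}
    print(triage(baseline, current))  # [('retrieval', 120), ('asr', 60)]
```

The output is a checklist rather than a diagnosis: retrieval comes first because it is the usual culprit, even when another step has also moved.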

Barge-in and the latency budget

Barge-in adds a separate budget: from the moment the caller starts speaking over the AI to the moment the AI stops outputting audio, the target is under 250ms p95. Above 500ms, barge-in reads as the AI ignoring the caller, which is worse than no barge-in at all.
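
Measured from event timestamps, the barge-in check is a few lines. The sketch below assumes each event is a hypothetical (caller_speech_onset_ms, ai_audio_stop_ms) pair and uses a nearest-rank p95; the label for the 250–500ms band is an assumption, since the guide only names the target and the failure point.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def barge_in_status(events):
    """events: list of (caller_speech_onset_ms, ai_audio_stop_ms) pairs."""
    latencies = [stop - onset for onset, stop in events]
    value = p95(latencies)
    if value < 250:
        status = "on_target"
    elif value <= 500:
        status = "above_target"  # hypothetical label for the 250-500ms band
    else:
        status = "reads_as_ignoring"
    return value, status


if __name__ == "__main__":
    # 18 fast interruptions and 2 slow ones, timestamps in ms.
    events = [(1000 * i, 1000 * i + 200) for i in range(18)]
    events += [(50000, 50600), (60000, 60600)]
    print(barge_in_status(events))  # (600, 'reads_as_ignoring')
```

Two slow interruptions in twenty are enough to push the p95 past 500ms, which is the point of tracking the percentile rather than the average.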

What to measure weekly

  • End-to-end p95 turn latency under realistic load — not single-threaded demo
  • Per-step p95 latency for ASR, retrieval, LLM, TTS, telephony
  • Barge-in p95 (caller speech onset to AI audio stop)
  • p99 jitter and packet loss on the telephony leg
  • Latency outliers (above the stretch budget) tagged by step and root cause
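
As a rough illustration of the per-step and outlier items, the sketch below computes per-step p95 and tags every measurement above its stretch value. The record shape (a dict with a turn_id and one latency per step) and the function names are assumptions; the stretch thresholds come from the table, and root cause is left blank for a human to fill in.

```python
# Stretch thresholds (ms) from the per-step table; key names are illustrative.
STRETCH_MS = {"asr": 300, "retrieval": 350, "llm": 600, "tts": 250, "telephony": 100}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def weekly_report(turns):
    """turns: list of dicts such as {"turn_id": 7, "asr": 170, "retrieval": 140, ...}."""
    per_step_p95 = {step: p95([t[step] for t in turns]) for step in STRETCH_MS}
    outliers = [
        {"turn_id": t["turn_id"], "step": step, "latency_ms": t[step], "root_cause": None}
        for t in turns
        for step in STRETCH_MS
        if t[step] > STRETCH_MS[step]
    ]
    return {"per_step_p95": per_step_p95, "outliers": outliers}


if __name__ == "__main__":
    turns = [
        {"turn_id": i, "asr": 170, "retrieval": 140, "llm": 330, "tts": 140, "telephony": 40}
        for i in range(20)
    ]
    turns[7]["retrieval"] = 900  # one slow retrieval call
    report = weekly_report(turns)
    print(report["per_step_p95"]["retrieval"])  # 140
    print(report["outliers"])
```

In this synthetic week the single 900ms retrieval call does not move the retrieval p95 at all, which is why the outlier list is tracked alongside the percentile rather than instead of it.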
Do this on Monday

Pull yesterday's p95 latency per step from your observability tool. If you cannot produce a per-step breakdown by lunch, that is the procurement gap to close before the next intent ships.

Key takeaways
  • Target 1.0 second p95 end-to-end turn latency under realistic load — not mean, not single-threaded.
  • Retrieval is the most common regression cause; LLM provider is the most common cause executives blame first.
  • Barge-in has a separate budget — under 250ms p95 from caller speech onset to AI audio stop.
  • Measure per-step p95 weekly, not just the end-to-end number.
  • Streaming end-to-end (ASR → LLM → TTS) is not optional at the 1.0-second target.

Frequently asked questions

Is 1.0 second p95 realistic?
Yes, but it requires streaming end-to-end (ASR final streaming into LLM, LLM streaming into TTS), disciplined retrieval design, and a carrier-grade telephony path. Most production deployments without those land at 1.4–1.8 seconds and read as slightly laggy.
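
A back-of-envelope model shows why streaming matters so much. Every number in the sketch below is an illustrative assumption (first-token time, per-token time, response length, and how many tokens TTS needs before it can start); none of them is a measurement from a real provider.

```python
def first_audio_ms(first_token_ms, per_token_ms, tokens_before_tts, tts_first_audio_ms):
    """Time from LLM request to first audio, in ms.

    tokens_before_tts is how many LLM tokens must exist before TTS can start:
    a short first clause when streaming, the whole response when buffering.
    """
    llm_wait_ms = first_token_ms + (tokens_before_tts - 1) * per_token_ms
    return llm_wait_ms + tts_first_audio_ms


# Hypothetical figures for one turn.
FIRST_TOKEN_MS = 200
PER_TOKEN_MS = 20
TTS_FIRST_AUDIO_MS = 150

streaming = first_audio_ms(FIRST_TOKEN_MS, PER_TOKEN_MS, 8, TTS_FIRST_AUDIO_MS)
buffered = first_audio_ms(FIRST_TOKEN_MS, PER_TOKEN_MS, 60, TTS_FIRST_AUDIO_MS)

if __name__ == "__main__":
    print(streaming, buffered)  # 490 1530
```

Under these assumptions the streaming path fits inside the combined 500ms LLM and TTS budget, while waiting for a 60-token response spends more than the whole one-second target before any audio plays.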
Why is mean latency a misleading number?
Because callers experience tail latency, not mean. A deployment with mean 800ms and p95 2400ms feels broken on roughly one turn in twenty, and that is the turn the executive hears about.
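
A small synthetic example makes the gap concrete. The latencies below are constructed for illustration only, chosen so the mean lands on 800ms and the nearest-rank p95 on 2400ms; they are not production data.

```python
import statistics


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# 100 synthetic turn latencies in ms: most are fine, six are very slow.
latencies = [650] * 4 + [700] * 90 + [2400] * 6

mean_ms = statistics.mean(latencies)
p95_ms = p95(latencies)
slow_share = sum(1 for x in latencies if x >= 2400) / len(latencies)

if __name__ == "__main__":
    print(f"mean={mean_ms:.0f} p95={p95_ms} slow_share={slow_share:.0%}")
    # mean=800 p95=2400 slow_share=6%
```

The mean says the deployment is comfortably under a second; the p95 says six turns in a hundred take 2.4 seconds.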
How much headroom should the budget leave?
Five to ten percent for unpredictable network and carrier variance. The stretch column in the table above is what a single bad call leg can sustain without breaking the conversation.

Terms used in this guide

  • Voice AI latency: the gap before the system starts talking back.
  • Turn-taking latency: the awkward pause before the bot starts talking back.
  • Voice AI: software that answers the phone, understands what the caller wants, and takes action, not just a smarter IVR.
  • Barge-in: lets the caller interrupt the bot without breaking the conversation.
Last reviewed: 2026-06-15. This guide is updated when production patterns shift; see the corrections page to flag anything that no longer matches reality.
About the author
Lewis Crook
Practitioner writer on enterprise voice AI

Lewis Crook has 20 years in enterprise technology, from FTSE 100 voice deployments to over a million AI-handled minutes a month across Asia-Pacific. Buyer, builder, and now working with CX leaders on enterprise voice AI. Writes The Voice AI Brief.
