Voice AI latency budget: where the milliseconds actually go

Heads of Ops
Procurement / IT-Sec
CX directors

By Lewis Crook. Published June 15, 2026

Bottom line up front

Latency does not degrade evenly. It collapses one step at a time, and the step is usually retrieval, not the model. This is the per-step budget and the diagnostic that finds the regression in minutes, not weeks.

The end-to-end budget — 1.0 second p95 target

A defensible production target is 1.0 second p95 end-to-end turn latency under realistic load. The budget below allocates that across six steps and leaves headroom for the unpredictable.

Per-step latency budget (p95, milliseconds)

Step	Budget (ms)	Stretch (ms)	Diagnostic question
VAD + endpointing	120	200	Is endpointing triggered by silence or by semantic completion?
ASR (final)	180	300	Is the final transcript streaming or wait-for-complete?
Retrieval	150	350	How many calls is the retrieval making, and against what indexes?
LLM inference	350	600	First-token latency or full response? Streaming to TTS?
TTS (first audio)	150	250	Streaming TTS or full-buffer synthesis?
Telephony egress	50	100	Carrier route stable? p99 jitter measured?
Total p95	1000	1800	Add 5–10% network overhead headroom
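
As a quick sanity check, the table can be held as data and the column totals recomputed. This is a minimal sketch: the step keys and function names are illustrative, and it reads the "add 5–10%" note as headroom applied on top of the step total, which is an interpretation rather than something the table spells out.

```python
# Per-step p95 values copied from the table above, as (budget_ms, stretch_ms).
# The key names are illustrative, not taken from any particular tool.
BUDGET_MS = {
    "vad_endpointing": (120, 200),
    "asr_final": (180, 300),
    "retrieval": (150, 350),
    "llm_inference": (350, 600),
    "tts_first_audio": (150, 250),
    "telephony_egress": (50, 100),
}


def totals(budget):
    """Return (sum of the budget column, sum of the stretch column)."""
    return (
        sum(b for b, _ in budget.values()),
        sum(s for _, s in budget.values()),
    )


def with_headroom(total_ms, percent):
    """Apply a percentage of network headroom on top of a total (assumed reading)."""
    return total_ms * (100 + percent) / 100


budget_total, stretch_total = totals(BUDGET_MS)
print(budget_total, stretch_total)  # 1000 1800
print(with_headroom(budget_total, 5), with_headroom(budget_total, 10))  # 1050.0 1100.0
```

Keeping the budget in one structure like this makes it harder for a step's allowance to grow quietly without the total being rechecked.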

Where regressions usually come from

In production, latency regressions cluster in a small set of causes. The diagnostic order below catches most within an hour.

Retrieval — usually a new index or a misconfigured cache. Check call count per turn and per-call latency before suspecting the model.
LLM provider — silent model version change, or a regional capacity event. Check the provider status page and your own per-region p95 first.
TTS — voice changes, or a fallback to non-streaming synthesis under load. Check whether streaming TTS is still active end-to-end.
Telephony — carrier route change, or a new media gateway. Check p99 jitter and packet loss before suspecting any application step.
ASR — usually stable; if it moves, check whether the model version was changed or whether the language pack was updated.
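
The diagnostic order above can be sketched as a simple triage function: walk the steps in that order and stop at the first one whose measured p95 exceeds its budget. The step keys, the function name, and the sample measurements are hypothetical; the budgets come from the per-step table.

```python
# Diagnostic order from the list above: most likely cause first.
DIAGNOSTIC_ORDER = ["retrieval", "llm", "tts", "telephony", "asr"]

# p95 budgets in milliseconds, taken from the per-step table.
BUDGET_MS = {"retrieval": 150, "llm": 350, "tts": 150, "telephony": 50, "asr": 180}


def first_suspect(measured_p95_ms):
    """Return the first step, in diagnostic order, whose p95 is over budget.

    Steps missing from the measurements are skipped. Returns None when
    every measured step is within budget.
    """
    for step in DIAGNOSTIC_ORDER:
        measured = measured_p95_ms.get(step)
        if measured is not None and measured > BUDGET_MS[step]:
            return step
    return None


# Hypothetical measurements: retrieval and LLM are both over budget,
# but retrieval comes first in the diagnostic order.
sample = {"retrieval": 410, "llm": 380, "tts": 140, "telephony": 45, "asr": 170}
print(first_suspect(sample))  # retrieval
```

This only points at where to look first. The checks named in each item (call count per turn, provider status, streaming still active, jitter and packet loss) still have to be done by hand or by your own tooling.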

Barge-in and the latency budget

Barge-in adds a separate budget: from the moment the caller starts speaking over the AI to the moment the AI stops outputting audio, the target is under 250 ms p95. Above 500 ms, barge-in reads as the AI ignoring the caller, which is worse than no barge-in at all.
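
A small sketch of those two thresholds as a check on a measured barge-in p95. The function name and the labels are assumptions for illustration; the text gives only the 250 ms target and the 500 ms failure point, so the middle band is labelled here simply as over target.

```python
BARGE_IN_TARGET_MS = 250  # p95 target stated above
BARGE_IN_BROKEN_MS = 500  # above this, barge-in reads as ignoring the caller


def classify_barge_in(p95_ms):
    """Map a barge-in p95 (caller speech onset to AI audio stop) to a label."""
    if p95_ms < BARGE_IN_TARGET_MS:
        return "on_target"
    if p95_ms <= BARGE_IN_BROKEN_MS:
        # The text does not name this band; the label is an assumption.
        return "over_target"
    return "reads_as_ignoring_caller"


for value in (180, 320, 640):
    print(value, classify_barge_in(value))
```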

What to measure weekly

End-to-end p95 turn latency under realistic load, not a single-threaded demo
Per-step p95 latency for ASR, retrieval, LLM, TTS, telephony
Barge-in p95 (caller speech onset to AI audio stop)
p99 jitter and packet loss on the telephony leg
Latency outliers (above the stretch budget) tagged by step and root cause
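
One way the per-step p95 and the outlier list could be produced from raw per-turn timings is sketched below. It assumes each turn is exported as a mapping of step name to latency in milliseconds, which is a hypothetical shape, and it uses a nearest-rank percentile; observability tools may interpolate differently. Root cause is left for a human to fill in.

```python
from collections import defaultdict

# Stretch values in milliseconds from the per-step table.
STRETCH_MS = {"asr": 300, "retrieval": 350, "llm": 600, "tts": 250, "telephony": 100}


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def weekly_report(turns):
    """Return (per-step p95, outliers) for a list of per-turn timing dicts.

    Outliers are (turn_index, step, latency_ms) for any step timing that
    exceeds the stretch value for that step.
    """
    by_step = defaultdict(list)
    outliers = []
    for index, turn in enumerate(turns):
        for step, ms in turn.items():
            by_step[step].append(ms)
            if ms > STRETCH_MS[step]:
                outliers.append((index, step, ms))
    return {step: p95(vals) for step, vals in by_step.items()}, outliers


# Hypothetical timings for three turns.
turns = [
    {"asr": 170, "retrieval": 140, "llm": 340, "tts": 150, "telephony": 40},
    {"asr": 175, "retrieval": 420, "llm": 330, "tts": 145, "telephony": 45},
    {"asr": 160, "retrieval": 130, "llm": 350, "tts": 140, "telephony": 50},
]
per_step, outliers = weekly_report(turns)
print(per_step["retrieval"], outliers)
```

With only three turns the nearest-rank p95 is just the maximum; a real weekly sample would be far larger.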

Do this on Monday

Pull yesterday's p95 latency per step from your observability tool. If you cannot produce a per-step breakdown by lunch, that is the procurement gap to close before the next intent ships.

Key takeaways

Target 1.0 second p95 end-to-end turn latency under realistic load — not mean, not single-threaded.
Retrieval is the most common regression cause; LLM provider is the most common cause executives blame first.
Barge-in has a separate budget — under 250ms p95 from caller speech onset to AI audio stop.
Measure per-step p95 weekly, not just the end-to-end number.
Streaming end-to-end (ASR → LLM → TTS) is not optional at the 1.0-second target.

Frequently asked questions

Is 1.0 second p95 realistic?

Yes, but it requires streaming end-to-end (ASR final streaming into LLM, LLM streaming into TTS), disciplined retrieval design, and a carrier-grade telephony path. Most production deployments without those land at 1.4–1.8 seconds and read as slightly laggy.

Why is mean latency a misleading number?

Because callers experience tail latency, not mean. A deployment with mean 800ms and p95 2400ms feels broken on one call in twenty, and that is the call the executive hears about.

How much headroom should the budget leave?

Five to ten percent for unpredictable network and carrier variance. The stretch column in the table above is what a single bad call leg can sustain without breaking the conversation.
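
The mean-versus-tail answer above can be checked with a small made-up sample. The latencies below are invented purely so that the mean comes out at 800 ms while the nearest-rank p95 is 2400 ms; they are not measurements from any deployment.

```python
from statistics import mean


def p95_nearest_rank(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# 100 invented turn latencies in milliseconds: mostly fast, a few very slow.
latencies_ms = [700] * 90 + [650] * 4 + [2400] * 6

print(mean(latencies_ms))              # mean looks healthy
print(p95_nearest_rank(latencies_ms))  # the tail does not
```

A dashboard showing only the first number would report a healthy system while roughly one turn in twenty takes 2.4 seconds.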

Terms used in this guide

Voice AI latency: the gap before the system starts talking back.
Turn-taking latency: the awkward pause before the bot starts talking back.
Voice AI: software that answers the phone, understands what the caller wants, and takes action, not just a smarter IVR.
Barge-in: the ability for the caller to interrupt the bot without breaking the conversation.

Last reviewed: 2026-06-15. This guide is updated when production patterns shift; see the corrections page to flag anything that no longer matches reality.

About the author

Lewis Crook

Practitioner writer on enterprise voice AI

Lewis Crook has 20 years in enterprise technology, from FTSE 100 voice deployments to over a million AI-handled minutes a month across Asia-Pacific. Buyer, builder, and now working with CX leaders on enterprise voice AI. Writes The Voice AI Brief.

Field notes

Short, opinionated takes from practice that sit underneath this guide.

Latency budgets and why 800 ms breaks the illusion
Conversational realism collapses above roughly 800 ms of turn-taking latency. A note on where the milliseconds go and which optimisations actually move the dial.
