Note

Latency budgets and why 800 ms breaks the illusion

By Lewis Crook

The most common complaint about a voice AI agent after launch is not that it gets things wrong but that it feels off. Almost always, the cause is turn-taking latency that has crept past the threshold where conversational realism holds.

Roughly 800 milliseconds is the gap human speakers tolerate between turns without noticing. Past that, the listener registers a pause; past 1.5 seconds, the listener starts to fill the silence; past 2 seconds, the listener concludes the system has not understood and begins again. Each of those behaviours inflates handle time and the perceived error rate, even when the model is right.
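As a rough illustration, the thresholds above can be written down as a small lookup that labels a measured turn gap. This is only a sketch: the function name, the label strings, and the choice to treat each threshold as inclusive on the lower band are assumptions made for the example, not part of any measurement standard.

```python
def listener_reaction(gap_ms: float) -> str:
    """Label a turn-taking gap using the approximate thresholds in the text.

    The boundaries (800 ms, 1.5 s, 2 s) come from the article; how a gap
    exactly on a boundary is classified is an assumption of this sketch.
    """
    if gap_ms <= 800:
        return "unnoticed"
    if gap_ms <= 1500:
        return "registers a pause"
    if gap_ms <= 2000:
        return "fills the silence"
    return "assumes misunderstanding and starts again"


if __name__ == "__main__":
    for gap in (600, 1000, 1800, 2500):
        print(gap, "ms ->", listener_reaction(gap))
```

Tagging production turn logs with a label like this makes it easier to see what share of turns fall into each band, rather than looking only at an average.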

Most deployments spend the budget on integration calls placed on the critical path. A CRM tool call that takes 600 milliseconds consumes three quarters of an 800-millisecond budget on its own. Moving that call off the critical path (by speculating, caching, prefetching, or running it in parallel with TTS) is the single highest-leverage latency optimisation in most stacks.
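A minimal sketch of the "run it in parallel with TTS" option, using Python's asyncio. Both `crm_lookup` and `speak` are hypothetical stand-ins that only sleep, and the latencies are scaled down tenfold so the example runs quickly. The point is the shape of the control flow, not the numbers.

```python
import asyncio
import time

# Hypothetical latencies, scaled down 10x from the article's example.
CRM_LATENCY_S = 0.06  # stands in for a 600 ms CRM call
TTS_LATENCY_S = 0.05  # stands in for speaking a short acknowledgement


async def crm_lookup(customer_id: str) -> dict:
    """Stand-in for a slow integration call (not a real CRM client)."""
    await asyncio.sleep(CRM_LATENCY_S)
    return {"customer_id": customer_id, "tier": "gold"}


async def speak(text: str) -> None:
    """Stand-in for synthesising and playing an utterance."""
    await asyncio.sleep(TTS_LATENCY_S)


async def sequential_turn(customer_id: str) -> tuple[dict, float]:
    """CRM call sits on the critical path: nothing is spoken until it returns."""
    start = time.perf_counter()
    record = await crm_lookup(customer_id)
    await speak("Let me pull up your account.")
    return record, time.perf_counter() - start


async def overlapped_turn(customer_id: str) -> tuple[dict, float]:
    """CRM call runs concurrently with the acknowledgement being spoken."""
    start = time.perf_counter()
    record, _ = await asyncio.gather(
        crm_lookup(customer_id),
        speak("Let me pull up your account."),
    )
    return record, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_s = asyncio.run(sequential_turn("c-1"))
    _, ovl_s = asyncio.run(overlapped_turn("c-1"))
    print(f"sequential: {seq_s * 1000:.0f} ms, overlapped: {ovl_s * 1000:.0f} ms")
```

In the sequential version the turn costs roughly the sum of the two latencies; in the overlapped version it costs roughly the larger of the two, and the caller hears audio immediately instead of waiting for the lookup.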

Streaming end to end matters too. Piping streaming speech-to-text into a streaming language model into streaming text-to-speech compresses the budget because the system can start speaking before the model has finished generating. Non-streaming stacks rarely get under 1.2 seconds in production; streaming stacks routinely come in under 800 milliseconds.
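A back-of-envelope model shows why streaming compresses the budget. The sketch below compares time to first audio when TTS must wait for the full model reply against the case where it can start on the first few tokens. Every field name and every number here is a hypothetical placeholder chosen for illustration, not a measurement of any real stack.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    stt_final_ms: float        # STT finalisation after the caller stops talking
    llm_first_token_ms: float  # model latency to its first token
    llm_per_token_ms: float    # latency for each subsequent token
    reply_tokens: int          # tokens in the full reply
    first_chunk_tokens: int    # tokens TTS needs before it can start speaking
    tts_first_audio_ms: float  # TTS latency from receiving text to first audio


def non_streaming_first_audio_ms(t: StageTimings) -> float:
    """TTS waits for the whole reply before it starts."""
    llm_total = t.llm_first_token_ms + t.llm_per_token_ms * (t.reply_tokens - 1)
    return t.stt_final_ms + llm_total + t.tts_first_audio_ms


def streaming_first_audio_ms(t: StageTimings) -> float:
    """TTS starts as soon as the first chunk of tokens is available."""
    llm_chunk = t.llm_first_token_ms + t.llm_per_token_ms * (t.first_chunk_tokens - 1)
    return t.stt_final_ms + llm_chunk + t.tts_first_audio_ms


if __name__ == "__main__":
    # Illustrative values only.
    timings = StageTimings(
        stt_final_ms=150,
        llm_first_token_ms=250,
        llm_per_token_ms=20,
        reply_tokens=40,
        first_chunk_tokens=8,
        tts_first_audio_ms=150,
    )
    print("non-streaming:", non_streaming_first_audio_ms(timings), "ms")
    print("streaming:    ", streaming_first_audio_ms(timings), "ms")
```

With these placeholder values the non-streaming path comes to 1330 ms and the streaming path to 690 ms. The difference is entirely the tokens the model generates after the first chunk, which streaming moves off the path to first audio.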

Vendor-neutral note. See the editorial policy and the independence disclosure.