Voice AI and contact centre glossary

Short, single-topic definitions of the terms that show up most often in enterprise voice AI procurement, evaluation, and operations.

What is voice AI?
Voice AI is a class of conversational AI that handles spoken telephone interactions end-to-end. It combines speech-to-text, a language model, and text-to-speech with telephony and integration layers so it can listen, understand intent, take action against systems of record, and respond in natural speech.
What is containment rate?
Containment rate is the share of calls handled end-to-end by an automated system — usually voice AI or an IVR — without escalation to a human agent. Formally: contained calls divided by total in-scope calls, over a defined time window.
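The formula above can be sketched as a small helper. This is a minimal illustration; the function and parameter names are assumptions chosen for the example, not part of any platform's API.

```python
def containment_rate(contained_calls: int, in_scope_calls: int) -> float:
    """Contained calls divided by total in-scope calls for one time window."""
    if in_scope_calls <= 0:
        raise ValueError("in_scope_calls must be positive")
    if not 0 <= contained_calls <= in_scope_calls:
        raise ValueError("contained_calls must be between 0 and in_scope_calls")
    return contained_calls / in_scope_calls


# Illustrative figures only: 640 of 1,000 in-scope calls were contained.
rate = containment_rate(640, 1000)
```

Both counts need to come from the same time window and the same definition of "in scope", otherwise the ratio is not comparable between reports.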
What is agentic voice?
Agentic voice refers to voice AI systems that plan and execute multi-step actions across tools and systems of record during a single call, rather than answering single-turn questions. The defining property is autonomous tool use under conversational control.
What is autonomous resolution rate?
Autonomous resolution rate is the share of calls fully resolved by an AI system without human involvement and without the customer re-contacting for the same intent within a defined window (typically 7 days). It is a stricter alternative to containment rate.
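One way to compute this from call records is sketched below. The `Call` record and its fields are hypothetical, and the sketch assumes re-contact is matched on the same caller and the same intent label, which a real deployment would need to define for itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Call:
    caller_id: str
    intent: str
    started_at: datetime
    resolved_by_ai: bool  # True if the AI closed the call with no human involved


def autonomous_resolution_rate(calls, window=timedelta(days=7)) -> float:
    """Share of calls resolved by the AI with no same-intent re-contact in the window."""
    if not calls:
        raise ValueError("calls must not be empty")
    resolved = 0
    for call in calls:
        if not call.resolved_by_ai:
            continue
        # A later call from the same caller for the same intent, inside the
        # window, means the earlier call was not really resolved.
        recontacted = any(
            other.caller_id == call.caller_id
            and other.intent == call.intent
            and call.started_at < other.started_at <= call.started_at + window
            for other in calls
        )
        if not recontacted:
            resolved += 1
    return resolved / len(calls)
```

The nested scan is quadratic in the number of calls, which is fine for a sketch; a production report would typically group by caller and intent first.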
What is voice AI latency?
Voice AI latency is the end-to-end delay between the caller finishing speaking and the AI beginning to respond. It combines speech-to-text, language model inference, text-to-speech, and any integration calls on the critical path.
What is IVR replacement?
IVR replacement is the migration from a touch-tone or directed-dialogue IVR to a voice AI system that handles open-ended speech and can take action against systems of record. It is rarely a like-for-like swap — the new system absorbs flows the IVR used to route away.
What is barge-in?
Barge-in is the ability for a caller to interrupt a voice AI mid-utterance and have the system stop speaking, listen, and respond to the interruption naturally. Without barge-in, the agent has to finish every sentence before the caller can react, which collapses the perceived realism of the interaction.
What is turn-taking latency?
Turn-taking latency is the delay between the caller finishing speaking and the voice AI recognising the turn has ended and beginning to respond. It combines end-of-turn detection, speech-to-text finalisation, language model inference, and text-to-speech start time. It is the most-felt component of perceived voice AI quality.
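As a rough sketch, the four components can be added into a per-turn budget to see which one dominates. The component names and millisecond figures below are illustrative assumptions, not measurements from any system.

```python
def turn_taking_latency_ms(components: dict) -> tuple:
    """Return (total latency in ms, name of the largest component)."""
    if not components:
        raise ValueError("components must not be empty")
    total = sum(components.values())
    largest = max(components, key=components.get)
    return total, largest


# Hypothetical budget for a single turn, in milliseconds.
budget = {
    "end_of_turn_detection": 500,
    "stt_finalisation": 150,
    "llm_inference": 350,
    "tts_start": 120,
}
total_ms, largest = turn_taking_latency_ms(budget)
```

A simple sum assumes the stages run strictly one after another; pipelines that stream between stages overlap them, so the sum is an upper bound in that case.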
What is intent recognition?
Intent recognition is the process by which a voice AI identifies what the caller is trying to achieve, mapping open speech to a structured intent the system can act on. Modern LLM-driven voice AI often handles this with prompting rather than a separate classifier, but the function — turning ambiguous speech into a routable intent — is the same.
What is DTMF fallback?
DTMF (dual-tone multi-frequency) fallback is the design pattern of capturing sensitive input — card numbers, PINs, account numbers — through keypad tones rather than speech, so the voice AI, recording layer, and underlying model never hear the digits. It is the standard PCI-safe capture pattern for voice AI deployments.
What is voice biometrics?
Voice biometrics is the use of a caller's unique voice characteristics to verify identity. Modern implementations are usually passive — running in the background during the conversation — and combined with knowledge or device factors to meet step-up authentication requirements.
What is real-time transcription?
Real-time transcription is the streaming conversion of spoken audio to text with low enough latency that downstream systems — voice AI, agent assist, supervisor dashboards, compliance flags — can act on it during the call rather than after it. It is the input layer of every voice AI system.
What is deflection rate?
Deflection rate is the share of inbound contacts moved out of the live-agent queue into an automated or asynchronous channel — voice AI, SMS, chat, web self-service, or proactive notification. Gross deflection counts every deflected contact; net deflection subtracts contacts that returned within a defined window for the same intent.
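The gross and net variants differ only in the numerator, as the following sketch shows. The function name and the example figures are assumptions for illustration.

```python
def deflection_rates(total_inbound: int, deflected: int, returned_in_window: int) -> tuple:
    """Return (gross, net) deflection rate for one reporting window."""
    if total_inbound <= 0:
        raise ValueError("total_inbound must be positive")
    if not 0 <= returned_in_window <= deflected <= total_inbound:
        raise ValueError("expected 0 <= returned_in_window <= deflected <= total_inbound")
    gross = deflected / total_inbound
    # Net deflection removes contacts that came back for the same intent.
    net = (deflected - returned_in_window) / total_inbound
    return gross, net


# Illustrative figures: 300 of 1,000 contacts deflected, 45 of those returned.
gross, net = deflection_rates(1000, 300, 45)
```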
What is first call resolution?
First call resolution (FCR) is the share of customer contacts fully resolved in the first interaction without a follow-up for the same issue within a defined window — typically 7 days. It is the human-channel ancestor of autonomous resolution rate.
What is average handle time?
Average handle time (AHT) is the mean total time an agent spends per contact (talk time plus hold time plus after-call work), typically reported in seconds. It is the dominant productivity metric in voice contact centres and also the most frequently gamed.
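The definition translates directly into a mean over per-contact totals. A minimal sketch, assuming each contact is recorded as a (talk, hold, after-call work) tuple in seconds:

```python
from statistics import mean


def average_handle_time(contacts) -> float:
    """contacts: iterable of (talk_s, hold_s, acw_s) tuples, one per contact."""
    return mean(talk + hold + acw for talk, hold, acw in contacts)


# Two hypothetical contacts: 330 s and 270 s of total handle time.
aht_seconds = average_handle_time([(240, 30, 60), (180, 0, 90)])
```

Leaving any of the three parts out of the tuple (after-call work is the usual omission) makes the figure look better without changing the underlying workload.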
What is after-call work?
After-call work (ACW) is the time an agent spends completing a contact after the caller has disconnected — notes, case updates, transfers, follow-ups, and any required compliance logging. It is the most under-measured contributor to AHT.
What is escalation rate?
Escalation rate is the share of calls handled by an automated system that hand off to a human agent before resolution. It is the complement of containment rate: over the same set of in-scope calls, the two sum to 100%. Escalation reasons, captured per call, are the single richest input to a voice-AI operating model.
What is end-of-turn detection?
End-of-turn detection is the mechanism by which a voice AI decides the caller has finished speaking and it should begin to respond. It combines voice activity detection, semantic completion signals, and timing heuristics. It is usually the largest single contributor to turn-taking latency.
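The timing-heuristic part can be illustrated with a silence-threshold rule applied to per-frame voice activity flags. This is a deliberately simplified sketch: it ignores semantic completion signals, and the 20 ms frame size and 600 ms silence threshold are assumed values, not recommendations.

```python
def end_of_turn_index(vad_flags, frame_ms: int = 20, silence_ms: int = 600):
    """Return the frame index at which the turn is declared over, or None.

    vad_flags: one boolean per audio frame, True when speech is detected.
    The turn ends once speech has been heard and is followed by
    silence_ms of consecutive non-speech frames.
    """
    frames_needed = silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # a short pause does not end the turn
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# 100 ms of leading silence, 200 ms of speech, then 800 ms of silence.
flags = [False] * 5 + [True] * 10 + [False] * 40
idx = end_of_turn_index(flags)
```

In this rule the whole silence threshold is spent waiting before any response can start, which is one way to see why end-of-turn detection tends to dominate turn-taking latency.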
What is hallucination rate in voice AI?
Hallucination rate is the share of voice AI utterances that contain a confident statement unsupported by retrieved evidence or current system state — a confidently wrong answer. It is measured per turn or per call and is the most consequential safety metric in regulated deployments.
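The per-turn and per-call variants can give quite different numbers for the same data, as this sketch shows. It assumes each AI utterance has already been labelled as supported or unsupported by some review process, which is outside the scope of the example.

```python
def hallucination_rates(calls) -> tuple:
    """calls: list of calls, each a list of booleans, one per AI utterance.

    A flag is True when the utterance was labelled as unsupported.
    Returns (per-turn rate, per-call rate).
    """
    total_turns = sum(len(call) for call in calls)
    if total_turns == 0:
        raise ValueError("no utterances to evaluate")
    per_turn = sum(sum(call) for call in calls) / total_turns
    # A call counts as affected if any single utterance in it was flagged.
    per_call = sum(any(call) for call in calls) / len(calls)
    return per_turn, per_call


per_turn, per_call = hallucination_rates(
    [[False, False, True], [False, False], [True, True, False, False]]
)
```

Here three of nine utterances are flagged (one third per turn), but two of three calls are affected (two thirds per call), so a report should state which variant it uses.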
What is voice AI orchestration?
Voice AI orchestration is the layer that coordinates speech-to-text, language-model inference, text-to-speech, tool calls into systems of record, telephony events, and fallback paths into a single coherent call flow. It is the integration substrate that distinguishes a demo from a production-grade deployment.
What is SIP trunking?
SIP trunking is the delivery of voice calls between an enterprise and a telephony provider over an IP-based signalling protocol (Session Initiation Protocol). It is the substrate every voice AI deployment rides on, and the layer at which residency, recording, and DTMF capture decisions are made.
What is voice cloning?
Voice cloning is the synthesis of a custom voice — typically based on samples from a brand actor, voice talent, or executive — for use as the voice AI's text-to-speech output. It combines voice identity (timbre, accent) with prosody control (pace, intonation, emotion).
What is prompt injection in voice AI?
Prompt injection in voice AI is a spoken or transcribed attempt to override the agent's instructions, exfiltrate data, or escalate privilege through manipulated dialogue. It is the voice-channel equivalent of the text prompt-injection attack surface and is harder to detect because audio carries fewer attacker fingerprints.
What is conversational design?
Conversational design is the discipline of shaping voice and chat AI dialogue — turn structure, persona, error recovery, confirmation patterns, escalation language — so the system produces measurable CX outcomes rather than merely accurate responses. It sits between product design, linguistics, and CX operations.
What are LLM guardrails?
LLM guardrails are the policy and runtime controls that constrain what a language model can say, do, and disclose during a conversation. They include topic restrictions, refusal patterns, tool-call scoping, output validators, and the safety layer that catches violations before they reach the caller.
What is automated-system disclosure?
Automated-system disclosure is the obligation to tell a caller they are interacting with an automated system rather than a human. It is required or expected in most major regulatory regimes and is distinct from recording consent. Conflating the two is the most common audit finding in early voice-AI deployments.
What is voice AI evaluation?
Voice AI evaluation is the structured process of comparing voice AI platforms or deployments against measurable production criteria — integration depth, latency, observability, operating-model fit, safety, control surface, voice quality, telephony reach, and commercial model — rather than demo quality.
What is voice AI ROI?
Voice AI ROI is the measured return on a voice AI programme, expressed defensibly as cost per resolved call against the pre-AI baseline. It includes the operating-model cost (conversation owner, platform owner, observability tooling) and subtracts re-contact within a defined window for the same intent.
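A minimal sketch of the cost-per-resolved-call calculation follows. The split into a platform cost and an operating-model cost, and all figures, are assumptions for illustration; a real model would itemise costs according to the programme's own accounts.

```python
def cost_per_resolved_call(
    platform_cost: float,
    operating_model_cost: float,
    resolved_calls: int,
    recontacts_in_window: int,
) -> float:
    """Total programme cost divided by calls that stayed resolved."""
    # Calls that came back for the same intent are not counted as resolved.
    net_resolved = resolved_calls - recontacts_in_window
    if net_resolved <= 0:
        raise ValueError("no net resolved calls in this period")
    return (platform_cost + operating_model_cost) / net_resolved


# Hypothetical month: 60,000 total cost, 13,000 resolved calls, 1,000 re-contacts.
ai_cost = cost_per_resolved_call(40_000, 20_000, 13_000, 1_000)
baseline_cost = 8.0  # assumed pre-AI cost per resolved call
saving_per_call = baseline_cost - ai_cost
```

The comparison is only defensible when the baseline uses the same definition of "resolved", including the same re-contact window.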