# Lewis Crook

> Practitioner-grade analysis of enterprise voice AI and conversational AI for contact centres — from someone who has been the buyer, builder, and in the room.

## Independence and affiliation

I work as Lead Solutions Engineer, UK & Ireland at Parloa. This site is my own — Parloa does not sponsor, fund, commission, or review anything published here.

The frameworks and evaluation criteria are vendor-neutral by design. No vendor (including Parloa) is named, ranked, recommended, or criticised in any guide, glossary entry, framework, or note. Views are my own.

If you spot anything that reads as vendor-favourable or vendor-hostile, that is a content bug — please flag it via the corrections page.

---

# Pillar guides

## Voice AI vs legacy IVR: the honest unit economics

URL: /guides/voice-ai-vs-ivr-unit-economics
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Voice AI is cheaper than a live agent and more expensive than a touch-tone IVR. The real question is not cost per call but cost per resolved call — and most vendor ROI models quietly assume containment rates that production deployments do not hit.

### What does a voice AI call actually cost?

A voice AI call carries four cost lines: speech-to-text, large language model inference, text-to-speech, and telephony. An IVR carries only the last of these. As of 2026, fully loaded per-minute costs for an enterprise-grade stack typically sit in the range of a few cents to low double-digit cents per minute, depending on model choice and how aggressively prompts and retrieval are cached.

A legacy IVR, by contrast, has effectively zero variable cost per call once licensed — its expense is the maintenance burden on the team that owns the call flows.

On a like-for-like minute basis, voice AI is therefore more expensive than IVR and roughly an order of magnitude cheaper than a live agent. That is the easy part of the comparison.

### Why cost per call is the wrong unit

The number that actually matters is cost per resolved call — total platform and telephony spend divided by the number of calls the system handled end-to-end without escalating to a human.

This reframes the comparison. A voice AI deployment with a 25% containment rate at $0.30 per call has a cost per resolved call of $1.20 ($0.30 ÷ 0.25). The same deployment at 55% containment has a cost per resolved call of about $0.55 ($0.30 ÷ 0.55). Vendor ROI models almost always quote the second number, while production deployments often deliver something closer to the first.
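The arithmetic is simple enough to keep in a spreadsheet, but a minimal sketch makes the sensitivity to containment explicit. The function name is illustrative, and the sketch assumes every call costs the same whether or not it is contained.

```python
def cost_per_resolved_call(cost_per_call: float, containment_rate: float) -> float:
    """Spend per call divided by the share of calls resolved without escalation."""
    if not 0 < containment_rate <= 1:
        raise ValueError("containment_rate must be in (0, 1]")
    return cost_per_call / containment_rate


# The two scenarios from the text: $0.30 per call at 25% and 55% containment.
for rate in (0.25, 0.55):
    cost = cost_per_resolved_call(0.30, rate)
    print(f"{rate:.0%} containment: ${cost:.2f} per resolved call")
```

Running the same function across your own measured range is a quick way to see how far a vendor headline can drift from production.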

### What does a real-world unit-economics model look like?

A defensible model has at least four inputs that vendor decks usually compress into one: measured containment rate, average handle time on contained calls, average handle time on escalated calls (which is often longer than a non-AI call because the customer has already explained the problem once), and the fully-loaded cost of the human agent who picks up the escalation.

- Measured containment rate from a representative call sample, not a curated demo set
- Average handle time on contained vs escalated calls
- Re-contact rate — calls that come back within 7 days for the same issue
- Fully loaded agent cost including supervision, QA, and attrition
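A minimal sketch of how these inputs combine, assuming AI minutes are billed on every call and that an escalated call incurs both AI minutes before hand-off and agent minutes after it. The class and field names are hypothetical, and every number you feed it should come from your own measurement.

```python
from dataclasses import dataclass


@dataclass
class UnitEconomics:
    containment_rate: float         # measured on a representative call sample
    recontact_rate: float           # share of contained calls returning within 7 days
    ai_cost_per_min: float          # fully loaded platform and telephony cost
    agent_cost_per_min: float       # fully loaded, incl. supervision, QA, attrition
    aht_contained_min: float        # AI minutes on a contained call
    aht_ai_before_handoff_min: float  # AI minutes before an escalation
    aht_escalated_agent_min: float  # agent minutes on an escalated call

    def net_resolution_rate(self) -> float:
        """Containment with 7-day re-contact subtracted."""
        return self.containment_rate * (1 - self.recontact_rate)

    def cost_per_inbound_call(self) -> float:
        """Blended AI plus agent cost across contained and escalated calls."""
        contained = self.containment_rate * self.aht_contained_min * self.ai_cost_per_min
        escalated = (1 - self.containment_rate) * (
            self.aht_ai_before_handoff_min * self.ai_cost_per_min
            + self.aht_escalated_agent_min * self.agent_cost_per_min
        )
        return contained + escalated

    def ai_cost_per_resolved_call(self) -> float:
        """AI spend on all calls divided by calls that stayed resolved."""
        ai_minutes = (
            self.containment_rate * self.aht_contained_min
            + (1 - self.containment_rate) * self.aht_ai_before_handoff_min
        )
        return ai_minutes * self.ai_cost_per_min / self.net_resolution_rate()
```

Keeping the escalated handle time as its own field is deliberate: it is the input most often folded into a single blended figure.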

### Where the real ROI usually shows up

In most enterprise call centre automation programmes, the largest economic lever is not labour replacement on the contained portion — it is reducing average handle time and re-contact rate on the calls that still escalate. A voice AI that captures intent, identity, and verification before transfer can take 30–90 seconds off a human-handled call. At enterprise volumes, that line item often exceeds the savings from containment.
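To size that line item, a rough sketch is enough. The call volume and hourly agent cost below are hypothetical placeholders, not benchmarks; only the 30 and 90 second figures come from the range above.

```python
def annual_aht_saving(escalated_calls_per_year: int,
                      seconds_saved_per_call: float,
                      agent_cost_per_hour: float) -> float:
    """Agent cost avoided by shortening calls that still reach a human."""
    hours_saved = escalated_calls_per_year * seconds_saved_per_call / 3600
    return hours_saved * agent_cost_per_hour


# Hypothetical inputs: 1,000,000 escalated calls a year, agent cost of 36 per hour.
for seconds in (30, 90):
    saving = annual_aht_saving(1_000_000, seconds, 36.0)
    print(f"{seconds}s saved per escalated call: {saving:,.0f} per year")
```

Compare the result with the labour saving you model on contained calls before deciding which lever to lead the business case with.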

### Note on terminology

This piece uses UK spelling ("call centre automation"). US readers will know the same category as call center automation; the unit economics do not change.

### Key takeaways

- Voice AI is roughly an order of magnitude cheaper per call than a live agent, and more expensive per minute than an IVR.
- The right unit is cost per resolved call, not cost per call or per minute.
- Vendor ROI models quietly assume containment rates production rarely hits — model your own measured rate.
- The largest economic lever is usually AHT reduction on calls that still escalate, not labour replacement on contained calls.
- Subtract 7-day re-contact for the same intent — a contained call that returns has not been resolved.

### FAQs

**Is voice AI cheaper than a live agent?**

On a per-call basis, yes — fully loaded voice AI costs are roughly an order of magnitude lower than a live human agent. The relevant comparison, however, is cost per resolved call, which depends on the measured containment rate of the specific deployment.

**Is voice AI cheaper than an IVR?**

No. A legacy IVR has effectively zero variable cost per call. Voice AI adds speech-to-text, LLM, and text-to-speech costs that an IVR does not carry. Voice AI wins on resolution rate and customer experience, not on raw per-minute cost.

**What is a realistic containment rate to model?**

Containment varies sharply by use case. Account-balance and order-status calls regularly reach 60–80% containment; complex billing or claims calls more often sit in the 15–35% band in production. Model your specific intent mix, not a blended vendor average.

**What is usually missing from vendor ROI models?**

Three things: measured rather than projected containment, the handle-time penalty on escalated calls when the customer has to re-explain, and re-contact rate within 7 days. All three move the cost per resolved call materially.

---

## How to evaluate enterprise voice AI platforms: a vendor-neutral framework

URL: /guides/how-to-evaluate-voice-ai-platforms
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A defensible enterprise voice AI evaluation rates nine dimensions, not three. Most procurement decisions go wrong by over-weighting demo quality and under-weighting integration depth, observability, and the operating model required to keep the agent useful after launch.

### The nine evaluation dimensions

These are the dimensions that consistently predict whether a deployment survives its first year of production. They apply equally to call centre (US: call center) voice bot use cases and to broader conversational AI deployments.

- Integration depth — read/write access to systems of record, not just a webhook surface
- Latency — first-token and end-to-end, measured under realistic load
- Control surface — how prompts, flows, and guardrails are authored and versioned
- Operating model fit — who maintains it, with what tooling, on what cadence
- Observability — call-level transcripts, intent labels, escalation reasons, drift signals
- Safety and compliance — PII handling, recording, jurisdictional residency
- Voice quality — naturalness, barge-in handling, interruption recovery
- Telephony and channel reach — SIP, contact centre platform integrations, omnichannel
- Commercial model — per-minute, per-resolution, or platform, and what it does at 10x volume

### What to actually test in a proof of value

Three tests separate platforms more reliably than any feature checklist: a representative call sample replayed end-to-end, an integration test against the systems of record the deployment will actually use, and a maintenance simulation in which a non-engineer attempts to change a flow and verify the change in production.

Demo calls curated by the vendor are useful only as a baseline; they do not predict production behaviour on your call mix.

### Common scoring mistakes

Three patterns recur in enterprise procurement. First, weighting voice quality at 30%+ when in production the difference between platforms on that axis is small and rapidly narrowing. Second, scoring "integrations" by counting logos rather than measuring read/write depth. Third, deferring the operating-model question to implementation, by which point the choice is locked in.

### A note for UK and EU buyers

UK and EU contact centre buyers should add data residency, recording consent, and DPIA support to the scoring rubric as gate criteria rather than weighted dimensions. A platform that fails residency is not a lower-scoring option — it is out of consideration.
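The difference between a gate and a weighted dimension is easy to lose in a spreadsheet, so here is a minimal sketch of a rubric that keeps them apart. The weights are hypothetical and the function name is illustrative; the gate and dimension names follow the text.

```python
from __future__ import annotations

GATES = ("data_residency", "recording_consent", "dpia_support")

# Hypothetical weights for the nine dimensions; set your own.
WEIGHTS = {
    "integration_depth": 0.20,
    "latency": 0.10,
    "control_surface": 0.10,
    "operating_model_fit": 0.15,
    "observability": 0.15,
    "safety_compliance": 0.10,
    "voice_quality": 0.05,
    "telephony_reach": 0.05,
    "commercial_model": 0.10,
}


def score_platform(gate_results: dict[str, bool],
                   scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float | None:
    """Return the weighted score, or None if any gate criterion fails."""
    if not all(gate_results.get(gate, False) for gate in GATES):
        return None  # out of consideration, not a low score
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight
```

Returning `None` rather than zero keeps a failed platform out of any ranking instead of placing it last.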

### Key takeaways

- Score across nine dimensions, not three — integration depth, latency, control surface, operating-model fit, observability, safety, voice quality, telephony reach, and commercial model.
- Three tests separate platforms reliably: representative call replay, integration test against real systems of record, and a non-engineer change simulation.
- Demo quality is the most over-weighted axis in enterprise procurement.
- Integration depth, measured by read/write capability rather than logo count, is among the most under-weighted axes.
- Defer the operating-model question and the choice gets made for you by week six of implementation.

### FAQs

**How long should a voice AI proof of value take?**

Seven to ten weeks is typical for a defensible evaluation: two weeks to build the call sample and integration test, four to six weeks of running, and one to two weeks to analyse results against measured baselines.

**What is the most under-weighted evaluation dimension?**

Observability. Platforms vary widely in what you can see at call level after launch — intent labels, escalation reasons, drift signals — and that visibility is what allows the operating-model team to improve the agent over time.

**Should we evaluate against an internal build option?**

Usually yes, at least as a reference. Even when an internal build is not the chosen path, costing it out clarifies which parts of the vendor offering are genuinely difficult to replicate and which are convenience.

**How do we compare per-minute and per-resolution pricing?**

Convert both to cost per resolved call using your modelled containment rate, then stress-test at 0.5x and 2x that rate. Per-resolution pricing transfers containment risk to the vendor, which often makes it the more defensible choice for early deployments.
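A minimal sketch of that stress test, under two assumptions that you should check against the actual contract: per-minute pricing bills every call minute whether or not the call is contained, and per-resolution pricing charges a flat fee only on resolved calls. Function names and all prices are hypothetical.

```python
def per_minute_cost_per_resolved(price_per_min: float,
                                 avg_minutes_per_call: float,
                                 containment_rate: float) -> float:
    """Per-minute spend on all calls, spread over the calls that were resolved."""
    return price_per_min * avg_minutes_per_call / containment_rate


def stress_test(price_per_min: float,
                avg_minutes_per_call: float,
                price_per_resolution: float,
                modelled_rate: float) -> list[tuple[float, float, float]]:
    """Rows of (containment rate, per-minute cost, per-resolution cost) per resolved call."""
    rows = []
    for multiplier in (0.5, 1.0, 2.0):
        rate = min(modelled_rate * multiplier, 1.0)  # containment cannot exceed 100%
        per_minute = per_minute_cost_per_resolved(price_per_min, avg_minutes_per_call, rate)
        rows.append((rate, per_minute, price_per_resolution))
    return rows


# Hypothetical prices: 0.10 per minute, 3-minute calls, 1.00 per resolution, 40% modelled.
for rate, per_minute, per_resolution in stress_test(0.10, 3.0, 1.00, 0.40):
    print(f"containment {rate:.0%}: per-minute {per_minute:.2f}, per-resolution {per_resolution:.2f}")
```

Under these assumptions the per-resolution column stays flat while the per-minute column moves with containment, which is the risk transfer described above.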

---

## Voice AI containment rate: what's real vs what vendors claim

URL: /guides/voice-ai-containment-rate-reality
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Containment rate is the single most-quoted and least-defined number in enterprise voice AI. Vendor figures of 70%+ are not wrong, but they usually measure a narrower denominator than the one a CX leader cares about.

### What containment rate measures

Containment rate is the share of calls handled end-to-end by the automated system without escalation to a human agent. It is sometimes called deflection rate, automation rate, or self-service rate; the term varies by vendor and platform.

The number depends entirely on two definitions: the numerator (what counts as "handled") and the denominator (which calls are in scope). Two deployments quoting 65% containment can differ by 30 percentage points on a like-for-like comparison.

### Where the definitions diverge

The most common adjustments that inflate a reported containment rate are: excluding calls where the caller hung up in the first 10 seconds, excluding calls routed to a human at the IVR before the AI was offered, excluding out-of-hours calls, and counting any call that did not transfer as "contained" even if the customer called back the next day.

- Numerator: does "handled" require the customer's stated intent to be resolved, or only that no transfer occurred?
- Denominator: are abandoned, out-of-hours, and pre-routed calls included?
- Time window: is re-contact within 7 or 14 days deducted from the numerator?
- Intent scope: is containment measured across all calls or only intents the AI is configured to handle?

### A defensible measurement

A measurement defensible to a finance team and a regulator typically includes all inbound calls in scope, requires evidence of resolution (a fulfilled action or an explicit confirmation), and subtracts 7-day re-contact for the same intent. Numbers calculated this way are usually 15–30 percentage points lower than the vendor headline.
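A minimal sketch of the two calculations side by side, assuming each call record already carries the three flags below. The record shape and field names are hypothetical; most platforms will need a join across call logs and CRM data to produce them.

```python
from dataclasses import dataclass


@dataclass
class Call:
    escalated: bool             # transferred to a human agent
    resolution_evidence: bool   # fulfilled action or explicit confirmation
    recontact_within_7d: bool   # same intent came back within 7 days


def headline_containment(calls: list[Call]) -> float:
    """Counts any call that did not transfer as contained."""
    return sum(not c.escalated for c in calls) / len(calls)


def defensible_containment(calls: list[Call]) -> float:
    """Requires resolution evidence and subtracts 7-day re-contact."""
    resolved = sum(
        (not c.escalated) and c.resolution_evidence and not c.recontact_within_7d
        for c in calls
    )
    return resolved / len(calls)
```

Both functions use the same denominator: every inbound call in scope. A vendor headline may also narrow the denominator, which widens the gap further.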

### What a healthy production range looks like

Across the deployments I have seen, defensible containment on a representative call mix sits in three bands. Transactional intents (balance, status, simple changes) routinely reach 60–80%. Mixed intents (billing questions, account changes) tend to land at 30–50%. Complex intents (claims, disputes, retention) typically remain under 30% unless heavily redesigned. Blended figures depend on the intent mix.
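Because the blended figure is a volume-weighted average, it can be sketched in a few lines. The volume shares below are hypothetical, and the per-band rates are simply points inside the ranges above, not measurements.

```python
def blended_containment(mix: dict[str, tuple[float, float]]) -> float:
    """mix maps intent band to (share of call volume, containment rate)."""
    total_share = sum(share for share, _ in mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("volume shares must sum to 1")
    return sum(share * rate for share, rate in mix.values())


# Hypothetical intent mix.
mix = {
    "transactional": (0.5, 0.70),
    "mixed": (0.3, 0.40),
    "complex": (0.2, 0.20),
}
print(f"blended containment: {blended_containment(mix):.0%}")
```

Shifting volume between bands moves the blended number without any change in per-intent performance, which is why blended figures from different deployments are not comparable.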

### Key takeaways

- Containment rate is the most-cited and most loosely-defined metric in voice AI procurement.
- Vendor headlines and defensibly measured rates typically differ by 15–30 percentage points.
- A defensible measurement requires evidence of resolution and subtracts 7-day re-contact for the same intent.
- Transactional intents commonly reach 60–80%; mixed intents tend to land at 30–50%, and complex intents typically stay under 30%.
- Compare against your own baseline, not a blended vendor average.

### FAQs

**What is a good voice AI containment rate?**

There is no single good number — it depends entirely on intent mix. A blended 35% on a complex enterprise call mix can be a stronger result than a blended 65% on a transactional mix. Compare against your own baseline, not a vendor average.

**How is containment rate different from deflection rate?**

The terms are used interchangeably by most vendors. Where they differ, deflection usually refers to calls prevented from reaching the queue at all, and containment to calls that entered the AI flow and were not escalated. Always check the specific definition in a vendor proposal.

**Should re-contact be subtracted from containment?**

Yes, for any internal measurement. A call that is "contained" today but returns tomorrow for the same intent has not been resolved; counting it as containment overstates the system's effectiveness.

**Why are vendor containment rates often higher than measured?**

Most vendor figures exclude short abandons, out-of-hours, and out-of-scope intents, and do not subtract re-contact. Each adjustment is individually defensible; together they typically lift the reported figure by 15–30 percentage points.

---

## Why enterprise voice AI pilots fail to reach production

URL: /guides/why-enterprise-voice-ai-pilots-fail
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Most enterprise voice AI pilots that stall do so for the same five reasons, and none of them are model quality. They are integration depth, operating model, measurement, scope creep, and stakeholder alignment.

### Reason 1 — integration depth was treated as a phase-two problem

A pilot scoped against a read-only API surface produces a demo that cannot be productionised. Genuine resolution requires write access into systems of record (CRM, billing, claims), and that integration work is the longest pole in the tent. Pilots that defer it almost always discover, at the production gate, that the platform cannot do what the business case assumed.

### Reason 2 — no one owns the agent after go-live

Voice AI is not a deploy-and-forget system. It needs weekly attention: reviewing failed calls, updating intents, adjusting guardrails. Pilots that did not name an operating-model owner before launch tend to drift in the first quarter and lose stakeholder confidence before the metrics can recover.

### Reason 3 — measurement was negotiated late

If success criteria are agreed after the pilot has started, the pilot will be judged against whichever metric currently looks worst. Agreeing on containment definition, baseline, and a single primary metric before launch is the cheapest insurance available.

### Reason 4 — scope expanded during the pilot

A pilot that started with three intents and ended with eleven has not been evaluated; it has been redesigned. Lock the intent list at the start and capture additions as backlog for a phase-two scope.

### Reason 5 — the contact-centre operations team was a spectator

Pilots championed by transformation or innovation teams without the contact-centre operations team as an equal partner consistently struggle at the production handover. The team that will live with the agent must own it from week one.

### Key takeaways

- Pilots stall for five repeating reasons, and none are model quality.
- Integration depth treated as phase-two is the single most common failure.
- No named operating-model owner before launch is the second.
- Success criteria negotiated mid-pilot guarantee an indecisive result.
- Pilots without contact-centre operations as an equal partner rarely survive handover.

### FAQs

**What is the most common reason a voice AI pilot stalls?**

Integration depth — the gap between what the platform can read from the systems of record during a demo and what it needs to write into them to actually resolve a call.

**How long should an enterprise voice AI pilot run?**

Eight to twelve weeks in production traffic, with a defined go/no-go decision at the end. Open-ended pilots almost always become permanent pilots.

**Who should own a voice AI pilot internally?**

The contact-centre operations function, with transformation or AI as a co-sponsor. Pilots owned exclusively by transformation rarely survive handover.

**Should success criteria be agreed before or during the pilot?**

Before. Containment definition, baseline, primary metric, and decision rule should all be written down before launch. Negotiating them mid-pilot is the most common path to an indecisive result.

---

## Who maintains a voice AI after go-live? The operating-model question

URL: /guides/voice-ai-operating-model
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A voice AI without a named owner, a weekly review cadence, and a non-engineer change path will degrade within a quarter. The operating model is not a phase-two consideration; it is what determines whether the launch holds.

### The three roles that need to exist

Most successful enterprise deployments converge on the same three roles, regardless of whether they sit in a single team or are split across operations, engineering, and CX.

- Conversation owner — reviews failed calls, owns intent and prompt changes, typically from CX or operations
- Platform owner — owns integrations, model selection, observability, typically from engineering
- Business owner — owns the metric, the roadmap, and the escalation path, typically from contact-centre leadership

### What weekly looks like

A workable weekly cadence reviews a sample of escalated calls, tags failure modes, updates intents or guardrails for the top two failure modes, and ships those changes through a controlled release path. An hour of disciplined review per week prevents most drift; skipping it for a quarter is what creates the "the AI got worse" perception.
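The tagging step does not need special tooling. A minimal sketch, assuming the reviewer records one failure-mode tag per sampled call; the call IDs and tag names are hypothetical.

```python
from collections import Counter


def top_failure_modes(tagged_calls: list[tuple[str, str]], n: int = 2) -> list[tuple[str, int]]:
    """tagged_calls holds (call_id, failure_mode) pairs; returns the n most frequent modes."""
    counts = Counter(mode for _, mode in tagged_calls)
    return counts.most_common(n)


# Hypothetical sample from one weekly review.
sample = [
    ("c1", "auth_failed"),
    ("c2", "intent_missed"),
    ("c3", "auth_failed"),
    ("c4", "integration_timeout"),
    ("c5", "auth_failed"),
    ("c6", "intent_missed"),
]
print(top_failure_modes(sample))
```

Limiting each week's changes to the top two modes keeps the release path small enough to verify.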

### The non-engineer change path

If every prompt change requires an engineering ticket, the operating model collapses under its own latency. Platforms differ widely in what a conversation owner can change without code; this is one of the highest-leverage axes in vendor evaluation and one of the most under-rated.

### Key takeaways

- A voice AI without a named owner, a weekly review cadence, and a non-engineer change path will degrade within a quarter.
- Three roles need to exist — conversation owner, platform owner, business owner.
- Plan for 0.5–1.0 FTE on the conversation-owner role per high-volume deployment.
- If every prompt change needs an engineering ticket, the operating model collapses under its own latency.
- Skipping the weekly review for a quarter is what creates the 'the AI got worse' perception.

### FAQs

**How much ongoing effort does a production voice AI need?**

Plan for 0.5–1.0 FTE on the conversation-owner role for a single high-volume deployment, plus engineering on-call for integrations. Skipping this is the most common cause of post-launch performance decay.

**Should the operating model sit in IT or in CX operations?**

Conversation ownership belongs in CX or operations; platform ownership belongs in engineering. Putting both in IT tends to slow the conversation loop; putting both in CX tends to leave platform health unattended.

**What is the most common operating-model failure?**

No weekly review cadence. Without it, failure modes accumulate, stakeholder confidence erodes, and the next budget cycle is spent defending the deployment instead of extending it.

---

## Enterprise voice AI integration depth: a real evaluation checklist

URL: /guides/voice-ai-integration-depth-checklist
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Integration depth is the single biggest predictor of whether a voice AI can actually resolve calls. "We integrate with Salesforce" is not a meaningful claim until you can see what the platform reads, what it writes, and how it handles failure.

### Read vs write — the distinction that matters

Most platforms can read from a CRM. Far fewer can write to one safely, with the right authentication, idempotency, and audit trail. Resolution — actually doing the thing the customer called about — requires write access. A read-only deployment can answer questions; it cannot fix problems.

### The integration checklist

- Identification and authentication — can the platform verify the caller against your identity provider, not just match a phone number?
- Read access — to customer record, account state, recent interactions, scheduled events
- Write access — create cases, update preferences, schedule callbacks, process payments where in scope
- Idempotency — can a retried call avoid double-charging or double-booking?
- Failure handling — when an integration call fails, does the agent degrade gracefully or hallucinate?
- Observability — is every integration call logged with request, response, latency, and outcome?
- Compliance — PCI, HIPAA, GDPR/UK GDPR handling on every read and write path
- Latency budget — integrations on the critical path should resolve within the conversational latency budget
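The idempotency item is the one most often waved through, so it is worth seeing what it asks for. This is a minimal sketch using an in-memory stand-in for a system of record; the class, method, and key format are hypothetical and do not represent any vendor's API.

```python
class CallbackScheduler:
    """In-memory stand-in for a booking system (hypothetical)."""

    def __init__(self) -> None:
        self._results_by_key: dict[str, dict] = {}
        self.bookings: list[dict] = []

    def schedule_callback(self, idempotency_key: str, customer_id: str, slot: str) -> dict:
        # A retry with the same key returns the original result
        # instead of creating a second booking.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        booking = {"id": len(self.bookings) + 1, "customer_id": customer_id, "slot": slot}
        self.bookings.append(booking)
        self._results_by_key[idempotency_key] = booking
        return booking


scheduler = CallbackScheduler()
first = scheduler.schedule_callback("call-123:schedule", "cust-9", "2026-07-01T10:00")
retry = scheduler.schedule_callback("call-123:schedule", "cust-9", "2026-07-01T10:00")
print(first == retry, len(scheduler.bookings))
```

In an evaluation, ask where the equivalent of the key is generated and stored: if it lives only in the voice platform's memory, a retried call after a dropped connection may still double-book.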

### The questions that catch over-claims

Three questions usually expose the gap between marketing and capability: show me a call where the AI created a record in our system of record; show me what happens when that write fails; show me how a conversation owner inspects that failure the next morning. If the answers require an engineer, the operating model will not scale.

### Key takeaways

- Read access is common; write access — what actually resolves calls — is much rarer.
- Identity and authentication is the most common integration gap, limiting deployments to low-risk intents.
- Idempotency and failure handling decide whether a retried call double-charges or hallucinates.
- Every integration call on the critical path eats into a sub-1.5-second latency budget.
- If inspecting a failed integration requires an engineer, the operating model will not scale.

### FAQs

**Why does write access matter so much for voice AI?**

Because resolution requires action, not just answers. A platform that can read a customer's balance but not process their payment, schedule their callback, or update their preference is offering self-service lookup, not call resolution.

**What is the most common integration gap in enterprise voice AI?**

Identity and authentication. Many platforms can match a phone number but cannot perform strong customer authentication against the enterprise identity provider — which limits the deployment to low-risk intents.

**How important is integration latency?**

Critical. Each integration call on the critical path eats into the conversational latency budget — typically under 1.5 seconds end-to-end before perceived quality drops. Slow integrations force either a degraded experience or a narrower scope.

---

## Call deflection with AI: where it works and where it backfires

URL: /guides/call-deflection-ai
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Call deflection works when the deflected channel can actually resolve the intent. When it can't, deflection just relocates the call — often into a more expensive channel a day later.

### What call deflection actually means

Call deflection covers anything that moves a call out of the live-agent queue: AI voice containment, SMS deflection, chat hand-off, IVR self-service. The economic case for each rests on the same question — does the deflected channel resolve the intent, or does it postpone it?

### Which intents respond well to deflection

Transactional intents with a clear success state — balance enquiry, order status, appointment confirmation, simple changes — deflect cleanly because resolution is unambiguous and the customer can verify it themselves.

Intents with emotional or financial weight, or with multi-step exceptions, often deflect on the headline metric but generate re-contact within days. Net deflection on those intents is frequently negative when properly measured.

### The re-contact test

The honest measure of deflection is net deflection rate: deflected calls minus calls that returned within a defined window (typically 7 days) for the same intent. Gross deflection flatters; net deflection sometimes flips the sign of the business case entirely.
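Expressed as rates over total inbound calls, the two measures can be sketched as follows. The function names and the volumes in the example are hypothetical.

```python
def gross_deflection_rate(total_calls: int, deflected: int) -> float:
    """Share of calls moved out of the live-agent queue."""
    return deflected / total_calls


def net_deflection_rate(total_calls: int, deflected: int, recontacted_in_window: int) -> float:
    """Deflected calls minus same-intent re-contacts inside the window, over total calls."""
    return (deflected - recontacted_in_window) / total_calls


# Hypothetical month: 1,000 calls, 400 deflected, 120 of those returned within 7 days.
gross = gross_deflection_rate(1000, 400)
net = net_deflection_rate(1000, 400, 120)
print(f"gross {gross:.0%}, net {net:.0%}")
```

Tracking both per intent, rather than in aggregate, shows which intents are being relocated rather than resolved.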

### Where deflection backfires

Two patterns recur. First, deflecting complex intents into a channel that cannot resolve them produces a worse experience and a more expensive eventual resolution. Second, deflecting customers who escalated specifically because they wanted a human creates measurable churn risk in high-value segments.

### Key takeaways

- Deflection works when the deflected channel can actually resolve the intent — otherwise it just relocates the call.
- Net deflection rate (deflected minus 7-day re-contact) often runs 20–40% below gross.
- Transactional intents with a clear success state deflect cleanly; emotional or multi-step intents often deflect on headline but generate re-contact.
- Deflecting calls that customers escalated specifically to reach a human creates measurable churn risk in high-value segments.
- Always measure net, not gross — finance will.

### FAQs

**Is call deflection the same as containment?**

Closely related but not identical. Containment usually refers to calls that entered the AI flow and were not escalated; deflection often includes calls prevented from reaching the queue at all (via SMS, chat, or proactive outreach).

**What is net deflection rate?**

Calls deflected minus calls that returned within a defined window for the same intent. It is the only deflection number that survives finance review, and it is often 20–40% lower than gross deflection.

**When should call deflection not be used?**

For intents the deflected channel cannot fully resolve, for high-value segments that have explicitly requested human handling, and for any intent where a re-contact carries regulatory or churn risk.

---

## Customer service automation: an honest guide for enterprise CX leaders

URL: /guides/customer-service-automation
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Customer service automation is mature where intents are transactional and immature where intents are emotional. The hardest part of the strategy is deciding what not to automate.

### Where automation is mature in 2026

Identity verification, balance and status enquiries, simple updates, appointment management, password and access flows, and order tracking are routinely automated to high containment in production today. These intents share three properties: the success state is unambiguous, the data is in a single system of record, and the customer's emotional load is low.

### Where automation is still hard

Disputes, claims, complex billing, retention conversations, and any intent involving regulated advice remain difficult. The technical components exist, but the combination of multi-system orchestration, judgment, and emotional context makes durable automation rare without significant process redesign.

### The deliberate "do not automate" list

A mature automation strategy includes an explicit list of intents that will be routed to humans by design, with the rationale documented. This protects the customer experience and the business case — automation that should not have been attempted is the most expensive kind.

### Channel choice is part of the strategy

Voice AI, chat AI, and async messaging automation are not interchangeable. The right channel for an intent depends on customer preference, regulatory constraints, and the realistic resolution rate in each. Strategies that pick a single channel and force every intent into it tend to underperform a thought-out mix.

### Key takeaways

- Customer service automation is broader than voice AI — it spans chat, messaging, voice, and proactive outbound.
- Channel-shift without intent-fit produces worse outcomes than no automation.
- Measure net resolution rate and CSAT on escalated calls — escalated-call CSAT is the early warning that scope has been pushed too far.
- Chat automation is more capability-mature; voice has caught up sharply with modern LLM stacks.
- Pick the channel by intent, not by vendor convenience.

### FAQs

**What customer service intents should not be automated?**

Intents that require regulated advice, intents tied to retention or churn risk, and intents where the customer's emotional state is the primary signal. Automating any of these tends to convert a service problem into a brand problem.

**How should automation be measured beyond containment?**

Net resolution rate, re-contact, CSAT on contained calls, and a separate measurement of CSAT on escalated calls. The escalated-call CSAT is often the early signal that automation scope has been pushed too far.

**Is voice or chat automation more mature?**

Chat automation is more mature in pure capability terms; voice automation has caught up sharply with modern LLM-driven stacks. The right answer for a given enterprise depends on the customer's preferred channel for the specific intent.

---

## Conversational AI vs voice AI: what's the actual difference?

URL: /guides/conversational-ai-vs-voice-ai
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Conversational AI is the umbrella — any AI that holds a multi-turn dialogue in text or speech. Voice AI is the spoken-telephony subset. The architectural, latency, and operating-model constraints are sharply different, and conflating them is one of the most common procurement mistakes.

### The two categories in one paragraph

Conversational AI covers any AI system that holds a multi-turn dialogue: web chatbots, in-app assistants, messaging bots, and voice. Voice AI is the subset that handles spoken telephone or in-app voice interactions, with the additional constraints of telephony, real-time latency, naturalness, barge-in, and end-of-turn detection.

Every voice AI is a conversational AI; not every conversational AI is voice AI. The distinction matters because the constraints that decide whether a voice AI works in production — latency budget, audio quality, telephony integration — are absent in text-only conversational AI.

### What changes when you move from text to voice

The model layer can look identical. The surrounding system does not.

- Latency budget compresses from 3–5 seconds (text) to under 1.5 seconds (voice) before perceived quality drops
- End-of-turn detection becomes a hard problem — there is no submit button
- Audio quality, codec, and packet loss become part of the failure surface
- Barge-in handling — the caller interrupting mid-sentence — has to be designed in, not retrofitted
- Telephony adds SIP, contact-centre platform integration, recording, and consent that text never faced
- Error recovery happens out loud, so it has to be conversational, not modal
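The latency budget is easiest to reason about as a sum of stages. A minimal sketch, assuming the stages run strictly one after another (real stacks stream and overlap them, so this is a conservative upper bound). The stage names and per-stage figures are hypothetical; only the 1.5 second budget comes from the list above.

```python
VOICE_BUDGET_S = 1.5  # budget quoted in the text


def check_latency_budget(stage_latencies_s: dict[str, float],
                         budget_s: float = VOICE_BUDGET_S) -> tuple[float, bool]:
    """Return (total latency in seconds, whether it fits the budget)."""
    total = sum(stage_latencies_s.values())
    return total, total <= budget_s


# Hypothetical per-stage measurements for one turn.
turn = {
    "end_of_turn_detection": 0.30,
    "speech_to_text": 0.20,
    "integration_call": 0.40,
    "llm_first_token": 0.35,
    "tts_first_audio": 0.15,
}
total, ok = check_latency_budget(turn)
print(f"total {total:.2f}s, within budget: {ok}")
```

Laying the stages out this way shows how little room a slow integration call leaves for everything else in the turn.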

### Where the operating model diverges

Text conversational AI is usually owned by digital or product teams; voice AI almost always ends up co-owned with contact-centre operations because the failure modes spill into the live-agent queue. The conversation-owner role for voice carries a heavier ongoing review burden because audio failure modes are harder to spot in a dashboard than text ones.

### Should an enterprise standardise on one platform for both?

Tempting, rarely correct. The same vendor often handles one channel materially better than the other, and the operational owners differ. Standardising on a shared conversation design language — intents, guardrails, escalation rules — is high-leverage; standardising on a single runtime usually is not.

### Key takeaways

- Conversational AI is the umbrella term; voice AI is the spoken-telephony subset.
- Latency, audio quality, barge-in, and telephony are voice-only constraints — they decide whether a voice deployment works in production.
- The operating-model owner for voice usually lives in contact-centre operations, not in digital or product.
- Standardising on shared conversation-design language across channels works; standardising on a single runtime usually does not.
- Voice prompts are not chat prompts — reusing them is the most common cause of unnatural voice AI.

### FAQs

**Is voice AI a type of conversational AI?**

Yes. Conversational AI is the umbrella term for any multi-turn AI dialogue system; voice AI is the spoken-telephony subset, with additional latency, audio, and telephony constraints.

**Is conversational AI just chatbots?**

No. Conversational AI covers chat, messaging, in-app assistants, and voice. Older chatbots were rule-based; modern conversational AI is LLM-driven and capable of multi-step action.

**Can the same platform run both voice and chat?**

Many platforms market both. In practice, vendors usually do one materially better than the other, and the operating-model owners differ between channels. Evaluate the channels separately even if you standardise on a vendor.

**Which is harder to deploy, voice or chat?**

Voice. The latency budget is tighter, the failure surface includes telephony and audio, and end-of-turn detection has no equivalent of a submit button. Chat is technically easier; voice is operationally heavier.

**Does the same prompt work for voice and chat?**

Rarely. Voice prompts have to be shorter, avoid bullet lists and code blocks, and account for ambiguity in spoken intent. Reusing chat prompts on voice is the most common cause of unnatural-sounding voice AI.

---

## Voice AI security and compliance: the enterprise buyer's checklist

URL: /guides/voice-ai-security-compliance
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Voice AI security is not a model problem — it is a data-flow problem. The questions that decide whether a deployment is approvable concern where audio, transcripts, and PII travel; what the model provider retains; how recording consent is captured; and whether the deployment survives a regulator's data-flow diagram.

### The four data flows that decide approval

Most security reviews collapse into four flows. Get clean answers on each, in writing, before procurement closes.

- Audio in transit — codec, encryption, routing path, geographic transit
- Audio at rest — recording storage, retention window, encryption at rest, deletion guarantee
- Transcript and prompt — where it is stored, who can see it, whether it is used for model training
- PII in tool calls — what is sent to systems of record, what is masked or tokenised before reaching the model

### The compliance regimes that show up most often

Different regimes care about different parts of the stack; a single "compliance" answer rarely covers them all.

- PCI DSS — applies the moment card data enters the audio stream; usually requires pause-and-resume or DTMF capture so the model never hears the digits
- HIPAA — applies to PHI in healthcare contexts; requires BAAs across every data-handling vendor including model providers and recording storage
- GDPR / UK GDPR — lawful basis, recording consent, data subject rights, transfer mechanisms outside the UK/EEA, DPIA artefacts
- FCA / financial-services rules — call recording retention, vulnerable-customer handling, fair treatment evidence
- Sector-specific — telco lawful intercept, insurance complaint logging, public-sector accessibility duties

### Data residency — the question that breaks most pilots

Many voice AI platforms advertise multi-region deployment but route inference, fine-tuning, or evaluation through a single region. UK and EU buyers should ask, in writing: where is each of speech-to-text, language-model inference, text-to-speech, and recording storage physically processed and stored, for every call. "Available in EU" is not the same as "runs in EU end-to-end."
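
The per-component question can be turned into a simple check on whatever the vendor returns in writing. This is a minimal sketch under stated assumptions: the component keys mirror the four named above, while the region labels and the `residency_gaps` helper are hypothetical and not tied to any platform.

```python
# The four components the text says to ask about, for every call.
REQUIRED_COMPONENTS = (
    "speech_to_text",
    "llm_inference",
    "text_to_speech",
    "recording_storage",
)


def residency_gaps(declared: dict[str, str], allowed_regions: set[str]) -> list[str]:
    """List every component that is unanswered or processed outside allowed regions."""
    gaps = []
    for component in REQUIRED_COMPONENTS:
        region = declared.get(component)
        if region is None:
            gaps.append(f"{component}: no written answer")
        elif region not in allowed_regions:
            gaps.append(f"{component}: processed in {region}")
    return gaps


# Hypothetical vendor response: region names are placeholders.
vendor_answer = {
    "speech_to_text": "eu-west",
    "llm_inference": "us-east",
    "text_to_speech": "eu-west",
}

for gap in residency_gaps(vendor_answer, {"eu-west", "uk-south"}):
    print(gap)
```

An empty list is the only result that supports a "runs in EU end-to-end" claim; a missing answer is treated as a gap rather than given the benefit of the doubt.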

### Recording consent and the voice AI exception

A voice AI announcement is not automatically a recording disclosure. Many jurisdictions require both — that the caller is informed they are speaking with an automated system and, separately, that the call is recorded. Conflating the two is the most common consent failure surfaced by post-deployment audits.

### Model-provider data handling

The single highest-leverage clause to negotiate is whether the underlying model provider retains, logs, or trains on the audio, transcripts, or tool-call payloads. Default settings on hosted model APIs frequently allow some form of retention; enterprise tenancies typically do not. The voice AI platform usually controls this — but only if the buyer asks.

### The questions that catch over-claims

Three questions consistently expose marketing gaps: show me a data-flow diagram for one complete call including every third party; show me where PCI-relevant data is masked, and what proves it; show me the retention and deletion policy for audio, transcript, and tool-call logs separately. If any of those answers is verbal-only, the deployment is not approvable.

### Key takeaways

- Voice AI security is a data-flow problem, not a model problem.
- Four flows decide approval: audio in transit, audio at rest, transcript/prompt handling, PII in tool calls.
- PCI usually requires pause-and-resume or DTMF capture so the model never hears card digits.
- Data residency claims are routinely overstated — get a per-component, per-call written data-flow.
- Automated-system disclosure is not the same as recording consent — most jurisdictions require both.

### FAQs

**Is voice AI PCI compliant?**

The platform is not — the deployment is. PCI compliance depends on whether card data ever enters the audio stream the model hears. The standard pattern is pause-and-resume or DTMF capture, with the digits routed to a PCI-scoped service the model never sees.

**Is voice AI GDPR compliant?**

GDPR compliance depends on lawful basis, transfer mechanism, retention, and consent — none of which the platform decides on its own. Treat "GDPR-compliant" as a starting position, then walk through the data-flow per use case.

**Where is voice AI data processed?**

It varies by platform and by call. Ask for a written, per-component data-flow: speech-to-text, language-model inference, text-to-speech, recording. "Available in your region" is not the same as "runs in your region."

**Does the model provider retain my call data?**

By default, often yes — many hosted model APIs retain inputs for abuse monitoring or evaluation. Enterprise tenancies typically allow zero retention, but only if the voice AI platform passes the right flags. Confirm in writing.

**What does a defensible recording-consent script look like?**

Two distinct disclosures: that the caller is interacting with an automated system, and that the call is being recorded — with the lawful basis, retention, and opt-out path. Conflating the two is the most common audit finding.

**Do I need a DPIA for a voice AI deployment?**

In UK/EU contexts, almost always yes. Voice AI involves automated processing of personal data, often at scale, often touching special-category data. A DPIA is the cheapest insurance you will buy on the programme.

---

## Conversational IVR / IVR replacement: the phased migration playbook

URL: /guides/ivr-replacement-playbook
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Successful IVR replacements are phased, not big-bang. Migrate one intent cluster at a time, run voice AI and the legacy IVR in parallel until each cluster clears its gate, and never remove the IVR as a disaster-recovery path in year one.

### Why big-bang IVR replacement fails

Big-bang cutovers concentrate every risk — integration, intent mix, model behaviour, operating model — into a single weekend. When something regresses, the blast radius is the entire inbound queue, and the only rollback is reverting the SIP routing, which loses the AI's learning to date.

Phased replacement separates those risks so each can be measured and reversed independently.

### The five-phase migration

These phases are intentionally boring. The goal is to remove drama from the cutover, not to demonstrate speed.

- Phase 1 — intent triage: rank existing IVR intents by volume, complexity, and resolution probability; pick three to five for the first wave.
- Phase 2 — parallel running: route the wave's intents to voice AI; leave everything else on the IVR; keep both reachable from the same number.
- Phase 3 — measured cutover gate: each intent clears a written gate (containment, AHT, CSAT, re-contact) before it counts as replaced.
- Phase 4 — incremental migration: add one intent cluster per sprint; never carry an open regression into the next wave.
- Phase 5 — IVR as DR: keep the legacy IVR warm and tested as the documented fallback path through year one.
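
Phase 1 can be sketched as a ranking exercise. The scoring formula below (volume multiplied by resolution probability, divided by complexity) is an assumption chosen for illustration, as are the intent names, the 1 to 5 complexity scale, and every number; the text only says to rank on those three factors and pick three to five intents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    name: str
    monthly_calls: int
    complexity: int  # assumed scale: 1 (simple) to 5 (hard)
    resolution_probability: float  # estimated share the AI could resolve, 0 to 1


def triage_score(intent: Intent) -> float:
    """Hypothetical score: favour high volume and likely resolution, penalise complexity."""
    return intent.monthly_calls * intent.resolution_probability / intent.complexity


def first_wave(intents: list[Intent], size: int = 3) -> list[str]:
    """Names of the top-scoring intents for the first migration wave."""
    ranked = sorted(intents, key=triage_score, reverse=True)
    return [intent.name for intent in ranked[:size]]


# Illustrative intent mix (all figures assumed).
ivr_intents = [
    Intent("billing_dispute", 30_000, 5, 0.20),
    Intent("balance_enquiry", 40_000, 1, 0.80),
    Intent("retention", 8_000, 4, 0.15),
    Intent("order_status", 25_000, 1, 0.70),
    Intent("appointment_confirmation", 15_000, 2, 0.75),
]

print(first_wave(ivr_intents))
```

On these assumed inputs the first wave is the three transactional intents, and the dispute and retention intents fall to later waves despite their volume.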

### The cutover gate that protects CX

Each intent migrates only when it clears four numbers measured on production traffic against the IVR baseline: containment within 5 points of plan, AHT no worse than the IVR baseline, CSAT within margin of error, and 7-day re-contact no higher than baseline. Failing any one returns the intent to the IVR until the next sprint.
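
The four-number gate can be written down as an explicit check, which also forces the thresholds to be agreed in advance. This sketch makes two interpretive assumptions: "within 5 points of plan" is read as no more than 5 points below the planned containment, and "within margin of error" is read as CSAT no lower than the baseline minus an agreed margin. The field names and example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateMetrics:
    containment_pct: float
    aht_seconds: float
    csat: float
    recontact_7d_pct: float


def gate_failures(
    measured: GateMetrics,
    ivr_baseline: GateMetrics,
    planned_containment_pct: float,
    csat_margin: float,
) -> list[str]:
    """Return the names of the gate checks that failed; an empty list means the gate is cleared."""
    failures = []
    if measured.containment_pct < planned_containment_pct - 5:
        failures.append("containment")
    if measured.aht_seconds > ivr_baseline.aht_seconds:
        failures.append("aht")
    if measured.csat < ivr_baseline.csat - csat_margin:
        failures.append("csat")
    if measured.recontact_7d_pct > ivr_baseline.recontact_7d_pct:
        failures.append("recontact")
    return failures


# Illustrative figures only.
baseline = GateMetrics(containment_pct=0.0, aht_seconds=180, csat=4.2, recontact_7d_pct=12.0)
pilot = GateMetrics(containment_pct=48.0, aht_seconds=190, csat=4.1, recontact_7d_pct=11.0)

print(gate_failures(pilot, baseline, planned_containment_pct=55.0, csat_margin=0.2))
```

Any non-empty result sends the intent back to the IVR until the next sprint, which matches the rule that failing a single number is enough.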

### Fallback design that nobody regrets

Every intent in scope should have a documented, one-click fallback to the legacy IVR or a live queue. The fallback is not just for outages: it also covers the calls the AI handled poorly, the new intents that arrived unannounced, and the regulatory edge cases that need a human. Leaving fallback design until after launch is the single most common gap in otherwise production-grade deployments.

### Key takeaways

- Phased migration beats big-bang every time — one intent cluster at a time.
- Each intent clears a written gate (containment, AHT, CSAT, re-contact) before it counts as replaced.
- Run voice AI and the IVR in parallel; keep the IVR warm as DR through year one.
- Fallback design after launch is the single most common production gap.
- Save authentication-heavy and emotional intents for later waves, not the first.

### FAQs

**How long does an enterprise IVR replacement actually take?**

Six to eighteen months for a multi-intent enterprise contact centre. Pilots that promise 'six weeks to replace the IVR' usually replace a single intent cluster, not the IVR.

**Should the legacy IVR be decommissioned in year one?**

Almost never. Keeping the IVR as a warm disaster-recovery path costs little and is the documented fallback for outages, regulatory edge cases, and intents the AI is not yet configured for.

**Which intent should be migrated first?**

A high-volume transactional intent with a clear success state and a forgiving failure mode — typically balance enquiry, order status, or appointment confirmation. Save authentication-heavy and emotional intents for later waves.

**What if the cutover gate is missed?**

Roll the intent back to the IVR, fix the failure mode in the next sprint, and re-attempt the gate. Promoting an intent that missed the gate is how production-grade deployments become whispered cautionary tales.

---

## Voice AI pricing models: per-minute, per-resolution, and platform compared

URL: /guides/voice-ai-pricing-models
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Three pricing models dominate enterprise voice AI: per-minute, per-resolution, and platform. Each transfers risk differently. Convert all three to cost per resolved call before comparing — the headline rate almost never wins.

### Per-minute pricing

Per-minute is the simplest model and the easiest to compare on a rate card. It transfers no containment risk to the vendor — the buyer pays for every minute the AI is on the line regardless of outcome.

Typical 2026 enterprise per-minute pricing for a fully managed voice AI stack lands in the range of $0.10 to $0.35 per minute, with model choice and telephony bundling the largest swing factors. Self-hosted or build-it-yourself stacks routinely run lower per minute but higher on total cost of ownership.

### Per-resolution pricing

Per-resolution pricing charges only for calls the AI resolves end-to-end, with a written definition of 'resolved' attached to the contract. It transfers containment risk to the vendor — which both aligns incentives and forces the contract to define resolution before signing.

Typical 2026 enterprise per-resolution pricing lands in the range of $1.00 to $4.00 per resolved call, depending on call complexity and what 'resolved' includes. The model is most defensible when the buyer is early in its voice-AI journey and the containment rate is genuinely uncertain.

### Platform pricing

Platform pricing decouples cost from per-call volume. The buyer pays a fixed annual platform fee plus a usage component that is small relative to the platform fee. The model favours buyers running at predictable, high volume and tends to disadvantage low-volume or seasonal deployments.

### Cost per resolved call — the only fair comparison

Convert each model to cost per resolved call using your modelled containment rate, then stress-test at 0.5x and 2x that rate (capping the upper case at 100%). Three things almost always come out of that exercise: per-minute looks best at the assumed containment and worst at half of it; per-resolution is the inverse; platform pricing wins decisively at high volume and loses badly at low volume. The right answer is almost never the headline.
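
The conversion is simple enough to sketch. The three functions below are one plausible formulation, not a contract formula: per-minute spend is divided by resolved calls, per-resolution is the contract price itself, and platform cost is the fixed fee plus usage spread over resolved calls. All rates, volumes, and fees in the example are assumed figures.

```python
def per_minute_cprc(rate_per_min: float, avg_minutes: float, containment: float) -> float:
    """Every call is billed by the minute, but only contained calls count as resolved."""
    return rate_per_min * avg_minutes / containment


def per_resolution_cprc(price_per_resolution: float) -> float:
    """The vendor carries containment risk, so the unit cost is the contract price."""
    return price_per_resolution


def platform_cprc(annual_fee: float, usage_per_call: float,
                  annual_calls: int, containment: float) -> float:
    """Fixed fee plus usage, spread over the calls actually resolved."""
    return (annual_fee + usage_per_call * annual_calls) / (annual_calls * containment)


def stress_rates(modelled: float) -> list[float]:
    """Containment at 0.5x, 1x, and 2x the modelled rate, capped at 100%."""
    return [min(1.0, modelled * factor) for factor in (0.5, 1.0, 2.0)]


# Assumed inputs: $0.20/min, 3-minute calls, $2.50 per resolution,
# $500k platform fee plus $0.05 per call, 1M calls a year, 40% modelled containment.
for containment in stress_rates(0.40):
    print(
        f"{containment:.0%}",
        round(per_minute_cprc(0.20, 3.0, containment), 2),
        round(per_resolution_cprc(2.50), 2),
        round(platform_cprc(500_000, 0.05, 1_000_000, containment), 2),
    )
```

On these assumed inputs the per-minute cost per resolved call doubles when containment halves, while the per-resolution figure does not move. That asymmetry is the risk transfer the pricing model is selling.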

### Hidden costs that move the answer

Five line items move the comparison materially and are almost never on the rate card: telephony pass-through, model API charges if not bundled, integration / connector fees, professional services for the operating model, and the change order for any custom voice or guardrail work. A 30 to 60% uplift on the headline number is normal once these are included.

### Key takeaways

- Three models dominate: per-minute, per-resolution, and platform.
- Per-minute transfers no containment risk; per-resolution transfers it to the vendor; platform decouples cost from per-call volume.
- Convert all three to cost per resolved call at your modelled containment, then stress-test at 0.5x and 2x.
- Five hidden costs (telephony, model API, integration, services, change orders) routinely add 30–60% to the headline.
- Operating-model labour — usually 0.5–1.0 FTE — is the most under-modelled cost in vendor proposals.

### FAQs

**Which voice AI pricing model is cheapest?**

None universally. Per-minute is cheapest at high measured containment, per-resolution is cheapest at uncertain or low containment, and platform pricing is cheapest at high predictable volume. Convert to cost per resolved call before deciding.

**How is 'resolved' defined in per-resolution pricing?**

It varies by contract and is the single most negotiated definition in voice-AI procurement. The defensible definition requires evidence of resolution (a fulfilled action or explicit caller confirmation) and subtracts re-contact within a defined window for the same intent.

**Do vendors negotiate pricing models?**

Yes, especially for enterprise commitments. Hybrid models — per-resolution with a per-minute floor, or platform with a per-resolution overage — are common in 2026 enterprise contracts.

**What is the most under-modelled cost?**

Operating-model labour. A 0.5 to 1.0 FTE conversation owner plus engineering on-call is rarely in the vendor proposal and is the most common reason real cost diverges from rate-card cost.

---

## Call deflection benchmarks: realistic 2026 numbers by intent and channel

URL: /guides/call-deflection-benchmarks
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Realistic 2026 call deflection lands in three bands by intent type. Vendor-headline gross rates routinely run 20 to 40 points above the net rate finance will accept. Baseline against your own IVR and queue, not a vendor average.

### Why benchmarks need three dimensions

A single deflection benchmark is meaningless without three qualifiers: the intent type, the channel doing the deflection, and whether the number is gross or net of 7-day re-contact for the same intent. The same deployment can quote 65% or 25% depending on which combination is used.

### Realistic 2026 bands by intent

Across enterprise deployments seen in 2025 and 2026, defensibly measured net deflection clusters in three bands.

- Transactional intents (balance, status, appointment confirmation, simple changes): net 50–75%
- Mixed intents (billing questions, account changes, basic claims FNOL): net 25–45%
- Complex intents (disputes, retention, multi-step claims, vulnerable-customer): net 10–25%

### Realistic 2026 bands by channel

Channel choice constrains the achievable rate as tightly as intent does.

- Voice AI containment on inbound calls: 25–60% blended, intent-mix-dependent
- SMS deflection from queue: 8–22% accept rate; of accepts, 30–55% resolve without callback
- Proactive outbound deflection (notifications before the call is placed): 5–15% volume reduction on the targeted intents
- Chat hand-off from voice: 10–25% accept; resolution rates similar to standalone chat
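
The SMS line is a two-stage funnel, so the figure that matters is the product of the two rates. The arithmetic below uses the band edges quoted above; the assumption is that the accept rate is measured against calls offered the SMS option, and the function name is illustrative.

```python
def effective_deflection(accept_rate: float, resolve_rate: float) -> float:
    """Share of offered calls that accept the SMS path and never call back."""
    return accept_rate * resolve_rate


low = effective_deflection(0.08, 0.30)   # bottom of both bands
high = effective_deflection(0.22, 0.55)  # top of both bands

print(f"{low:.1%} to {high:.1%}")
```

Read this way, SMS deflection removes roughly 2.4% to 12.1% of the calls it is offered to, which is a much smaller number than the accept rate alone suggests.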

### The gross-to-net gap

The single largest delta in deflection reporting is gross vs net. Across deployments, net deflection on a 7-day window for the same intent runs 20 to 40 points below gross. The gap is largest on emotional or multi-step intents and smallest on clean transactional ones.

Any deflection figure quoted without a defined re-contact window should be treated as marketing, not measurement.
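
The gross and net figures come from the same call log; the only difference is whether same-intent re-contacts inside the window are subtracted. A minimal sketch, with hypothetical counts chosen to reproduce the 65% versus 25% example earlier in this guide:

```python
def deflection_rates(total_calls: int, deflected: int,
                     recontacts_in_window: int) -> tuple[float, float]:
    """Return (gross, net) deflection as fractions of total calls.

    Net subtracts deflected calls that came back for the same intent
    within the agreed re-contact window.
    """
    if total_calls <= 0 or not 0 <= recontacts_in_window <= deflected <= total_calls:
        raise ValueError("expected 0 <= recontacts <= deflected <= total and total > 0")
    gross = deflected / total_calls
    net = (deflected - recontacts_in_window) / total_calls
    return gross, net


# Illustrative counts only: 10,000 calls, 6,500 deflected, 4,000 back within 7 days.
gross, net = deflection_rates(10_000, 6_500, 4_000)
print(f"gross {gross:.0%}, net {net:.0%}, gap {round((gross - net) * 100)} points")
```

The re-contact count is meaningless without the window and the same-intent match defined in writing, which is why the window belongs in the contract rather than in the dashboard.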

### How to baseline honestly

Three steps produce a baseline a finance director will accept: pull 90 days of pre-AI volume with intent tagging, define the re-contact window in writing (7 or 14 days depending on intent), and measure net deflection on a matched intent mix during the pilot. The exercise routinely halves the headline number — and routinely doubles the credibility of the business case.

### Key takeaways

- Realistic net deflection lands in three bands by intent: 50–75% transactional, 25–45% mixed, 10–25% complex.
- Gross vs net is the single largest delta — net runs 20–40 points below gross.
- Channel choice constrains the rate as much as intent does.
- Any number quoted without a defined re-contact window should be treated as marketing.
- Baseline against your own IVR, intent mix, and channel — not a blended vendor average.

### FAQs

**What is a realistic call deflection benchmark?**

There is no single number. Defensibly measured net deflection runs 50–75% on clean transactional intents, 25–45% on mixed intents, and 10–25% on complex intents. The blended figure for an enterprise contact centre depends entirely on intent mix.

**Why do vendor deflection rates look so much higher?**

Two reasons: they quote gross rather than net of re-contact, and they often exclude short abandons, out-of-hours calls, and out-of-scope intents from the denominator. Each adjustment is individually defensible; combined they routinely lift the headline by 20–40 points.

**What re-contact window should I use?**

Seven days is the most common standard. Use 14 days for intents with longer natural resolution cycles, such as claims, disputes, or refunds.

**Is deflection the same as containment?**

Closely related. Deflection often includes calls prevented from reaching the queue at all (SMS, proactive outbound, web). Containment usually refers specifically to calls that entered the AI flow and were not escalated.

---

## When the cheapest AI voice vendor answered zero patient calls: the VERA framework

URL: /frameworks/vera-vendor-evaluation
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A large healthcare network selected an AI voice vendor on price. Zero live patient calls were processed before the engagement was terminated. The Vendor Evaluation & Risk Assessment (VERA) framework — two gateway criteria and six weighted domains — exists to stop that failure mode in any regulated, high-volume operation.

### The setup

Before joining the vendor side, I spent the better part of a year on the buy side of an AI voice procurement for a large multi-site healthcare enterprise. The organisation operated across multiple regions, ran a single mainstream practice management system (PMS) across every site, and had aggressive expansion plans.

I scanned the market. Seven AI voice vendors made the shortlist. I built an evaluation framework, ran the assessments, and produced a recommendation. The buyer bypassed the recommendation and selected on price.

The chosen vendor — a credible global enterprise software brand with big logos and public-sector deployments — passed every pre-sale criterion on paper. Implementation had three milestones: telephony integration, PMS integration via HL7v2, and pilot call testing with real patients. None were met. The engagement was wound down before the pilot ever ran.

### What actually broke

Three distinct failures, in order of severity.

- Latency. Humans take conversational turns with gaps of roughly 200 milliseconds. Sub-second is the baseline for natural feel; past about 1.5 seconds the call reads as a broken connection. The deployed system averaged over 4 seconds, with spikes past 20. Patients hung up. The ones who didn't talked over the system, which compounded the latency.
- Accuracy. The system confidently produced wrong information — wrong appointment slots, wrong practitioner names, wrong availability. In a retail call centre that is a complaint; in healthcare it is clinical risk and regulatory exposure. The root cause was inadequate PMS integration.
- Delivery capability. The implementation team had never integrated with that PMS and had never worked inside local primary-care workflows. What was sold as 'configuration' was new development work the team could not do. There was no escalation path and no senior healthcare engineer to bring in.

### The gap conventional procurement cannot see

Healthcare procurement frameworks were built for EMRs and clinical decision-support tools. They evaluate static products: you install the thing, you configure the thing, the thing runs. AI voice is not that. AI voice is a real-time autonomous system having unscripted conversations with patients about their healthcare.

The procurement question is no longer 'does the product work' — it is 'can this vendor deliver a working production system into our specific environment, against our specific systems, under our specific regulatory regime.' That question has three layers: compliance risk, technical integration risk, and implementation delivery risk.

Layers one and two are visible in an RFI. Layer three is invisible until after the contract is signed — and it is where this vendor failed completely.

### VERA gateway 1 — Data Sovereignty (not residency)

Residency is where data is stored. Sovereignty is whose laws and authorities can reach it. Under GDPR-style regimes, you do not have to permanently store data abroad to make a cross-border transfer — making it available to a processor in another country counts, including transient real-time access. The moment live patient voice is routed to an overseas speech-to-text or inference service, you have almost certainly made a cross-border transfer, even though your database never left the country.

More than half of the vendors evaluated stored data in-country and routed live voice processing through US infrastructure. Several did not understand the distinction. When asked where the speech-to-text inference ran, they could not answer.

Where data physically sits does not, by itself, determine which governments can compel access to it. The US CLOUD Act lets US authorities compel US-based providers to hand over data in their control regardless of which country the servers are in. The legal ground under transatlantic transfers also moves: the EU frameworks authorising them have already been struck down twice (Safe Harbour in Schrems I, 2015, and Privacy Shield in Schrems II, 2020) and rebuilt (the 2023 Data Privacy Framework, itself now under appeal). The lesson is not the case name; it is that an adequacy decision can be revoked, so you want sovereignty you control. Outsourcing the processing does not outsource the accountability. You remain the controller.

### VERA gateway 2 — Regulatory Compliance

Health-data rules are jurisdiction-specific, and that is the point: you map obligations to where care is delivered and where patients sit, rather than assuming one country's health-privacy law travels with the technology. In the US the sector-specific regime is HIPAA; in Europe, health data is 'special category' data under GDPR with heightened safeguards. They are not interchangeable, and a vendor quoting the wrong one at you is a tell. To clear this gateway, ask the vendor for three things:

- A jurisdiction-specific assessment for the actual market being bought for — not a generic policy retrofitted from somewhere else.
- A clear position on AI-specific rules now arriving. The EU AI Act's duty to tell people they are speaking to an AI takes effect in 2026, with heavier 'high-risk' obligations pushed to late 2027 under a 2026 amendment.
- Clarity on medical-device classification, which turns on intended use. An appointment-booking assistant is generally not a medical device; one that assesses symptoms or steers clinical decisions can be — and that triggers a far heavier regulatory pathway.

### The six weighted VERA domains

If a vendor fails either gateway, the assessment stops. Price is not a tiebreaker until the gates are passed. The weighted domains, in order:

- PMS / systems integration — verified, not promised. Sandbox test against the actual system. Reject vendors who have never seen the customer's PMS before.
- Telephony — domestic SIP, domestic PSTN, documented redundancy.
- Clinical safety — human handoff path, override controls, adverse-event reporting, patient AI disclosure, bias monitoring.
- Scalability — references at comparable scale, multi-site configuration management, centralised administration.
- Implementation delivery capability — the domain that did not exist before this failure. Named, healthcare-experienced delivery team. Milestone-based schedule with written acceptance criteria. Executive escalation path. Independent reference interviews with clients of comparable size — and you call them yourselves, not the references the vendor hands you.
- Commercial — price comes last, after everything else.

### What changed after

The replacement vendor cleared both gateways. Voice processing inside the customer's jurisdiction. Compliance documentation written for the actual regulatory regime, not retrofitted from somewhere else. PMS integration verified in a sandbox before contract. Independent reference calls with comparable healthcare networks, all confirming milestone delivery within agreed timelines. Calls got answered.

### Three lessons for operators

A polished demo predicts nothing about delivery. The failed vendor's demo was the best of the seven; their delivery was the worst. A demo measures sales capability. Implementation measures engineering capability. These are different functions inside the vendor, and you must assess them separately. The most reliable way is to make the vendor prove it before contract — sandbox the integration, simulate the calls, test against your actual systems and edge cases. What a vendor will not demonstrate before you sign, they usually cannot deliver after.

Price is the last filter, not the first. Every unit saved on AI voice procurement evaporates the moment the system goes down on a Monday at 9am with dozens of patients in the queue. The buyer spent more on the failed engagement than a year of the right vendor would have cost.

Data sovereignty is the question most vendors cannot answer cleanly. If you ask only one question in your next AI voice evaluation, ask where the speech-to-text inference physically runs. Get it in writing. Get the subprocessors named. If they hedge, the gate is closed.

### Key takeaways

- Choose on capability and delivery risk; price is the last filter, not the first.
- Data sovereignty (whose laws reach the data in real time) is distinct from residency and is the question most vendors cannot answer cleanly.
- Compliance must be jurisdiction-specific for where care is delivered — not a retrofitted generic policy.
- Implementation delivery capability is invisible in an RFI and is where most failures happen. Demand a named team with relevant experience and independent references you call yourself.
- Make the vendor prove integration in a sandbox before contract. What they will not demonstrate before signing, they usually cannot deliver after.

### FAQs

**What is the VERA framework?**

Vendor Evaluation & Risk Assessment: two gateway criteria (data sovereignty and regulatory compliance) and six weighted domains (integration, telephony, clinical safety, scalability, implementation delivery capability, and commercial). A vendor that fails either gateway is out, regardless of price.

**Why distinguish data sovereignty from data residency?**

Residency is where data is stored. Sovereignty is whose laws and authorities can reach it. Live voice routed to an overseas inference service is a cross-border transfer even if the database never leaves the country — and outsourcing the processing does not outsource the accountability.

**Is an AI appointment-booking assistant a medical device?**

Generally not. Classification turns on intended use. An assistant that assesses symptoms or steers clinical decisions can fall into the medical-device pathway, which is significantly heavier than non-clinical scheduling automation.

**What is the single most useful question to ask an AI voice vendor?**

Where does the speech-to-text inference physically run, and which subprocessors handle it. Get the answer in writing. Vendors that hedge on this question rarely have a defensible sovereignty story.

---

## Voice AI RFP template: what to actually ask, and how to score the answers

URL: /guides/voice-ai-rfp-template
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** An RFP that asks 'do you support barge-in?' gets back 'yes' from every vendor on the long list. An RFP that asks 'demonstrate barge-in in a call where the caller interrupts a four-second response, and show the turn-taking latency' eliminates two thirds of them. This template is the second kind.

### How to use this template

Treat the RFP as the first integration test, not a paperwork exercise. Every question below is written so the answer either ships an artefact, names a number, or is a disqualifier. Score independently before reconciling. Pre-commit the weightings, borrowed from the evaluation matrix, before any responses arrive.
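
Independent scoring only helps if disagreements are surfaced rather than averaged away. The sketch below flags the sections where evaluators' scores diverge by more than an agreed spread, so the reconciliation meeting starts with those. The evaluator names, the 0 to 5 scale, and the one-point threshold are all assumptions for illustration.

```python
def needs_reconciliation(scores_by_evaluator: dict[str, dict[str, int]],
                         max_spread: int = 1) -> list[str]:
    """Sections where the gap between highest and lowest score exceeds max_spread."""
    sections = sorted({s for scores in scores_by_evaluator.values() for s in scores})
    flagged = []
    for section in sections:
        values = [scores[section] for scores in scores_by_evaluator.values()
                  if section in scores]
        if max(values) - min(values) > max_spread:
            flagged.append(section)
    return flagged


# Hypothetical scores recorded separately by two evaluators for one vendor.
independent_scores = {
    "evaluator_1": {"A": 4, "B": 2, "C": 3, "D": 3},
    "evaluator_2": {"A": 3, "B": 5, "C": 3, "D": 4},
}

print(needs_reconciliation(independent_scores))
```

Here only Section B is flagged, which is the useful outcome: a three-point gap on a gateway section usually means the two evaluators read different evidence, not that the truth is somewhere in the middle.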

### Section A — Company, references, delivery (weight: 15%)

The most predictive section and the one most often skipped. Implementation delivery capability is invisible in capability claims.

- Name three customers at our scale (±50%) in our region and industry. We will call them directly, not the references you nominate.
- Name the delivery team that would run our implementation — by individual, with relevant tenure and prior projects of comparable shape.
- Provide the escalation path with names and SLAs at the engineering, product, and executive levels.
- Describe two implementations that went poorly: what broke, and what changed in your delivery model as a result.
- Provide your written change-of-control policy, including what happens to our deployment if you are acquired in year two.

### Section B — Data sovereignty and security (weight: 20%, gateway)

Sovereignty is whose laws can reach the data, not where it is stored. Vendors who cannot answer cleanly are filtered here, not later.

- Provide a data-flow diagram per call leg: capture, speech-to-text, retrieval, inference, text-to-speech, telephony, storage. Name the legal entity and jurisdiction operating each.
- Name every sub-processor with the function performed and the country of processing. State your change-notification SLA when sub-processors change.
- Provide SOC 2 Type II, ISO 27001, and (if relevant) HITRUST. State the audit period and any qualifications.
- Provide your DPA, including international transfer mechanism, retention defaults, deletion SLAs, and the exit-and-destruction plan.
- State whether models are trained on customer data by default. State the opt-out mechanism and how it is enforced technically.

### Section C — Regulatory compliance (weight: 10%, gateway)

Jurisdiction-specific to where care or service is delivered and where the caller sits. A vendor quoting the wrong regime at you is a tell.

- Provide the jurisdiction-specific compliance assessment for our actual market — not a global retrofit.
- State your position on the EU AI Act AI-disclosure obligation (in force 2026) and your roadmap for the high-risk obligations under the 2026 amendment.
- Provide evidence of consent capture for call recording in each jurisdiction we operate.
- State your position on medical-device classification (healthcare) or FCA / NAIC / regional financial regulator scope (financial services).

### Section D — Integration depth (weight: 20%)

Read-only against generic connectors is table stakes. Write, idempotency, failure handling, and audit are where deployments survive.

- Demonstrate a write to our actual system of record in a paid sandbox, with the auth pattern documented. Show what happens when that write fails.
- List the contact-centre platforms, CRMs, IVRs, identity providers, and ticketing systems with production integrations, not roadmap.
- Describe your idempotency model for actions that touch downstream systems.
- Provide the audit format for every write. We will need it for our SOX / regulatory audit.
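
To make the idempotency and audit questions above concrete, here is a minimal sketch of one common approach: derive a stable key from the call, the action, and the payload, and replay the stored result when the same key arrives again. The class name, the in-memory store, and the audit fields are all hypothetical; a real vendor's model may differ, which is exactly what the question is meant to surface.

```python
import hashlib
import json


def idempotency_key(call_id, action, payload):
    """Derive a stable key so a retried write maps to the same operation."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    raw = f"{call_id}|{action}|{canonical}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentWriter:
    """Wraps a downstream write; repeats of the same key replay the stored result."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._results = {}  # key -> result of the first successful write
        self.audit = []     # one entry per attempt, including replays and failures

    def write(self, call_id, action, payload):
        key = idempotency_key(call_id, action, payload)
        if key in self._results:
            self.audit.append({"key": key, "outcome": "replayed"})
            return self._results[key]
        try:
            result = self._write_fn(action, payload)
        except Exception as exc:
            # Failed writes are not cached, so a retry attempts the write again.
            self.audit.append({"key": key, "outcome": "failed", "error": str(exc)})
            raise
        self._results[key] = result
        self.audit.append({"key": key, "outcome": "written"})
        return result
```

In a sandbox test, the useful check is that a retried action produces one downstream write and two audit entries, and that a failed write leaves an audit entry rather than a silent gap.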

### Section E — Operating model and control surface (weight: 15%)

The conversation owner — a senior contact-centre operator, not an engineer — is the highest-leverage role. Vendors that lock changes behind engineering tickets fail in production.

- Demonstrate a non-engineer changing an intent, deploying to staging, and rolling back in under one hour.
- Provide the audit log for the last ten changes to a customer deployment, by author and revert path.
- Describe the staging-to-production promotion model, including diff review and approval gates.
- Describe the per-call observability available to the conversation owner: transcript, intent labels, tool calls, latency per step, escalation reason.

### Section F — Performance and latency (weight: 10%)

End-to-end turn latency above 1.5 seconds reads as a broken connection. The number to ask for is p95 under realistic load, not mean in a demo.

- State the p95 end-to-end turn-taking latency under 20 parallel sessions, using a script we provide.
- Demonstrate graceful barge-in: the caller interrupts a four-second response, and the system yields without losing turn context.
- Provide the latency budget per step (ASR, retrieval, LLM, TTS, telephony) for a representative production call.
- Describe your behaviour under degraded LLM provider performance — failover, graceful degradation, caller-facing experience.

### Section G — Commercial (weight: 10%)

Price last, after everything else. Model the deployment at 0.5x and 2x our forecast volume and containment.

- Provide pricing under two scenarios: 50% of our forecast call volume and 200% of it. State which lines move and by how much.
- State the per-minute floor on escalated calls and the rules for the AI-exposure minutes before transfer.
- Provide the exit terms: data export format, timeline, proof-of-destruction, and any disengagement fee.
- State the price-protection mechanism on contract renewal and the conditions under which it lapses.

### Scoring sheet

Each evaluator scores independently per section on the 1 / 3 / 5 scale below, then multiplies by the section weight. Reconciliation happens after all scores are recorded. The two gateway sections (B, C) are also pass / fail — a fail removes the vendor regardless of total score.
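
As a rough illustration of the arithmetic described above, the sketch below applies the section weights from this template to each evaluator's 1 / 3 / 5 scores and treats sections B and C as gateways. The function names and the `spread` field are illustrative assumptions, not part of the template.

```python
from statistics import mean

# Section weights as stated in this template (they sum to 1.0).
WEIGHTS = {"A": 0.15, "B": 0.20, "C": 0.10, "D": 0.20, "E": 0.15, "F": 0.10, "G": 0.10}
GATEWAYS = {"B", "C"}       # pass / fail sections
ALLOWED_SCORES = {1, 3, 5}  # the 1 / 3 / 5 scale


def weighted_total(scores):
    """Weighted total for one evaluator's sheet, on the same 1 to 5 range."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every section A-G exactly once")
    if not set(scores.values()) <= ALLOWED_SCORES:
        raise ValueError("scores must be 1, 3 or 5")
    return sum(WEIGHTS[s] * scores[s] for s in WEIGHTS)


def vendor_result(sheets, gateway_pass):
    """Combine independently recorded sheets; a gateway fail removes the vendor."""
    failed = sorted(s for s in GATEWAYS if not gateway_pass[s])
    totals = [weighted_total(sheet) for sheet in sheets]
    return {
        "eliminated": bool(failed),
        "failed_gateways": failed,
        "mean_total": round(mean(totals), 2),
        # A wide spread marks where evaluators read the same answer differently.
        "spread": round(max(totals) - min(totals), 2),
    }
```

The spread between evaluators is worth keeping alongside the mean: it points reconciliation at the sections where the scores disagree.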

### Disqualifying answers

Short list of answers that should remove a vendor regardless of section score. These are the patterns that, in practice, predict a failed implementation.

- Cannot name where speech-to-text inference physically runs, or hedges on sub-processors.
- Refuses to do a paid sandbox integration against the customer's actual system of record before contract.
- Quotes the wrong regulatory regime (HIPAA at a UK buyer, GDPR generics at a US healthcare network).
- Cannot produce the audit log for the last ten changes to any customer deployment.
- References are all the vendor's nominated contacts; will not allow independent outreach.
- Pricing model contains an unbounded overage clause without a renegotiation trigger.

### Key takeaways

- An RFP that asks for evidence eliminates two thirds of the long list; one that asks for capability claims does not.
- Treat data sovereignty and regulatory compliance as pass / fail gateways first and weighted dimensions second.
- The most predictive section is delivery capability — named team, independent references, two implementations that went poorly.
- Require a paid sandbox integration with the shortlisted vendors before contract; what they will not demonstrate before signing, they usually cannot deliver after.
- Score independently, reconcile after, pre-commit the weightings before any responses arrive.

### FAQs

**How long should a voice AI RFP take to run?**

Six to eight weeks from issue to shortlist for an enterprise procurement, plus a further four to six weeks for paid sandbox integration with the shortlisted two or three. Compressing below this usually means skipping evidence in favour of capability claims.

**Why score independently before reconciliation?**

Group scoring drifts to the median and hides disagreement. Independent scores expose where evaluators saw different things in the same answer — which is exactly where the conversation that matters happens.

**What is the single most useful question in a voice AI RFP?**

'Demonstrate a write to our system of record in a paid sandbox, and show what happens when that write fails.' Most failed deployments fail on this seam, not on conversation quality.

**Should the RFP name pricing targets?**

No. Anchoring on price up front causes vendors to game the per-minute line and recover margin in implementation services or overages. Score capability and delivery first; commercial terms last.

---

## Voice AI security questionnaire: the questions IT-Sec actually needs answered

URL: /guides/voice-ai-security-questionnaire
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Voice AI moves live customer audio across multiple services in real time. A generic SaaS security questionnaire does not surface where any of it goes. This is the voice-specific addendum that should sit alongside your standard one — written so each answer either ships an artifact, names a jurisdiction, or is a disqualifier.

### A — Data flow and per-call-leg residency

The single highest-value section. Most vendors will quote 'data residency in your region' and route inference through a US-based model provider. Both can be true. Ask for the per-call-leg picture.

### B — Sub-processors

The sub-processor list, not the headline residency, determines where data actually goes. Change notification is where most DPAs are weakest.

- Provide the current sub-processor list with function, entity name, and country of processing.
- State the change-notification SLA when a sub-processor is added, removed, or replaced.
- State the customer's right to object to a new sub-processor and the consequence of objecting (termination right, transition support).
- Confirm that sub-processors are bound by terms no weaker than the master DPA, with audit rights flowing through.

### C — Model training and data use

The default matters more than the opt-out. A vendor whose default is 'we train on customer data unless you opt out' has already exposed historical data before the contract is signed.

- State the default position on training models on customer data: yes, no, or opt-in.
- Where opt-out is available, describe the technical enforcement and the audit evidence the customer can request.
- State whether prompts, transcripts, retrieval inputs, or derived embeddings are retained for any vendor purpose beyond service delivery.
- State the retention period for each data class (audio, transcript, retrieval logs, model inputs) and the deletion SLA on customer request.
- Confirm there is no model fine-tuned on the customer's data that persists after contract termination.

### D — Identity, access, and audit

- Confirm SSO via SAML or OIDC, with named identity providers tested in production.
- Describe the role model and the principle of least privilege as applied to vendor staff with access to customer environments.
- Provide the audit log format: who, what, when, before-value, after-value. Confirm export to the customer's SIEM.
- State the cadence of internal access reviews and the artefact the customer can request.

### E — Incident response

- State the breach-notification SLA from confirmation, not from disclosure decision.
- Provide the last three customer-impacting incident summaries with root cause and remediation. Redact customer names; keep the technical detail.
- Describe the on-call coverage model: hours, escalation, named accountable engineer per severity.
- Provide the disaster-recovery RPO and RTO for the voice service and the evidence that they have been tested in the last twelve months.

### F — Compliance attestations and certifications

- Provide SOC 2 Type II for the most recent twelve-month audit period, including any qualifications.
- Provide ISO 27001 certificate and statement of applicability.
- Provide HITRUST (healthcare), PCI-DSS (payments), or sector-specific attestations as applicable.
- Confirm GDPR and UK GDPR posture, including representative in the EU and UK where required.
- State the EU AI Act position: AI-disclosure obligation in force 2026 and the high-risk obligations roadmap under the 2026 amendment.

### G — Exit, portability, and destruction

The cheapest leverage you will negotiate is at signature, before the deployment exists. The most expensive is at exit.

- Provide the data export format and the timeline from termination notice to delivery.
- Confirm proof of destruction across primary, backup, and sub-processor systems within a stated SLA.
- Confirm transition support is available for a defined window post-termination, with a published rate card.
- Confirm there is no contractual barrier to migrating prompts, intents, retrieval content, or call logs to another platform.

### DPA clauses the customer should not concede

The list of clauses that, in practice, are negotiable and worth holding the line on. These are the ones that materially change risk exposure rather than legal hygiene.

### Disqualifying answers

- Cannot or will not name the model provider for ASR, LLM, and TTS.
- Default is to train on customer data with opt-out (not opt-in).
- Sub-processor list provided but change-notification SLA is undefined.
- Breach notification SLA starts at 'disclosure decision' rather than 'confirmation'.
- Exit clause has no SLA on data export or no proof-of-destruction obligation.
- Cross-border transfer mechanism is 'we'll provide SCCs at signature' rather than the actual document attached.

### Key takeaways

- Ask for the data-flow diagram per call leg, not the headline residency statement.
- Default position on training on customer data is the single biggest risk variable — require opt-in, not opt-out.
- Treat sub-processor change notification with objection rights and transition support as a non-negotiable.
- Breach notification SLA should start at confirmation, not at disclosure decision.
- Attach the SCCs and transfer impact assessment to the DPA — not by reference.

### FAQs

**Is a SOC 2 Type II enough for voice AI procurement?**

No. SOC 2 attests to controls at the entity level; it does not tell you where speech-to-text inference physically runs, which sub-processors handle customer audio, or whether models are trained on customer data by default. Treat SOC 2 as a baseline, not as the answer.

**What is the difference between data residency and data sovereignty?**

Residency is where data is stored at rest. Sovereignty is whose laws and authorities can compel access to it. Live voice routed to an overseas inference service is a cross-border transfer even if the database never leaves the country.

**Should we accept SCCs by reference in the DPA?**

No. Attach the current version of the SCCs and the transfer impact assessment as DPA appendices. 'By reference' creates ambiguity if the standard updates and the contract is silent on which version applies.

**How much notice should sub-processor change notifications give?**

Thirty days' notice is standard. The clause that matters more is the customer objection right with termination-for-convenience and a defined transition-support obligation. Without it, the notice is informational only.

---

## Voice AI first 90 days: a week-by-week post-launch operating plan

URL: /guides/voice-ai-first-90-days
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Most voice AI deployments do not fail at launch; they fail in the operating model that congeals around them in the first 90 days. This is the week-by-week plan to install instead.

### Weeks 1–2 — stand up the cadence and the observability

Before any tuning, install the cadence. One ninety-minute weekly meeting, the same three roles, a written decision log. Confirm the conversation owner can see per-call transcript, intent labels, tool calls, latency per step, and escalation reason without engineering involvement. If they cannot, that is your week-one bug, not a week-twelve concern.

### Weeks 3–4 — baseline before you tune

Publish the baseline before changing anything. Containment, re-contact within 7 days, CSAT, cost per resolved call, latency p95. Use the exact methodology the pre-AI measurement used so the comparison is honest. Skip this and the post-launch metric will be measured against whatever produces the most flattering number.
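
One way to keep the baseline honest is to freeze the metric definition in code and run the same function on pre-AI and post-launch data. The sketch below does that for re-contact within 7 days. The definition used here (any further call from the same caller within seven days, regardless of intent) is an assumption for illustration; the pre-AI methodology, whatever it was, is the one to encode.

```python
from datetime import datetime, timedelta

RECONTACT_WINDOW = timedelta(days=7)


def recontact_rate(calls):
    """Share of calls followed by another call from the same caller within 7 days.

    `calls` is a list of (caller_id, started_at) tuples.
    """
    by_caller = {}
    for caller_id, started_at in calls:
        by_caller.setdefault(caller_id, []).append(started_at)

    total = 0
    recontacted = 0
    for times in by_caller.values():
        times.sort()
        for i, t in enumerate(times):
            total += 1
            # Only the next call matters: if it is outside the window, later ones are too.
            if i + 1 < len(times) and times[i + 1] - t <= RECONTACT_WINDOW:
                recontacted += 1
    return recontacted / total if total else 0.0
```

Running one versioned function over both periods removes the room for a more flattering definition to appear after launch.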

### Weeks 5–8 — small attributable changes

Ship no more than two intent or guardrail changes per week, each with a named metric it is meant to move and a control window. Resist the vendor pressure to bundle changes, because bundles are unattributable. Resist the executive pressure to ship the big intent expansion; that comes in month four.

- Two changes per week, no more — attributability matters more than throughput
- Each change names the metric it is meant to move and the window over which it will be measured
- Roll back any change that does not move the metric within two weeks; do not let dead changes accumulate
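
The three rules above are simple enough to encode in the decision log itself. The sketch below is one possible shape, assuming a hypothetical `Change` record and treating "week" as the ISO calendar week; neither is prescribed by this plan.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_CHANGES_PER_WEEK = 2
ROLLBACK_AFTER = timedelta(weeks=2)


@dataclass
class Change:
    name: str
    shipped_on: date
    target_metric: str   # the metric this change is meant to move
    baseline: float      # value of that metric before the change shipped
    higher_is_better: bool = True


def can_ship(changes, new_ship_date):
    """True if fewer than two changes already shipped in the same ISO week."""
    week = new_ship_date.isocalendar()[:2]
    same_week = [c for c in changes if c.shipped_on.isocalendar()[:2] == week]
    return len(same_week) < MAX_CHANGES_PER_WEEK


def should_roll_back(change, current_value, today):
    """Roll back once two weeks have passed without the metric improving."""
    if today - change.shipped_on < ROLLBACK_AFTER:
        return False  # still inside the measurement window
    if change.higher_is_better:
        improved = current_value > change.baseline
    else:
        improved = current_value < change.baseline
    return not improved
```

Any movement past the baseline counts as improvement in this sketch; a production version would likely set a minimum effect size per metric.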

### Weeks 9–12 — extension and the first board pack

Extend the intent backlog by no more than two items. Catalogue failure modes that emerged at month-two scale but not month-one. Write the first quarterly board pack: one page, five lines, no dashboard exports.

- Primary metric vs target, with one sentence on any miss
- Top three failure modes this quarter, with the change shipped against each
- Top three failure modes carried, with why they remain open
- What shipped, what is shipping next quarter
- Single forward risk most likely to compromise the next quarter

### What good looks like at day 90

### Key takeaways

- Install the cadence in week one; install observability in week two; baseline before tuning anything.
- Two attributable changes per week beats five bundled changes — roll back any change that does not move its metric within two weeks.
- Extend the intent backlog by no more than two items in the first 90 days.
- Day 90 board pack is one page, five lines — not a dashboard export.
- Operations leads the weekly review; vendor attends.

### FAQs

**Why baseline before tuning?**

Because the only honest comparison is against the methodology the pre-AI measurement used. Tuning first means every post-launch number is measured against whatever produces the most flattering comparison.

**Two changes per week feels slow. Why cap it?**

Because bundled changes are unattributable. Three or more changes in a week and you cannot tell which one moved the metric, which means you cannot roll back the dead ones.

**Should the vendor own the cadence?**

No. The contact-centre operations lead chairs the weekly review. Vendor attends, contributes, does not chair. The cadence is the operating model; the operating model belongs to the customer.

---

## Voice AI QA rubric: a call-review template the operating model can actually run

URL: /guides/voice-ai-qa-rubric
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A call-review rubric is the cheapest mechanism to make a weekly operating cadence compound. Without it, the deployment improves on whichever dimension the loudest reviewer raises. With it, improvement is directional and visible to a steering committee.

### The eight QA dimensions

Score each call sampled on the dimensions below, 1 / 3 / 5. A dimension is either applicable to the call or marked N/A; do not score it 3 by default.

### Sampling strategy that produces a defensible weekly score

A useful sample is 20 to 40 calls per week, stratified — not random and not vendor-curated. Vendor-curated samples drift toward calls that score well; random samples under-represent the failure modes that matter.
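
A minimal sketch of a stratified draw, assuming each call record carries an `outcome` (contained or escalated) and an `intent` field. Those field names and the equal-allocation rule are assumptions; the point is only that rare strata are filled before common ones dominate the sample.

```python
import random
from collections import defaultdict


def stratified_sample(calls, sample_size, seed=None):
    """Draw roughly equal numbers of calls from each (outcome, intent) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for call in calls:
        strata[(call["outcome"], call["intent"])].append(call)

    pools = list(strata.values())
    for pool in pools:
        rng.shuffle(pool)

    sample = []
    # Round-robin across strata so rare strata are represented early.
    while len(sample) < sample_size and any(pools):
        for pool in pools:
            if pool and len(sample) < sample_size:
                sample.append(pool.pop())
    return sample
```

With 80 contained billing calls, 15 escalated billing calls, and 5 escalated complaints, a sample of 30 includes all 5 complaints, where a purely random draw would expect one or two.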

### How the rubric feeds the operating cadence

The rubric does not live in a spreadsheet nobody opens. It feeds the weekly review directly: the lowest-scoring dimension becomes the top intent / guardrail change for the coming week; the highest-variance dimension becomes the calibration topic for the reviewer team.

- Weekly: aggregate scores per dimension, named lowest-scoring dimension, named highest-variance dimension
- Monthly: trend lines per dimension, calibration session for any dimension where reviewer variance exceeds one point on average
- Quarterly: rubric review — add a dimension, retire one, rewrite a level if scoring guidance has drifted
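
The weekly aggregation can be sketched as below. Each review is a mapping of dimension name to a 1 / 3 / 5 score, with `None` standing for N/A. The dimension names in the usage note are examples only, and the variance here is taken across all scores in the sample, which is a simplification: isolating reviewer variance properly needs calls that were scored by more than one reviewer.

```python
from statistics import mean, pvariance


def weekly_summary(reviews):
    """Aggregate 1 / 3 / 5 scores per dimension, skipping N/A (None)."""
    per_dimension = {}
    for review in reviews:
        for dimension, score in review.items():
            if score is None:  # N/A is excluded, never defaulted to 3
                continue
            if score not in (1, 3, 5):
                raise ValueError(f"{dimension}: score must be 1, 3, 5 or None")
            per_dimension.setdefault(dimension, []).append(score)

    if not per_dimension:
        raise ValueError("no scored dimensions in this sample")

    means = {d: mean(s) for d, s in per_dimension.items()}
    variances = {d: pvariance(s) for d, s in per_dimension.items()}
    return {
        "means": means,
        "lowest_scoring": min(means, key=means.get),
        "highest_variance": max(variances, key=variances.get),
    }
```

For example, three reviews scoring `intent capture` 5, 5, 3, `escalation hygiene` 1, 5, 5, and `data handling` 1, N/A, 3 would name `data handling` as the lowest-scoring dimension and `escalation hygiene` as the highest-variance one.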

### Calibration — keeping reviewer scores honest

Two reviewers can score the same call two or even four points apart on the 1 / 3 / 5 scale. A calibration session every month (pick five calls, everyone scores independently, reconcile in a room) is the cheapest way to keep the rubric honest. Without it, the rubric becomes a measure of reviewer identity, not call quality.

### Key takeaways

- Score eight dimensions, 1 / 3 / 5, with explicit guidance per level — no defaulting to 3.
- Sample 20–40 calls weekly, stratified across contained and escalated, never vendor-curated.
- The lowest-scoring dimension becomes next week's top change; the highest-variance dimension is the next calibration topic.
- Calibrate reviewers monthly or the rubric measures reviewer identity, not call quality.
- Rubric belongs to the customer's operating model, not the vendor.

### FAQs

**How many calls should we review per week?**

Twenty to forty, stratified across contained and escalated calls and across intent classes. Fewer than 20 and the variance dominates the signal; more than 40 and reviewer fatigue degrades scoring quality.

**Should the vendor do the QA?**

No. The QA function belongs to the customer's operating model. The vendor can contribute calls, can attend reviews, cannot own the rubric or the score.

**How is this rubric different from human-agent QA?**

It carries the same conversational and compliance dimensions but adds escalation hygiene, intent capture, and data handling — the dimensions where AI deployments fail in ways human agents do not.

---

## Voice AI to live-agent handoff: the patterns that survive production

URL: /guides/voice-ai-live-agent-handoff-patterns
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** The single most predictive measure of post-launch satisfaction is not containment; it is the experience of the escalated caller. Get the handoff right and a 30%-contained deployment outperforms a 60%-contained one with blind transfers.

### Five handoff patterns and when to use each

### What context to carry — the non-negotiables

Every handoff carries the same five fields regardless of pattern. Anything less and the caller re-explains, which is the single biggest CSAT killer at handoff.

- Identity — verified or stated, with the verification status named
- Intent — captured in machine-readable form, not just a free-text summary
- Action attempted — what the AI tried to do, and the system response if any
- Reason for escalation — named, not inferred from absence
- Caller emotional state — flagged if frustration, distress, or vulnerability signal was detected
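
The five fields above map naturally onto a small typed payload that refuses to be built incomplete. The sketch below is one hypothetical shape; the field names and the example values are illustrative, not a platform schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class HandoffContext:
    caller_identity: str
    identity_verified: bool           # verification status is named, not implied
    intent_code: str                  # machine-readable, alongside any free-text summary
    action_attempted: str
    system_response: Optional[str]    # None if the AI never reached the downstream system
    escalation_reason: str            # named explicitly, never inferred from absence
    emotional_flag: Optional[str] = None  # e.g. "frustration", "distress", "vulnerability"

    def __post_init__(self):
        required = {
            "caller_identity": self.caller_identity,
            "intent_code": self.intent_code,
            "action_attempted": self.action_attempted,
            "escalation_reason": self.escalation_reason,
        }
        missing = [name for name, value in required.items() if not value.strip()]
        if missing:
            raise ValueError(f"handoff context incomplete: {', '.join(missing)}")
```

`asdict(context)` gives a plain dictionary for a screen-pop or transfer payload. Rejecting an empty escalation reason at construction time is a deliberate choice: an incomplete handoff fails loudly in testing instead of reaching an agent.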

### Measuring handoff quality in production

Four metrics, every week, on every handoff pattern. Without them the handoff seam stays opaque and degrades silently.

- Re-explanation rate — % of escalated calls where the agent asks the caller to re-state the intent
- Handoff handle-time penalty — agent handle time on escalated calls vs baseline pre-AI calls for the same intent
- Post-handoff CSAT — escalated calls scored separately, not blended with contained
- Repeat-contact within 7 days following handoff — separate from overall re-contact
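
A minimal sketch of how the four metrics above might be computed from a week of escalated-call records. The record fields (`agent_asked_to_restate`, `agent_handle_time_s`, `csat`, `recontact_within_7d`) are hypothetical names, and the baseline handle time is assumed to be supplied per intent from pre-AI data.

```python
def handoff_metrics(escalated_calls, baseline_handle_time_s):
    """Weekly handoff metrics for one pattern and one intent."""
    n = len(escalated_calls)
    if n == 0:
        raise ValueError("no escalated calls in this window")

    re_explained = sum(1 for c in escalated_calls if c["agent_asked_to_restate"])
    recontacts = sum(1 for c in escalated_calls if c["recontact_within_7d"])
    mean_handle = sum(c["agent_handle_time_s"] for c in escalated_calls) / n
    # CSAT is optional per call; unanswered surveys are excluded, not scored as zero.
    csat_scores = [c["csat"] for c in escalated_calls if c["csat"] is not None]

    return {
        "re_explanation_rate": re_explained / n,
        "handle_time_penalty_s": mean_handle - baseline_handle_time_s,
        "post_handoff_csat": sum(csat_scores) / len(csat_scores) if csat_scores else None,
        "post_handoff_recontact_rate": recontacts / n,
    }
```

Calling it once per handoff pattern and intent keeps escalated calls out of the blended numbers.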

### What the supervised-handoff pattern actually costs

Every new intent gets the supervised pattern for the first 30 days post-launch, extended to 90 days for regulated or high-risk intents. A supervisor (usually a senior contact-centre operator) listens to a sample of live AI calls and flags low-confidence turns for live intervention. The cost is 0.1 to 0.2 FTE per concurrent live deployment for the first quarter. Skipping it saves that FTE, then spends roughly twice as much on post-launch firefighting.

### Key takeaways

- Handoff experience predicts post-launch CSAT more reliably than containment rate.
- Five named patterns cover the realistic cases — match pattern to intent and queue depth, not vendor preference.
- Every handoff carries identity, intent, action attempted, escalation reason, and emotional state — no exceptions.
- Measure re-explanation rate, handoff handle-time penalty, post-handoff CSAT, and post-handoff re-contact separately every week.
- Run the supervised pattern for 30 days on new intents, 90 days on regulated ones.

### FAQs

**Why is warm transfer not always the right answer?**

Because warm transfer at high volume queues callers behind agents who have to read the summary before answering. For simple intents, screen-pop is faster and the context still arrives in time.

**What is the single biggest cause of bad handoffs?**

The transcript summary is too long. Agents do not read past line three. Two lines and a structured intent code is the right shape for cold transfer; three to five lines for warm.

**How long should the supervised pattern run for a new intent?**

Thirty days for low-risk intents, ninety days for regulated or high-risk intents. Removing the supervisor before failure modes have surfaced at scale is the most common cause of post-launch incident clusters.

---

## Voice AI latency budget: where the milliseconds actually go

URL: /guides/voice-ai-latency-budget
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Latency does not degrade evenly. It collapses one step at a time, and the step is usually retrieval, not the model. This is the per-step budget and the diagnostic that finds the regression in minutes, not weeks.

### The end-to-end budget — 1.0 second p95 target

A defensible production target is 1.0 second p95 end-to-end turn latency under realistic load. The budget below allocates that across six steps and leaves headroom for the unpredictable.

### Where regressions usually come from

In production, latency regressions cluster in a small set of causes. The diagnostic order below catches most within an hour.

### Barge-in and the latency budget

Barge-in adds a separate budget: from the moment the caller starts speaking over the AI to the moment the AI stops outputting audio, the target is under 250ms p95. Above 500ms and barge-in reads as the AI ignoring the caller, which is worse than no barge-in at all.

### What to measure weekly

- End-to-end p95 turn latency under realistic load — not single-threaded demo
- Per-step p95 latency for ASR, retrieval, LLM, TTS, telephony
- Barge-in p95 (caller speech onset to AI audio stop)
- p99 jitter and packet loss on the telephony leg
- Latency outliers (above the stretch budget) tagged by step and root cause
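
The weekly numbers above can be produced from per-turn timing records with a few lines of code. The sketch below uses the nearest-rank percentile method and assumes each turn is a dictionary of per-step milliseconds keyed `asr`, `retrieval`, `llm`, `tts`, `telephony`. Summing the steps to get end-to-end latency is a simplification: in a streaming pipeline the steps overlap, so a measured end-to-end timestamp is preferable where one exists.

```python
import math

STEPS = ("asr", "retrieval", "llm", "tts", "telephony")


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def weekly_latency_report(turns, target_ms=1000):
    """Per-step and end-to-end p95, with the mean shown for contrast."""
    end_to_end = [sum(turn[step] for step in STEPS) for turn in turns]
    report = {f"{step}_p95": percentile([t[step] for t in turns], 95) for step in STEPS}
    report["end_to_end_p95"] = percentile(end_to_end, 95)
    report["end_to_end_mean"] = sum(end_to_end) / len(end_to_end)
    report["within_target"] = report["end_to_end_p95"] <= target_ms
    return report
```

With eighteen turns at 800ms and two at 2400ms caused by slow retrieval, the mean is 960ms while the p95 is 2400ms, and the per-step view points straight at retrieval.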

### Key takeaways

- Target 1.0 second p95 end-to-end turn latency under realistic load — not mean, not single-threaded.
- Retrieval is the most common regression cause; LLM provider is the most common cause executives blame first.
- Barge-in has a separate budget — under 250ms p95 from caller speech onset to AI audio stop.
- Measure per-step p95 weekly, not just the end-to-end number.
- Streaming end-to-end (ASR → LLM → TTS) is not optional at the 1.0-second target.

### FAQs

**Is 1.0 second p95 realistic?**

Yes, but it requires streaming end-to-end (ASR final streaming into LLM, LLM streaming into TTS), disciplined retrieval design, and a carrier-grade telephony path. Most production deployments without those land at 1.4–1.8 seconds and read as slightly laggy.

**Why is mean latency a misleading number?**

Because callers experience tail latency, not mean. A deployment with mean 800ms and p95 2400ms feels broken on roughly one turn in twenty, and that is the call the executive hears about.

**How much headroom should the budget leave?**

Five to ten percent for unpredictable network and carrier variance. The stretch column in the table above is what a single bad call leg can sustain without breaking the conversation.

---

## Voice AI kill criteria: when to stop a pilot, in writing, before it starts

URL: /guides/voice-ai-kill-criteria
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A pilot without kill criteria is not a pilot; it is a permanent project waiting to be re-labelled. Pre-commit the five binary gates below, signed by the four people who can call the no, before the first integration ticket is opened.

### The five kill gates

Each gate is binary: pass or fail. A single fail at the decision date triggers a stop conversation with the four named decision-makers. Two fails is an automatic stop.

### Who can call the no

Four roles, signed at kickoff, named in the success contract. Any one can call the conversation; consensus of three is needed to actually stop.

- Contact-centre operations lead — closest to the caller experience
- Finance sponsor — owns the business case
- Transformation sponsor — owns the programme delivery
- Compliance lead — owns the regulatory exposure
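
The decision rule described in this guide (one fail opens a stop conversation, two fails stop automatically, and a stop after a single fail needs three of the four roles) can be written down as a few lines of logic. The sketch below is illustrative; the gate names in the usage note are placeholders inferred from the FAQ in this guide, not a definitive list.

```python
DECISION_ROLES = (
    "contact-centre operations lead",
    "finance sponsor",
    "transformation sponsor",
    "compliance lead",
)


def gate_outcome(gate_results):
    """Map five binary gate results to the pre-committed consequence."""
    if len(gate_results) != 5:
        raise ValueError("expected exactly five gates")
    fails = sorted(name for name, passed in gate_results.items() if not passed)
    if len(fails) >= 2:
        return "automatic stop", fails
    if len(fails) == 1:
        return "stop conversation", fails
    return "continue", fails


def stop_confirmed(votes_to_stop):
    """After a single fail, stopping needs at least three of the four named roles."""
    if set(votes_to_stop) != set(DECISION_ROLES):
        raise ValueError("every named role must record a vote")
    return sum(bool(v) for v in votes_to_stop.values()) >= 3
```

For example, results of `{"containment": True, "re-contact": False, "cost": True, "operating cadence": True, "compliance": True}` return a stop conversation naming the re-contact gate. Writing the rule down before kickoff is the point: the outcome at the decision date is then a lookup, not a negotiation.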

### What 'failing the wrong way' looks like

Some pilots fail the right way: the gate is hit, the metric is honest, the decision is no. Those are useful. Other pilots fail the wrong way — they drift past the gate, the metric is contested, the decision is postponed. The drift state has three recognisable signs.

- Methodology debates appear in steering committee at the same time as failing numbers
- Scope is quietly redefined to exclude the intent where the AI underperforms
- Decision date slips by 'just two more weeks' more than once

### The reset, not the failure

A killed pilot is not a failed programme. The reset has a defined shape: smaller scope (one intent class, not three), tighter timeline (eight weeks, not twelve), narrower vendor list (the one that did not lose on integration depth). The reset starts faster than the original because the operating model, observability, and compliance work survive the kill.

### Key takeaways

- Pre-commit the five gates in writing, signed by four named decision-makers, before the first integration ticket.
- Two failed gates is an automatic stop; one is a stop conversation.
- Re-contact within 7 days is the most commonly failed gate; containment is the most commonly contested.
- Methodology debates plus failing numbers plus a slipped decision date is the drift state — call the no.
- A killed pilot is not a failed programme; the reset starts faster than the original.

### FAQs

**Why pre-commit kill criteria before the pilot starts?**

Because in-flight criteria are unenforceable. The team that designs the criteria once the numbers are bad cannot help but design ones the numbers pass. Pre-commitment removes the conflict of interest.

**Five gates feels like a lot — why not three?**

Because each gate covers a different failure mode. Containment alone hides re-contact, cost alone hides the operating model, operating cadence alone hides compliance. The five together are the smallest set that catches the realistic failures.

**What is the most common kill-gate failure?**

Re-contact within 7 days. Containment looks fine in week eight; re-contact tells you in week ten that the contained calls were not actually resolved.

---

## Voice AI board pack: the one-page template for the steering committee

URL: /guides/voice-ai-board-pack
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** If the executive sponsor cannot read the entire report in three minutes, the report fails. One page, five lines, written prose, no dashboard exports.

### The five-line template

Each line is one paragraph, two to four sentences. No charts, no tables, no exported tiles. The discipline of writing it produces sharper thinking than any dashboard ever will.

### The four questions sponsors actually ask

Every steering committee asks the same four questions in some form. The pack should answer each before the question is asked.

- Are we on track against the business case? — primary metric line
- What are we doing about the thing that broke last month? — carried failure-mode line
- What is the next big risk? — forward-risk line
- What happens if we pull funding? — implied in the carried failure-mode line; make it explicit in the appendix

### What not to include

- Dashboard exports — the sponsor will not scroll
- Vendor-supplied slides — the sponsor cannot judge them
- Quarter-on-quarter charts before the second quarter — there is no trend yet
- The full failure-mode catalogue — top three carried, top three shipped, nothing more
- Long-form vendor evaluation language — that lives in the procurement record, not the operating board pack

### The appendix — what belongs there

Behind the one page sits a structured appendix the sponsor can open if they want to. Three things, in order.

- Methodology note — how every number on the front page is calculated, dated to the current quarter
- Risk register — full risk list with owner, mitigation, and date; the front page lifts only the top one
- Kill-criteria status — each of the five gates, current status, trend versus last quarter

### Key takeaways

- One page, five lines, written prose — no dashboard exports.
- Primary metric, top three shipped, top three carried, what shipped / what is shipping, forward risk.
- Always pair containment with re-contact; never report one without the other.
- Appendix carries the methodology note, full risk register, and kill-criteria status.
- Operations lead writes the pack; vendor contributes, does not author.

### FAQs

**Why five lines and not more?**

Because the constraint forces prioritisation. A ten-line pack lets the team include the comfortable lines; the five-line pack forces the uncomfortable ones to the front.

**Who writes the board pack?**

The operations lead, not the vendor. The pack is the customer's narrative about the programme, not the vendor's report on it.

**What is the most common board-pack mistake?**

Reporting containment without re-contact. Containment alone reads as good news; the moment re-contact is added, the picture is honest. Steering committees notice when the second number is missing.

---

## Voice AI for FCA-regulated contact centres: a Consumer Duty compliance checklist

URL: /guides/voice-ai-fca-consumer-duty-compliance
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Voice AI in FCA-regulated contact centres must clear Consumer Duty, SYSC 10A call-recording, UK GDPR/ICO, and SM&CR accountability gates before any capability conversation. The single most-failed gate is vulnerable customer detection — most platforms cannot evidence it to a complaint handler's standard.

### The four veto questions before any capability scoring

FCA-regulated firms routinely score voice AI vendors on capability (containment rate, latency, integration depth) and only address compliance once a preferred vendor is chosen. By that point the firm has spent six figures on procurement, only to discover that the platform cannot evidence vulnerable customer handling to a complaint handler's standard. The deal collapses or, worse, ships with risk that surfaces in a Section 166 skilled person review.

Reorder the evaluation. Run these four veto questions before any vendor demo. A vendor that fails any of them is out of the process, no matter how strong the headline metrics.

### Consumer Duty (PRIN 2A) on AI-handled calls

Consumer Duty applies on outcome, not on channel. The four outcomes — products and services, price and value, consumer understanding, and consumer support — are tested against the customer's experience, regardless of whether a human or a voice AI delivered the service.

Two outcomes carry the highest risk in voice AI deployments. Consumer understanding requires that the customer can act on the information they receive — a voice AI that uses jargon, speaks too quickly, or fails to confirm understanding fails this outcome. Consumer support requires that customers can access support that meets their needs — a voice AI that loops a vulnerable customer through three failed intents before transferring fails this outcome even if the eventual human resolved it.

Evidence is the FCA's currency. The deployment must log, per call, the outcome the AI delivered, any signals of harm or vulnerability detected, and the action taken in response. A platform that cannot export that data set is not Consumer Duty-ready.
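As a rough illustration of the per-call data set described above, the record might look like the sketch below. The class and field names (`CallEvidence`, `outcome`, `harm_signals`, `action_taken`) are hypothetical; they are not drawn from any platform or from an FCA schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CallEvidence:
    """Hypothetical per-call evidence record; field names are illustrative."""
    call_id: str
    outcome: str                                            # what the AI delivered
    harm_signals: list[str] = field(default_factory=list)   # harm or vulnerability signals
    action_taken: str = "none"                              # response to those signals

    def is_complete(self) -> bool:
        # Signals with no recorded action are an evidence gap.
        return bool(self.outcome) and not (self.harm_signals and self.action_taken == "none")


def export_jsonl(records: list[CallEvidence]) -> str:
    """Serialise records as JSON Lines, one call per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


record = CallEvidence("c-001", "payment_plan_explained", ["mentions_job_loss"], "warm_transfer")
gap = CallEvidence("c-002", "balance_quoted", ["mentions_bereavement"])
```

Here `record.is_complete()` holds and `gap.is_complete()` does not, because `gap` logged a signal but no action. The useful question for a vendor is whether the firm can run an export like `export_jsonl` itself, without raising a support ticket.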

### SYSC 10A: call recording, retention, and the AI-only call

SYSC 10A requires firms to record telephone conversations and electronic communications relating to in-scope activities and to retain them for at least five years, or up to seven where the FCA requests it. The rule was written for human-handled calls and is silent on AI-only calls, which is precisely where most firms get this wrong.

The FCA's position, confirmed in supervisory letters, is that an AI-handled call that would have been recorded if a human took it must be recorded if the AI takes it. A platform that retains a transcript or a model-generated summary but not the underlying audio is not compliant for SYSC 10A purposes. The audio is the record.

Two operational consequences follow. First, the platform must write audio to a retention store the firm controls, not a vendor-managed bucket the firm cannot subpoena. Second, the retrieval workflow must work for AI-only calls the same way it works for agent-handled calls — a Subject Access Request that returns audio for the human-handled half of a session but only a transcript for the AI-handled half is a finding.

### UK GDPR, the ICO, and automated decision-making

Three UK GDPR provisions bite hardest on voice AI deployments. Article 6 (lawful basis) is usually straightforward — legitimate interest or contract performance covers most contact-centre handling. Article 9 (special category data) becomes live the moment the AI handles a health, financial-vulnerability, or biometric voice-print signal; the firm needs an Article 9 condition and an appropriate policy document.

Article 22 (automated decision-making) is the one most firms misread. It applies only where the decision produces legal effects or similarly significantly affects the customer. A voice AI that routes a call or schedules a callback does not trigger Article 22. A voice AI that approves or declines a loan top-up, accepts or rejects a claim, or applies a fee waiver does. For Article 22 use cases, the customer has a right to human review, the firm must explain the logic involved, and a DPIA is mandatory.
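A per-intent triage of this kind can be written down explicitly. The sketch below is a minimal illustration, not legal advice: the `Intent` class and the intent names are hypothetical, and the `solely_automated` flag reflects Article 22's wording that the decision must be based solely on automated processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    name: str
    solely_automated: bool     # no meaningful human involvement in the decision
    significant_effect: bool   # legal or similarly significant effect on the customer


def article_22_in_scope(intent: Intent) -> bool:
    """Both conditions must hold; a triage aid only."""
    return intent.solely_automated and intent.significant_effect


intents = [
    Intent("route_call", solely_automated=True, significant_effect=False),
    Intent("schedule_callback", solely_automated=True, significant_effect=False),
    Intent("decline_loan_top_up", solely_automated=True, significant_effect=True),
    Intent("apply_fee_waiver", solely_automated=True, significant_effect=True),
]
needs_human_review = [i.name for i in intents if article_22_in_scope(i)]
```

Only the last two intents land in `needs_human_review`. Keeping the table in version control gives the DPO a dated record of how each intent was classified.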

The ICO's 2023–2024 guidance on AI and data protection sets a higher bar than many vendors assume. Expect to evidence: a DPIA covering the model and its training data; a record of testing for bias and accessibility; a written policy on how the firm explains AI decisions to customers; and a route for the customer to contest a decision and reach a human.

### SM&CR: who is accountable when the AI gets it wrong

Under the Senior Managers and Certification Regime, every prescribed responsibility must sit with a named Senior Manager Function holder. A voice AI deployment touches at least three: SMF3 (executive director) or SMF1 (CEO) at the strategic level, SMF16 (compliance oversight) for the regulatory framing, and SMF17 (MLRO) where the AI handles transactional flows that could touch financial crime. SMF24 (chief operations) is where most firms land day-to-day accountability.

The deployment paperwork must name the SMF holder accountable for outcomes and reference the deployment in their Statement of Responsibilities. A vendor cannot accept this accountability — it stays with the firm. The vendor's contract should, however, give the named SMF the access and information they need to discharge it: real-time outcome data, transcripts, audio, and an escalation path for incidents.

### Vulnerable customer detection: the most-failed gate

FG21/1 sets out the FCA's expectations on the fair treatment of vulnerable customers. The expectation is not that the firm prevents harm in every case — it is that the firm can evidence it identified vulnerability signals, made a proportionate decision, and routed the customer appropriately.

Voice AI platforms fall into three tiers on this.

- Tier 1 platforms detect no vulnerability signals and have no logging of routing decisions; they are not deployable in an FCA-regulated contact centre without a wrapper that adds these capabilities.
- Tier 2 platforms detect a limited signal set (typically explicit phrases like "struggling" or "can't pay") and route on a hard rule; they are deployable but the firm must own the rule logic and document it.
- Tier 3 platforms detect a broader signal set (prosody, hesitation, repeat-contact within a short window, drivers of vulnerability under FG21/1's four categories) and produce per-call evidence; these are the only platforms that can be deployed without a compensating control.

The veto question is not whether the platform claims to detect vulnerability — every vendor claims this. The veto question is whether, given a customer who later complained about being mis-handled while vulnerable, the platform can produce the signals it saw, the decision it made, and the route it took. If the answer requires the vendor to query their own logs and email a PDF, the firm is not in control of its own evidence.

### What to add to your voice AI RFP

Bake the regulatory questions into the RFP at the same weight as capability. Vendors who cannot answer these in writing should not progress past first round.

- Provide a worked example of how a vulnerable customer signal is detected, logged, and routed. Include the data fields exported per call.
- Confirm where call audio is stored, who controls the retention store, and how the firm retrieves audio for a SAR within statutory deadlines.
- Provide your DPIA template covering model training data, bias testing, and accessibility testing.
- Confirm whether any deployed flow falls within UK GDPR Article 22, and provide the human-review workflow if so.
- Confirm the platform can produce, per call, the data set a complaint handler needs: transcript, audio, intent classification, vulnerability signals, decision logic, and routing decision.
- Provide the contractual access and information rights the named SMF holder will have, including incident-notification SLAs.
- Confirm SOC 2 Type II, ISO 27001, and Cyber Essentials Plus where applicable. The FCA expects evidence, not assertions.

### Key takeaways

- Consumer Duty applies to AI-handled calls the same way it applies to human-handled calls — outcome, not channel, is what is tested.
- SYSC 10A requires call recording and 5-year retention; an AI that summarises but does not retain the underlying audio is not compliant on its own.
- Vulnerable customer detection must be evidenceable, not just claimed — log the signals, the decision, and the route taken.
- ICO guidance on automated decision-making (UK GDPR Art. 22) applies if the AI takes a decision with significant effect — most contact centre flows do not, but lending and claims do.
- Every voice AI deployment in an FCA firm needs a named SMF holder (typically SMF24 or SMF3) accountable for outcomes — name them in the deployment paperwork or the firm is in breach of SM&CR.

### FAQs

**Does Consumer Duty apply to AI-handled calls?**

Yes. Consumer Duty applies on outcome, not on channel — an AI-handled call is tested against the same four outcomes as a human-handled call. The firm must be able to evidence the outcome it delivered to each customer.

**Do we need to record calls the AI handles end-to-end?**

Yes, if the call would have been recorded had a human taken it. SYSC 10A applies to the activity, not the handler. The audio, not a transcript or a summary, is the record, and must be retained for at least five years (up to seven where the FCA requests it).

**When does UK GDPR Article 22 apply to voice AI?**

Article 22 applies when the AI's decision produces legal effects or similarly significantly affects the customer — loan approvals, claim decisions, fee waivers. Routine routing, scheduling, and information delivery do not trigger Article 22, but they still need Article 6 (and often Article 9) cover.

**Who is the named accountable Senior Manager for a voice AI deployment?**

Most firms place day-to-day accountability with SMF24 (chief operations), with SMF16 (compliance oversight) for regulatory framing. The deployment must appear in the named SMF's Statement of Responsibilities — a vendor cannot accept this accountability on the firm's behalf.

**What is the most common reason a voice AI fails FCA scrutiny?**

Inability to evidence vulnerable customer handling to FG21/1 standards. Vendors claim detection capability, but the test is whether — given a later complaint — the platform can produce the signals it saw, the decision it made, and the route it took. Most cannot.

---

## Agentic voice AI in the enterprise: what's real in 2026

URL: /guides/agentic-voice-ai-enterprise
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Agentic voice AI means the voice agent can plan multi-step work, call tools against systems of record, and recover from failure mid-call — not just answer questions from a knowledge base. In production today it works for bounded transactional intents; it does not yet work for open-ended judgement-heavy calls, and most vendor demos blur the line.

### What 'agentic' actually means in a voice context

An agentic voice AI plans a sequence of steps to complete a customer request, executes tool calls against external systems, observes the result, and adapts. A non-agentic voice AI follows a pre-authored flow with conditional branches. Both can sound similar in a demo; they behave very differently under failure.

The defining test is not language fluency. It is whether the system can recover when the call leaves the expected path (a 500 from the CRM, a partially fulfilled order, a customer who changes their mind mid-flow) without escalating or hallucinating a confirmation.

### What works in production in 2026

Three classes of intent are reliably deployable agentically today: read-heavy authentication and lookup chains, single-system write actions with strong schemas (appointment booking, address update, payment-method change), and orchestrated read-then-route flows where the AI gathers context before warm-transferring.

The common pattern is bounded scope, idempotent writes, and a system of record with a stable API contract. Where any of those is missing, the agentic layer adds risk faster than it adds value.

- Authenticate-then-lookup chains across two or three systems of record
- Single-system writes with idempotency keys (booking, update, cancellation)
- Read-then-route handoff that pre-fills the agent desktop
- Outbound confirmation, reschedule, and reminder workflows
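The idempotency-key pattern named in the list above can be sketched in a few lines. This is a toy stand-in, assuming the system of record deduplicates on a caller-supplied key; `BookingSystem` and `make_key` are hypothetical names, and a real API would enforce the key server-side.

```python
import hashlib


class BookingSystem:
    """Stand-in for a system of record that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}

    def book(self, idempotency_key: str, customer_id: str, slot: str) -> dict:
        # A repeated key returns the original result instead of writing twice.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        booking = {
            "booking_id": f"b-{len(self._by_key) + 1}",
            "customer_id": customer_id,
            "slot": slot,
        }
        self._by_key[idempotency_key] = booking
        return booking


def make_key(call_id: str, action: str, payload: str) -> str:
    """Derive a stable key so a retry within the same call reuses it."""
    return hashlib.sha256(f"{call_id}|{action}|{payload}".encode()).hexdigest()


system = BookingSystem()
key = make_key("call-42", "book", "cust-7|2026-07-01T10:00")
first = system.book(key, "cust-7", "2026-07-01T10:00")
retry = system.book(key, "cust-7", "2026-07-01T10:00")  # e.g. after a timeout
```

The retry returns the same booking as the first attempt, so a timeout followed by a second tool call cannot double-book the customer.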

### What does not work yet

Open-ended complaint handling, multi-policy judgement calls, and any workflow that requires reasoning over conflicting source documents are not reliable in production. Vendors will demo them; production data shows they degrade fast on the long tail.

Cross-system write orchestration — where the agent has to write to two or three systems and reconcile partial failures — is the most common place agentic deployments break. Most contact centres do not have the API hygiene to support it, and the AI cannot fix that.
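One way to make partial failure visible is to record a result per write and derive what the agent may say to the customer from that record. The sketch below assumes each write is a callable returning `True` on success; `run_writes` and the status strings are hypothetical.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]  # (system name, write returning True on success)


def run_writes(steps: list[Step]) -> dict:
    """Run each write, keep a per-step audit entry, never confirm on a partial."""
    audit = []
    for system, write in steps:
        try:
            ok = write()
        except Exception:
            ok = False  # a raised error counts as a failed step
        audit.append({"system": system, "ok": ok})
    all_ok = all(entry["ok"] for entry in audit)
    return {
        "audit": audit,
        "say_to_customer": "confirmed" if all_ok else "not_confirmed_escalate",
    }


def failing_billing_write() -> bool:
    raise TimeoutError("billing API timed out")


result = run_writes([
    ("crm", lambda: True),
    ("billing", failing_billing_write),
])
```

Here the CRM write succeeds and the billing write fails, so the agent is not allowed to tell the customer the action is complete. Reconciling the CRM write that did land is still a job for the systems of record; the sketch only stops the false confirmation.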

### The four-question procurement test

Before scoring an agentic voice AI on capability, gate it on these four questions: real write, idempotency, audit, and kill switch. A 'no' on any one means the platform is a demo today, regardless of how the call sounded.

### How agentic changes the operating model

An agentic deployment shifts the operating-model centre of gravity from prompt and flow authoring to tool-contract management. The team that owns the deployment now owns the API surface the agent calls — schemas, versioning, deprecation, rate limits, error semantics. Most contact-centre teams do not own this; engineering does. Get the RACI explicit before signing, not after.

### Key takeaways

- Agentic = planning + tool use + recovery, not just better dialogue.
- It works today on bounded transactional intents with idempotent writes.
- It does not yet work for open-ended judgement-heavy calls.
- Gate procurement on four questions: real write, idempotency, audit, kill switch.
- The operating-model centre of gravity moves to tool-contract management.

### FAQs

**Is agentic voice AI just voice AI with better marketing?**

No. The substantive difference is tool use under planning: the agent decides which tool to call next based on what it observed from the previous tool call, rather than following a pre-authored branch. The marketing is overheated, but the architecture is real.

**Do we need an MCP server to deploy agentic voice AI?**

Not necessarily. Most enterprise platforms in 2026 still expose tools via proprietary connectors or direct API integration. MCP adoption is rising but not a procurement gate yet.

**What is the biggest failure mode of agentic voice AI in production?**

Silent partial success — the agent reports the action as complete to the customer when one of the downstream writes failed. Idempotency keys and explicit per-step confirmation in the audit trail are the primary defence.

**Should we wait until agentic voice AI is more mature?**

Deploy non-agentic flows on broad intents and agentic flows on the two or three bounded use cases where the four-question test passes. Waiting for general-purpose agentic maturity means waiting indefinitely.

---

## Enterprise voice AI vendor comparison: 2026 buyer's guide

URL: /guides/voice-ai-vendor-comparison-2026
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Vendor comparison only works once you put each vendor in the right category. Comparing a contact-centre platform incumbent against a voice-AI-native start-up on the same matrix overweights capability and underweights the things that actually determine a five-year outcome: roadmap independence, integration depth, and the operating model the vendor implicitly forces on you.

### The four 2026 vendor categories

Every enterprise voice AI vendor in 2026 sits in one of four categories. The category drives the deal shape, the integration burden, the operating-model assumption, and the lock-in profile far more than the feature list does.

- Contact-centre platform incumbents — voice AI bundled into a CCaaS suite you may already own. Lowest integration burden, highest roadmap dependency.
- Voice-AI-native platforms — purpose-built for high-volume contained voice, usually with their own evaluation tooling. Best containment ceiling, requires deliberate integration work.
- Agent-builder toolkits — frameworks for assembling voice agents from components (STT, LLM, TTS, orchestration). Highest control, requires real engineering ownership.
- Telephony-led upstarts — strong on call quality, telco integration, and barge-in handling; weaker on enterprise governance and observability. Best fit for outbound and high-volume transactional inbound.

### How to compare within category

Score within category on the nine VERA dimensions: integration depth, latency, control surface, operating-model fit, observability, safety and compliance, voice quality, telephony and channel reach, and commercial model. Weight integration depth, operating-model fit, observability, and safety at roughly 60% of the total — these are the dimensions where production deployments succeed or fail.

Demo quality is a tie-breaker, not a primary axis. Voice quality has converged across platforms over the last 18 months, so it should carry no more than 10% of the score.
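To make the weighting concrete, here is a minimal scoring sketch. The 60% block and the 10% cap come from the text; the even split inside each group, the 0-5 scale, and both example vendors are assumptions for illustration only.

```python
WEIGHTS = {  # percent of total; the split within each group is an assumption
    "integration_depth": 15.0,
    "operating_model_fit": 15.0,
    "observability": 15.0,
    "safety_and_compliance": 15.0,      # these four total 60%
    "voice_quality": 10.0,              # capped at 10%
    "latency": 7.5,
    "control_surface": 7.5,
    "telephony_and_channel_reach": 7.5,
    "commercial_model": 7.5,
}
assert sum(WEIGHTS.values()) == 100.0


def weighted_score(scores: dict[str, float]) -> float:
    """Scores are 0-5 per dimension; returns a 0-5 weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / 100.0


# Hypothetical vendors: one demos well, one is solid where production is decided.
demo_star = {d: 2.0 for d in WEIGHTS} | {"voice_quality": 5.0, "latency": 5.0}
steady = {d: 4.0 for d in WEIGHTS} | {"voice_quality": 3.0, "latency": 3.0}
```

With these weights `steady` outscores `demo_star` despite the weaker demo, which is the behaviour the weighting is meant to produce.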

### How to choose between categories

The category choice is an operating-model choice. A CCaaS incumbent is the right answer if you do not want to own the evaluation and improvement loop and you can accept the roadmap dependency. A voice-AI-native platform is the right answer if containment ceiling and observability matter more than incremental integration cost. An agent-builder toolkit is the right answer if you have engineering ownership and want the lowest unit cost at scale. A telephony-led upstart is the right answer for outbound and bounded transactional inbound where call quality is the differentiator.

Choosing between categories on capability score alone produces the wrong answer about 70% of the time, because category determines who has to do the work — and that's the variable enterprise programmes most consistently mis-estimate.

### Questions to ask every vendor before shortlisting

Use this list to disqualify before the demo, not after. Any vendor that cannot answer all of these in writing within five business days is signalling something about their enterprise-readiness.

### The mistakes that recur

Three patterns show up in nearly every losing procurement. First, scoring the demo at 30%+ of the total: what matters is production behaviour on your call mix, not performance on the curated set. Second, treating 'integrations' as a logo count rather than read/write depth against your specific systems of record. Third, deferring the operating-model question to implementation, by which point the answer is whatever the vendor wants it to be.

### Key takeaways

- Group vendors into four categories before scoring — CCaaS incumbent, voice-AI-native, agent-builder, telephony-led.
- Within category, weight integration depth, operating-model fit, observability, and safety at ~60% of the score.
- Between categories, the choice is operating-model, not capability.
- Six pre-shortlist questions filter out most enterprise-immature vendors before the demo.
- Three vendors on the PoV is the right number — two is too narrow, five too shallow.

### FAQs

**Should we shortlist across categories or within one?**

Shortlist within category, then make the category choice deliberately. Cross-category shortlists tend to over-index on whoever demos best, which is rarely whoever performs best in production.

**How many vendors should be on the shortlist?**

Three is the right number for a defensible PoV. Two does not give you a real comparison; five dilutes the depth of evaluation each vendor gets.

**Is open-source voice AI an option for enterprise?**

For specific layers, yes — open-weight LLMs, open-source STT, and open-source orchestration are production-credible in 2026. The full open-source agent-builder stack is viable only for organisations with serious in-house engineering ownership.

**How long should a vendor comparison take end-to-end?**

Twelve to fourteen weeks: two weeks to scope and score the long list, two weeks to shortlist on written answers, six to eight weeks of PoV against real systems of record, and two weeks to decide.

---

## Legacy IVR replacement: migrating off Nuance-era platforms to modern voice AI

URL: /guides/nuance-ivr-replacement-migration
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Legacy IVR platforms — Nuance, Genesys-bundled equivalents, and other DTMF-plus-directed-dialogue stacks — do not migrate to modern voice AI by export. They migrate intent by intent, with a parallel run against the legacy flow as the safety net, and a measured containment and CSAT gate before each intent is cut over.

### Why lift-and-shift fails

Grammar-and-prompt IVR assets encode a different interaction model than a generative voice agent. Slot-filling grammars, recognition thresholds, and confirmation prompts assume a turn-based, recognition-first architecture. A generative voice agent expects to plan and converse. Porting the prompts produces a worse generative agent than starting from the intent specification.

The reusable artefacts are upstream of the IVR: the intent taxonomy, the call-driver analysis, the integration map, and the recorded call corpus. Those compound. The flow files themselves are largely throwaway.

### The four-phase migration

A defensible enterprise migration runs in four phases. The aggressive timeline is nine months for a focused programme; eighteen months is realistic for a regulated enterprise with multiple business units.

### The per-intent cutover gate

Each intent gets cut over only when it passes a written gate. The gate is not 'the new flow works'; it is 'the new flow performs at parity or better on the metrics the business measures the legacy flow on.' Without a written gate, scope creep and political pressure decide cutover order, which is how migrations stall.

- Measured containment at or above legacy baseline on a representative call sample
- CSAT at or above legacy baseline over a minimum 14-day window
- 7-day re-contact rate at or below legacy baseline
- Integration error rate within agreed tolerance on the systems of record
- Audit trail evidenced for a sampled set of calls including any vulnerability or compliance edge case
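The written gate above translates naturally into a check that returns the criteria an intent still fails. This is a sketch under assumptions: the `IntentMetrics` fields, the tolerance value, and all the numbers are illustrative, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class IntentMetrics:
    containment: float             # share of calls resolved without a human, 0-1
    csat: float                    # on whatever scale the business already uses
    csat_window_days: int
    recontact_7d: float            # 7-day re-contact rate, 0-1
    integration_error_rate: float
    audit_sample_evidenced: bool


def cutover_failures(new: IntentMetrics, legacy: IntentMetrics, error_tolerance: float) -> list[str]:
    """Return the gate criteria the new flow fails; an empty list means cut over."""
    failures = []
    if new.containment < legacy.containment:
        failures.append("containment below legacy baseline")
    if new.csat < legacy.csat or new.csat_window_days < 14:
        failures.append("CSAT below baseline or window under 14 days")
    if new.recontact_7d > legacy.recontact_7d:
        failures.append("7-day re-contact above legacy baseline")
    if new.integration_error_rate > error_tolerance:
        failures.append("integration error rate outside tolerance")
    if not new.audit_sample_evidenced:
        failures.append("audit trail not evidenced on sample")
    return failures


legacy = IntentMetrics(0.62, 4.1, 90, 0.18, 0.004, True)
candidate = IntentMetrics(0.71, 4.2, 10, 0.21, 0.003, True)
failures = cutover_failures(candidate, legacy, error_tolerance=0.005)
```

The candidate beats legacy on containment but still fails two criteria: the CSAT window is only 10 days, and re-contact is higher than baseline. That is the case the gate exists to catch.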

### The operating-model shift

A legacy IVR is owned by a small team of specialists who change it monthly via a release process. A generative voice agent is owned by a cross-functional pod that ships changes weekly, monitors call-level metrics daily, and reacts to drift in hours. The org chart has to change before the technology does, or the new stack reverts to legacy operating cadence and stops improving.

The single most common reason 'modern voice AI' programmes underperform is that the operating-model change was deferred to phase two and then never happened.

### What to negotiate into the new contract

Use the migration as commercial leverage. The vendor wants the logo; you want the optionality.

- Exit assistance: a written commitment to export call data, transcripts, intent labels, and configuration in machine-readable form at any point during the contract.
- Per-resolution pricing on at least one intent: transfers containment risk to the vendor and produces a clean economic comparison against the legacy baseline.
- Operating-model SLA: defined response times for vendor-side issues that block your weekly release cadence, not just for platform incidents.

### Key takeaways

- Grammar-and-prompt assets do not port — the intent taxonomy and integration map do.
- Migrate intent by intent over 9–18 months, with a parallel-run per-intent gate.
- The operating model has to change before the technology does, not after.
- Per-resolution pricing on at least one intent transfers containment risk to the vendor.
- Keep the cross-platform router under your control — it is the kill switch.

### FAQs

**Can we keep the legacy IVR for some intents and move others?**

Yes, and most enterprise migrations end up there for at least 12–18 months. Hybrid is the realistic steady state during migration, not a failure mode.

**How do we route between the legacy and new platforms during migration?**

A thin router in front of both, configured per intent and per traffic percentage. Keep the router under your control, not the vendor's — it is the kill switch and the migration controller in one.
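A minimal sketch of such a router follows, assuming the intent is known before the routing decision and each call has a stable ID. The `ROLLOUT` table, intent names, and `kill_switch` flag are hypothetical.

```python
import hashlib

# Hypothetical per-intent config: percentage of traffic sent to the new platform.
ROLLOUT = {"balance_enquiry": 100, "address_update": 25, "complaint": 0}


def bucket(call_id: str) -> int:
    """Map a call ID to a stable bucket in 0-99 so the same call routes the same way."""
    digest = hashlib.sha256(call_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(intent: str, call_id: str, kill_switch: bool = False) -> str:
    """Return "new" or "legacy" for this call."""
    if kill_switch:
        return "legacy"                      # one flag sends everything back
    percent_new = ROLLOUT.get(intent, 0)     # unknown intents stay on legacy
    return "new" if bucket(call_id) < percent_new else "legacy"
```

Hashing the call ID rather than drawing a random number keeps routing reproducible, which helps when a complaint handler later asks which platform took a given call.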

**What is the biggest cost surprise in a Nuance replacement?**

The integration layer. Legacy IVR integrations are often years old, partially documented, and tightly coupled. Rebuilding them for a modern voice agent commonly takes longer than building the voice flows themselves.

**Do we need to keep the legacy DTMF fallback?**

Yes, at least during migration and usually beyond it. DTMF remains the accessibility and reliability fallback for customers who cannot or will not interact with a voice agent.

**How do we measure migration success?**

Total cost per resolved call across both platforms, weighted by traffic, with CSAT and re-contact as gates. A migration that improves containment but degrades CSAT or increases re-contact has not succeeded.
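The blended figure is simple arithmetic, sketched here with made-up monthly numbers. Summing spend and resolved calls across both platforms weights each platform by its traffic automatically; the platform names and figures are illustrative.

```python
def cost_per_resolved_call(platforms: dict[str, dict[str, float]]) -> float:
    """Total spend across platforms divided by calls resolved without a human."""
    spend = sum(p["spend"] for p in platforms.values())
    resolved = sum(p["calls"] * p["containment"] for p in platforms.values())
    if resolved == 0:
        raise ValueError("no resolved calls")
    return spend / resolved


# Illustrative month during a hybrid migration.
month = {
    "legacy_ivr": {"spend": 4_000.0, "calls": 60_000, "containment": 0.25},
    "voice_ai": {"spend": 18_000.0, "calls": 40_000, "containment": 0.50},
}
blended = cost_per_resolved_call(month)
```

Track `blended` month on month alongside CSAT and re-contact; a falling cost with a rising re-contact rate is not an improvement.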

---

## Voice AI DPIA template: a working data protection impact assessment

URL: /guides/voice-ai-dpia-template
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A voice AI DPIA is not optional under UK or EU GDPR. The processing is large-scale, automated, and frequently biometric — every one of those flags Article 35 individually. This template gives you the nine sections an ICO or DPC reviewer expects, with the evidence that has to back each one.

### Why voice AI requires a DPIA

Article 35 of UK GDPR (and the equivalent in EU GDPR) requires a Data Protection Impact Assessment whenever processing is likely to result in a high risk to the rights and freedoms of natural persons. Voice AI deployments trigger at least three of the ICO's listed criteria: large-scale processing, automated decision-making with significant effect, and — wherever voice biometrics are present — processing of special-category data.

The DPIA is not a launch artefact. It is a living document maintained by the data controller, refreshed at every material change to the processing, and produced on demand to a regulator. A DPIA that was signed off six months before launch and never updated is evidence of a failed governance process, not evidence of compliance.

### The nine sections a defensible DPIA contains

These map directly to the ICO's published template and to the EDPB's Article 35 guidance. Other jurisdictions add sections (CNIL adds a French legal-basis box; the Irish DPC adds a vendor-management appendix) but the core nine are universal for UK and EU voice AI deployments.

### Section-by-section: what to write

Description of processing: enumerate the data categories in the call path — caller phone number, transcript, intent labels, identifiers used for authentication, payment data if in scope, and any biometric voice template. State the source (caller, internal system, third party), the recipient (each sub-processor by name), and the retention period for each category.

Necessity and proportionality: do not write 'because it is cheaper'. The defensible answer is that voice AI handles a defined intent set faster than the human-only alternative, with measured quality, and that less-intrusive options (DTMF IVR, web self-service) were considered and have known coverage gaps for the in-scope use cases.

Lawful basis: contract performance (Article 6(1)(b)) covers most servicing calls; legitimate interest (6(1)(f)) covers measurement and quality monitoring with a documented balancing test; consent is rarely the right basis for inbound calls because it cannot be freely given when the customer needs the service.

Article 9 condition: only relevant when voice biometrics are used for authentication. The condition is normally 'explicit consent' (9(2)(a)) — and the consent flow has to be auditable per caller.

Article 22 analysis: a call that routes to a human, books an appointment, or quotes a balance is not Article 22 decisioning. A call that closes an account, declines a transaction, or denies a claim is, and it requires explicit consent, contractual necessity, or authorisation in law, together with a documented right to contest the decision and obtain human review.

### Risk register: the entries that have to appear

A DPIA without a risk register is not a DPIA. The entries below appear in almost every voice AI deployment; missing them is a sign the assessment was templated rather than worked.

### Sub-processor map: the artefact procurement actually uses

The DPIA's sub-processor map is the artefact that procurement, security, and the DPO all reach for. It lists every party in the call path — voice platform, STT provider, LLM provider, TTS provider, telephony carrier, recording vendor, analytics processor — with jurisdiction, transfer mechanism, and the contractual purpose limit applied to each.

A platform that cannot produce its sub-processor map without three weeks of internal coordination is a platform with an unmanaged data flow. Treat that gap as a procurement gate, not a DPIA footnote.
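As a structural illustration, the map can be held as typed records with a completeness check against the roles in the call path. The role labels, placeholder vendor names, and example transfer mechanisms are assumptions for the sketch, not a statement about any real supplier.

```python
from dataclasses import dataclass

# Roles taken from the call path described above; labels are illustrative.
REQUIRED_ROLES = {"voice_platform", "stt", "llm", "tts", "telephony", "recording", "analytics"}


@dataclass(frozen=True)
class SubProcessor:
    role: str
    name: str                 # placeholder names only in this sketch
    jurisdiction: str
    transfer_mechanism: str   # e.g. "none (UK)", "adequacy", "IDTA"
    purpose_limit: str


def missing_roles(entries: list[SubProcessor]) -> set[str]:
    """Roles in the call path that have no entry in the map."""
    return REQUIRED_ROLES - {e.role for e in entries}


entries = [
    SubProcessor("voice_platform", "Vendor A", "UK", "none (UK)", "call handling only"),
    SubProcessor("stt", "Vendor B", "EU", "adequacy", "transcription only"),
    SubProcessor("llm", "Vendor C", "US", "IDTA", "inference only, no training"),
]
gaps = missing_roles(entries)
```

A non-empty `gaps` set at procurement stage is the unmanaged data flow described above, made visible.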

### Review cadence and triggers

The DPIA is refreshed annually as a calendar event, and off-cycle whenever any of the following changes: a sub-processor is added or replaced, the model provider changes, a new intent is added that introduces Article 22 decisioning, the retention period for any data category changes, or a near-miss incident occurs in production.

Each off-cycle refresh updates the risk register and the sub-processor map, and is countersigned by the DPO. Version-control the document — the regulator will ask for the history, not the latest copy.
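The cadence above reduces to a small rule: a refresh is due if a year has passed or any listed trigger has fired. The event labels and the `dpia_refresh_due` helper in this sketch are hypothetical.

```python
from datetime import date, timedelta

OFF_CYCLE_TRIGGERS = {
    "sub_processor_changed",
    "model_provider_changed",
    "new_article_22_intent",
    "retention_period_changed",
    "near_miss_incident",
}


def dpia_refresh_due(last_review: date, today: date, events: set[str]) -> list[str]:
    """Return the reasons a refresh is due; an empty list means none."""
    reasons = sorted(events & OFF_CYCLE_TRIGGERS)   # ignore events that are not triggers
    if today - last_review >= timedelta(days=365):
        reasons.append("annual_review")
    return reasons


reasons = dpia_refresh_due(
    last_review=date(2026, 1, 10),
    today=date(2026, 6, 15),
    events={"model_provider_changed", "prompt_wording_tweak"},
)
```

In this example only the model-provider change triggers a refresh; the prompt tweak is not on the list, and the annual review is not yet due.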

### Key takeaways

- Voice AI triggers an Article 35 DPIA on at least three grounds — large scale, automated decisioning, and biometrics.
- The DPIA is owned by the controller, not the vendor — and is produced to the regulator on demand.
- Nine sections, with a risk register and a sub-processor map, are the minimum a defensible DPIA contains.
- Article 22 analysis is done per intent, not per platform — and the human-in-loop design follows it.
- Review is annual plus event-triggered; version-control the document because the regulator will ask for the history.

### FAQs

**Is a DPIA mandatory for every voice AI deployment?**

In practice, yes — under UK and EU GDPR. The processing is large-scale and automated, and frequently includes biometric or special-category data. Even where one criterion alone would not require it, the combination clearly does, and regulators have been explicit on this point in published guidance.

**Who owns the DPIA — the vendor or the controller?**

The data controller owns it. Vendors can and should provide DPIA inputs (sub-processor maps, data-flow diagrams, technical-measure descriptions), but the controller signs it off and produces it to the regulator. A vendor offering to 'do the DPIA for you' is misunderstanding the obligation.

**How long does a defensible DPIA take to produce?**

Four to eight weeks for a new deployment, assuming the vendor produces sub-processor and data-flow artefacts within the first two weeks. Most of the elapsed time is internal — DPO review, legal sign-off, risk-treatment decisions — not document drafting.

**What happens if we deploy without a DPIA?**

Under UK GDPR, the ICO can require the DPIA after the fact and, where high-risk processing has occurred without one, treat the omission as a compliance failure in its own right. The fine exposure under Article 83(4) is up to £8.7 million or 2% of global annual turnover, whichever is higher. The reputational exposure is usually larger than the fine.

---

## EU AI Act voice AI classification: limited, high-risk, or out of scope?

URL: /guides/eu-ai-act-voice-ai-classification
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Voice AI under the EU AI Act sits in one of three buckets: out of scope, limited-risk (transparency duty only), or high-risk (full Annex III regime). The bucket is decided per use case by what the AI is doing, not by which vendor sold it. Most servicing deployments are limited-risk; remote biometric identification and eligibility decisioning are not.

### The three classifications that matter

The EU AI Act creates four risk categories — prohibited, high-risk, limited-risk, and minimal-risk — but voice AI in enterprise contact centres realistically sits in only three of them. Prohibited practices (real-time biometric identification in public spaces, social scoring) do not apply to inbound customer service.

Limited-risk is the default for most servicing use cases. The obligation is narrow but specific under Article 50: the natural person must be informed that they are interacting with an AI system, unless this is obvious from the context. Article 50(1) places the design duty on the provider, but the deployer configures the greeting and so controls whether the disclosure is actually made on the call. 'Obvious from the context' is a higher bar than vendors often suggest: a voice that sounds human, in a channel a human would normally answer, requires disclosure.

High-risk is the regime that changes the economics of the deployment. The full conformity assessment, technical documentation, post-market monitoring, and EU database registration apply.

### When voice AI becomes high-risk: Annex III triggers

Annex III lists eight high-risk use case families. Four of them are routinely triggered by voice AI deployments. Read each one literally — the language is the test, not the marketing.

### Roles: provider, deployer, distributor

Three of the roles the Act defines matter for a typical deployment. The vendor that develops and places the voice AI platform on the market is the provider. The enterprise that uses it under its own authority is the deployer. The reseller or system integrator that supplies it without modification is the distributor.

Most obligations sit with the provider — conformity assessment, technical documentation, EU database registration. Deployer obligations are narrower but real: human oversight, monitoring use against the provider's instructions, logging, transparency to affected persons, and (for high-risk systems serving public functions or essential services) a fundamental rights impact assessment.

A deployer that substantially modifies a high-risk system — for example, by replacing the underlying LLM with one not covered by the original conformity assessment — becomes a provider in its own right for that modified system. This is the trap most often missed in build-on-platform deployments.

### Limited-risk: the transparency duty in practice

For limited-risk deployments, the obligation under Article 50 is to inform the natural person that they are interacting with an AI system, at the start of the interaction, in a clear and distinguishable manner.

What this means in the call: a disclosure within the first turn ('You're speaking with an automated assistant') that is unambiguous and not buried in legal text. A voice that introduces itself by name without disclosing its nature does not meet the duty. A disclosure offered only on request does not meet the duty. The disclosure is owed to the caller, not to the deployer's legal team.

### Documentation: what each tier requires

Limited-risk deployments need: a transparency disclosure record, a basic data-protection assessment that overlaps with the GDPR DPIA, and a description of the use case kept on file.

High-risk deployments need: the provider's technical documentation under Annex IV (supplied to the deployer), a risk management system, data and data governance documentation, human oversight design, accuracy and robustness evidence, logging that supports post-market monitoring, and — for deployers in scope of Article 27 — a fundamental rights impact assessment before first use.

- Transparency disclosure record (limited-risk and above)
- DPIA / FRIA alignment under UK/EU GDPR (all tiers)
- Provider technical documentation under Annex IV (high-risk)
- Human oversight design with documented authority to intervene (high-risk)
- Logging and post-market monitoring plan (high-risk)
- EU database registration where the deployer is in scope (high-risk public-function deployments)

### Timelines and enforcement

The Act's obligations phase in. Prohibitions applied from February 2025; general-purpose AI obligations from August 2025; high-risk obligations under Annex III apply from August 2026, with obligations for high-risk AI embedded in products covered by the Annex I harmonisation legislation applying from August 2027.

Enforcement is by national competent authorities, with fines up to 7% of global turnover for prohibited-practice breaches, 3% for most other obligations, and 1% for supplying incorrect information to authorities. The fine exposure is meaningful; the reputational exposure is larger.

### Key takeaways

- Most servicing voice AI is limited-risk — the obligation is a clear AI disclosure at the start of the call.
- Annex III triggers (biometrics, essential services, credit, employment) push the deployment into high-risk.
- Classification is per use case, not per platform — and the deployer makes the call.
- Substantially modifying a high-risk system makes the deployer a provider in its own right.
- Annex III obligations apply from August 2026 — the preparation runway is measured in months, not years.

### FAQs

**Is a customer service voice AI automatically high-risk?**

No. A voice AI that books appointments, answers status queries, or takes payments is limited-risk under the Act. It only becomes high-risk when the use case falls within an Annex III category — biometric identification, essential-services eligibility, creditworthiness, employment, and a small number of others.

**Does the transparency duty mean the AI cannot sound natural?**

No. The Act is silent on voice naturalness. It requires that the caller be informed that they are interacting with an AI, in a clear and distinguishable manner, at the start of the interaction. A natural-sounding voice that introduces itself as an automated assistant meets the duty.

**Who is responsible — the platform vendor or our company?**

Both, in different roles. The vendor is the provider and carries the conformity, documentation, and registration obligations for the platform. Your company is the deployer and carries the transparency, monitoring, human-oversight, and (where in scope) fundamental-rights-impact-assessment obligations for how you use it. Neither role can be contracted away.

**When do the obligations actually bite?**

Article 50 transparency obligations and high-risk obligations under Annex III both apply from August 2026. A high-risk system already on the market before that date is brought into scope once its design changes significantly, so an existing deployment is not exempt indefinitely. Preparation timelines are tight if a deployment touches an Annex III category.

---

## PCI DSS v4.0 and voice AI: keeping cardholder data out of the model

URL: /guides/pci-dss-v4-voice-ai
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** The single deployment decision that determines PCI scope for voice AI is whether the cardholder PAN ever touches the LLM context window. If it does, every model provider, telephony carrier, and recording vendor in the call path is in scope and the architecture has to satisfy the full DSS v4.0 control set. Pause-and-resume DTMF, done properly, keeps the AI out of scope.

### The PCI DSS v4.0 changes that matter for voice AI

PCI DSS v4.0 replaced v3.2.1 as the only effective standard from 31 March 2024, with future-dated requirements becoming mandatory from 31 March 2025. Three changes materially affect voice AI deployments: the customised approach (an alternative validation route that lets entities meet a requirement's stated objective via non-prescriptive controls if rigorously documented), expanded scope for third-party service providers (Requirements 12.8 and 12.9), and explicit treatment of cardholder data in non-traditional channels.

The headline implication for voice AI: the standard does not contemplate cardholder data flowing through an LLM. There is no exception, no carve-out, and no 'AI annex'. If a PAN reaches the model, the model provider is a Level 1 service provider and the whole call path is in audit scope.

### Pause-and-resume DTMF: the only defensible pattern

The standard pattern that keeps voice AI out of PCI scope is pause-and-resume DTMF capture. The AI conducts the conversation up to the point of payment, hands the call to a PCI-scoped capture service that accepts DTMF tones (the customer types card digits on the keypad), then resumes once the transaction is authorised.

Done properly, the AI never hears the digits — neither the audio nor the transcription. The DTMF tones are intercepted upstream of the speech-to-text path, the call is masked to the AI, and the only signal the AI receives back is a transaction outcome (authorised / declined / referred). The pattern has been the standard for human-agent contact centres for over a decade; voice AI inherits it without modification.

- DTMF capture service is a separately attested PCI Level 1 provider
- Audio path to the AI is muted during capture — no STT, no recording
- Tone-suppression is verified by a third-party penetration test, not vendor assertion
- The AI receives only the transaction outcome, never the PAN, expiry, or CVV
- Recording, where retained, is split so the masked period is unrecoverable from the retained audio
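The control flow behind the checklist above can be sketched as follows. This is an illustrative model only, under stated assumptions: `CallSession`, `pause_and_resume`, and the `capture` callable are hypothetical names standing in for the orchestration layer and the PCI-scoped capture service, and real tone suppression happens in the telephony path, not in application code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Outcome(Enum):
    AUTHORISED = "authorised"
    DECLINED = "declined"
    REFERRED = "referred"


@dataclass
class CallSession:
    ai_audio_enabled: bool = True
    ai_events: list[str] = field(default_factory=list)  # everything the AI side ever sees

    def deliver_to_ai(self, event: str) -> None:
        # Anything that arrives while the AI path is muted is dropped, not queued.
        if self.ai_audio_enabled:
            self.ai_events.append(event)


def pause_and_resume(session: CallSession, capture: Callable[[], Outcome]) -> Outcome:
    """Mute the AI path, run the PCI-scoped capture, then resume with the outcome only."""
    session.ai_audio_enabled = False
    try:
        outcome = capture()  # assumed to run entirely inside the PCI-scoped service
    finally:
        session.ai_audio_enabled = True  # resume even if capture fails
    session.deliver_to_ai(f"payment_outcome:{outcome.value}")
    return outcome


# Usage: a stand-in capture step that tries to leak digits while the AI is muted.
session = CallSession()


def fake_capture() -> Outcome:
    session.deliver_to_ai("dtmf:4111111111111111")  # dropped, because the path is muted
    return Outcome.AUTHORISED


result = pause_and_resume(session, fake_capture)
```

The design point is that the AI side has one inbound signal from the capture window, the outcome, and that muting is enforced by the session rather than trusted to the capture step.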

### What pulls voice AI back into scope

Three deployment patterns silently put the AI back in PCI scope and most procurement teams miss at least one.

First: 'we'll redact the PAN from the transcript after the call'. Redaction after the fact does not reduce scope. The PAN was present in the LLM context at processing time. The model provider, the STT provider, and any logging pipeline are in scope from the moment the digits hit the prompt.

Second: 'the AI reads the digits back to confirm'. Read-back means the TTS provider now handles cardholder data, which expands scope to the TTS and synthesis pipeline. The defensible pattern is for the DTMF capture service to confirm out-of-band (visual confirmation in a web channel, or human-agent confirmation), never the AI.

Third: 'we capture by voice when DTMF fails'. Voice fallback for cardholder data eliminates the scope reduction entirely. If voice fallback is necessary for accessibility, route the call to a PCI-scoped human-agent flow, not to the AI.

### Sub-processor and SAQ obligations

Even with pause-and-resume DTMF correctly implemented, the merchant is still responsible for documenting which parties touch cardholder data and obtaining their PCI attestation. Requirement 12.8.1 requires a maintained list of service providers, and Requirement 12.8.5 requires a record of which PCI DSS requirements each provider manages and which remain with the merchant.

For a typical voice AI deployment, the in-scope service-provider list is the telephony carrier (always in scope — they carry the audio), the DTMF capture provider (in scope, with full Level 1 attestation), and the payment gateway. The AI platform, STT, LLM, and TTS providers are out of scope — and their out-of-scope status is documented in the network and data-flow diagrams that accompany the merchant's SAQ.

The SAQ that applies depends on the merchant's payment volume and channel mix; SAQ A or A-EP is typical for properly scoped pause-and-resume deployments. A QSA-led Report on Compliance replaces the SAQ at Level 1 merchant volumes.

### Controls QSAs actually test

A QSA reviewing a voice AI deployment for PCI v4.0 will test a recurring set of controls: data-flow diagrams, segmentation, tone-suppression evidence, and sub-processor AOCs. Knowing the list before the audit is the difference between a clean report and a remediation cycle.

### Key takeaways

- The single architecture decision that determines PCI scope is whether PAN touches the LLM context.
- Pause-and-resume DTMF capture, properly implemented, keeps the AI out of scope.
- Post-hoc redaction does not reduce scope — the data was present at processing time.
- Voice fallback for cardholder data destroys the scope reduction; route to a human or specialised provider instead.
- The QSA tests data-flow diagrams, segmentation, tone-suppression evidence, and sub-processor AOCs — have all four ready.

### FAQs

**Can voice AI be PCI compliant by itself?**

PCI compliance is a property of a deployment, not of a vendor or product. A voice AI platform can be deployed in a PCI-compliant architecture (pause-and-resume DTMF, AI out of scope) or in a non-compliant one (PAN in LLM context). The vendor's marketing is irrelevant to your assessment.

**What if our use case requires voice payment capture?**

Route the voice-payment portion of the call to a human agent on a PCI-scoped path, or to a specialised voice-payment provider that holds its own Level 1 attestation. Do not capture by voice through a general-purpose AI platform — the scope expansion makes the deployment uneconomic and the regulator exposure is asymmetric.

**Does call recording need to change?**

Yes. The recording must be split, paused, or masked during the DTMF capture window so the retained audio does not contain recoverable cardholder data. Most enterprise recording platforms support this; the configuration has to be tested, not assumed.

**What changes under v4.0 versus v3.2.1 for voice AI?**

Three things: the customised approach lets you meet some objectives via non-prescriptive controls (useful for novel AI architectures, with strict documentation), service-provider obligations under 12.8 and 12.9 tightened materially, and targeted risk analysis under 12.3.1 is now required for any control where flexibility is exercised. None of this changes the core architecture decision: keep PAN out of the model.

---

## Voice AI RACI: programme governance that survives quarter two

URL: /guides/voice-ai-raci-programme-governance
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Most voice AI programmes have a launch RACI and no operating RACI. The launch RACI gets the platform live; the operating RACI keeps it useful. The two are different documents owned by different people, and the second one is what determines whether the programme exists in a year.

### Why voice AI breaks the standard contact-centre RACI

A traditional contact-centre RACI assigns clear ownership of channels (CX), technology (IT), and compliance (Legal / Risk). Voice AI sits across all three in a way that previous self-service technologies did not. Prompts and flows are an operational asset (CX), the model and integration layer is engineering (IT), the data and decisioning are subject to GDPR and the EU AI Act (Legal / DPO), and the platform contract creates new third-party risk obligations (Procurement / IT-Sec).

Treating any one of these as the primary owner produces a programme that the others cannot work with. The defensible pattern is shared accountability with explicit decision rights, written down before launch and reviewed every quarter.

### The fifteen recurring decisions to assign upfront

These are the decisions that recur, often unannounced, in every voice AI programme. Each one needs an A (accountable, single name), supporting Rs (responsible), and explicit Cs and Is. Leaving any of them unassigned means the decision will be taken by whoever escalates loudest.
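The assignment rule is mechanical enough to lint. The sketch below is an assumption-laden illustration: the `raci_gaps` function, the decision names, and the role names are all hypothetical, and it treats "at least one R" as a requirement, which is a stricter reading than the text strictly states.

```python
def raci_gaps(raci: dict[str, dict[str, str]]) -> dict[str, str]:
    """Map each problem decision to a description; an empty dict means the RACI is clean."""
    gaps: dict[str, str] = {}
    for decision, roles in raci.items():
        accountable = [name for name, letter in roles.items() if letter == "A"]
        responsible = [name for name, letter in roles.items() if letter == "R"]
        if len(accountable) != 1:
            gaps[decision] = f"needs exactly one A, found {len(accountable)}"
        elif not responsible:
            gaps[decision] = "no R assigned"
    return gaps


# Hypothetical rows for illustration only.
example = {
    "prompt change": {"Conversation Owner": "A", "Engineering": "R", "DPO": "C"},
    "model upgrade": {"IT": "A", "CX Ops": "A", "Vendor": "R"},
    "sub-processor change": {"Procurement": "A", "DPO": "C"},
}
gaps = raci_gaps(example)
```

Running a check like this at each quarterly review makes an unassigned or doubly assigned decision visible before it is taken by whoever escalates loudest.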

### The accountable owner of the operating model

Successful enterprise voice AI deployments share one role that is often missing from the launch project: the Conversation Owner. This is the named individual in CX Operations who owns the prompt, flow, and intent taxonomy in production, has authority to change them on a defined cadence, and is measured on the operating outcomes (containment, CSAT, re-contact) rather than on engineering velocity.

If this role does not exist, every change becomes an engineering ticket and the deployment stagnates at the version that shipped at launch. The Conversation Owner sits in CX Ops, not in IT, and has tooling (a controlled editor with diff review, staging, and one-click rollback) that lets them make changes without writing code.

### Cadences that hold the programme together

Three regular meetings (a weekly ops review, a monthly platform review, and a quarterly programme review), each with a named chair and a published agenda, are the minimum operating cadence. Drop any of them and the programme drifts.

### What the RACI does not include

The RACI is silent on commercial relationship management beyond renegotiation. It is silent on routine vendor communication. It is silent on PR and external communication about the programme. These belong to the sponsor and the communications function and are deliberately kept out of the operating RACI so that engineering and CX Ops can run the programme without being interrupted by external coordination.

Similarly, the RACI does not own brand voice or tone-of-voice decisions on the AI's outputs. Those belong to Marketing / Brand, which is consulted on conversation design but is not in the operating chain. Mixing brand decisions into the operating RACI is how voice AI programmes acquire six-week approval cycles for prompt changes.

### Key takeaways

- Voice AI sits across CX, IT, Legal, and Procurement — single-owner RACIs do not work.
- Fifteen recurring decisions need a named Accountable before launch, not after the first incident.
- The Conversation Owner role in CX Ops is the most frequently missing role and the most operationally damaging gap.
- Three cadences — weekly ops, monthly platform, quarterly programme — are the minimum that holds the programme together.
- Review the RACI every quarter; workarounds are evidence the document is wrong, not that the team is undisciplined.

### FAQs

**Who should be the Sponsor — CX, IT, or the COO?**

Whoever owns the P&L the programme is justified against. In customer-service-led deployments that is usually the CX Director or VP. In efficiency-led deployments it is more often the COO. The wrong answer is to put sponsorship in IT — voice AI is an operating change, not a technology refresh.

**Does the vendor sit on the RACI?**

The vendor appears as Responsible on platform-change decisions and as Consulted on roadmap and incident decisions, but is not Accountable for anything. Accountability that leaves the enterprise creates a governance gap that auditors and regulators will identify.

**How often should the RACI itself be reviewed?**

Every quarter, as part of the programme review. The test is simple: did the decisions taken in the last quarter map cleanly to the document, or did people work around it? Workarounds are evidence the RACI is wrong, not evidence the team is undisciplined.

**What is the most common gap in voice AI RACIs?**

The Conversation Owner role. The launch RACI usually has Engineering owning everything, then the launch project ends and nobody is named for the operational ownership of prompts and flows. Six months later the deployment has not changed since launch and nobody can say why.

---

## 2026 enterprise voice AI benchmark report: framework with illustrative numbers

URL: /guides/2026-enterprise-voice-ai-benchmark-report
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A 2026 enterprise voice AI benchmark only earns the name if it states its definitions, its denominators, and its sample. This report is a framework — published definitions, a measurement protocol, and illustrative numbers that show what defensible looks like on each axis. Compare your own measurements against it; do not adopt these numbers as your own.

### What this report is — and what it is not

This is a vendor-neutral framework report. It documents the seven dimensions on which an enterprise voice AI deployment should be benchmarked, the measurement protocol for each, and illustrative numbers that show what a defensible 2026 result looks like.

The numbers are explicitly illustrative. They are triangulated from public case studies, regulator filings, practitioner reports, and conversations across roughly 40 enterprise deployments observed between 2024 and 2026. They are not a primary research dataset. They will be wrong for any individual deployment in either direction — sometimes by 10 points, occasionally by 30. That is the point: a number outside these bands is a flag to investigate, not evidence to celebrate or panic.

The framework, by contrast, is intended to be adopted verbatim. Definitions, denominators, and measurement windows are the parts that survive across deployments; the numbers are decoration over the top of them.

### The seven dimensions

Most enterprise voice AI benchmarks fail because they report a single headline (containment, usually) without the supporting structure that makes it interpretable. A defensible benchmark covers seven dimensions, with the relationships between them explicit.

### Dimension 1 — Containment by intent

Containment is the most-quoted and least-defined number in the category. The defensible measurement is gross containment (calls not escalated divided by in-scope calls) and net containment (gross minus calls that re-contacted within 7 days for the same intent). Quote both or quote neither.

The bands below are illustrative — your number will sit somewhere on each row depending on integration depth and intent mix. The shape of the table, not the individual values, is the durable part.

### Dimension 2 — Autonomous resolution rate

Autonomous resolution rate is the stricter cousin of containment: contained calls minus calls that re-contacted for the same intent within the window, divided by total in-scope calls. It is harder to measure, smaller than containment, and much closer to what the customer would say if asked.

The illustrative gap between containment and autonomous resolution rate runs 8–22 percentage points in 2026 production deployments, with the largest gaps on emotional or multi-step intents and the smallest on clean transactional ones.
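The two definitions reduce to a few lines of arithmetic. The sketch below is one way to compute them from call counts; the function name and the example figures are hypothetical, and the re-contact count is assumed to cover only contained calls that returned for the same intent inside the measurement window.

```python
def containment_metrics(in_scope: int, escalated: int, recontacted: int) -> dict[str, float]:
    """Gross containment, autonomous resolution rate, and the gap in percentage points."""
    if in_scope <= 0:
        raise ValueError("in_scope must be positive")
    contained = in_scope - escalated
    if recontacted > contained:
        raise ValueError("recontacted cannot exceed contained calls")
    return {
        "gross_containment": contained / in_scope,
        "autonomous_resolution": (contained - recontacted) / in_scope,
        # The gap is the re-contact share of in-scope calls, expressed in points.
        "gap_points": 100 * recontacted / in_scope,
    }


# Hypothetical month: 1,000 in-scope calls, 400 escalated, 120 re-contacts.
metrics = containment_metrics(in_scope=1000, escalated=400, recontacted=120)
```

Keeping the denominator fixed at in-scope calls for both rates is what makes the gap directly comparable across intents.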

### Dimension 3 — End-to-end latency

Latency is measured end-to-end from end-of-turn detection to first audio frame returned, under realistic load (peak hour, real integrations, no warmed caches). Demo-condition numbers are not in scope for a benchmark.

The 2026 production band sits at 600–1800 ms end-to-end. Stacks above 2000 ms see measurable disengagement on inbound calls. The largest single contributor in most deployments is not the model — it is integration calls on the critical path.
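A minimal sketch of the measurement, assuming each turn is logged as a pair of millisecond timestamps (end-of-turn detected, first audio frame returned). The choice of p50 and p95 as summary statistics is an assumption of this example, not part of the protocol above; the 2000 ms threshold is the one quoted in the text.

```python
from statistics import quantiles


def latency_summary(turns_ms: list[tuple[float, float]]) -> dict[str, float]:
    """Summarise end-to-end latency from (end_of_turn, first_audio_frame) timestamps in ms."""
    latencies = sorted(first_audio - end_of_turn for end_of_turn, first_audio in turns_ms)
    # quantiles(..., n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "share_over_2000ms": sum(lat > 2000 for lat in latencies) / len(latencies),
    }


# Hypothetical peak-hour sample of five turns.
sample = [(0, 600), (5000, 5800), (9000, 10000), (15000, 16200), (20000, 22400)]
summary = latency_summary(sample)
```

Reporting a tail percentile alongside the median matters here because slow integration calls tend to show up in the tail long before they move the median.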

### Dimension 4 — Cost per resolved call

Cost per resolved call is the unit that matters; cost per call and cost per minute are inputs to it. The defensible 2026 model includes pre-transfer AI minutes, post-transfer handle-time penalty, 7-day re-contact, and the operating-model labour line of £150k–£400k per year that vendor ROI decks routinely omit.

Illustrative numbers, blended across deployments: a balance-status intent lands at £0.40–£1.00 cost per resolved call; a billing-question intent at £1.50–£3.50; a dispute or chargeback at £4.00–£12.00 once the operating model is amortised across the call mix. These are wide bands deliberately — the inputs vary by a factor of three across deployments and any narrower figure is false precision.
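The cost lines named above combine as shown in this sketch. It is one plausible arrangement of the model, not the definitive one: the `IntentInputs` fields and every figure in the example are hypothetical, and the operating-model labour is assumed to have been apportioned to the intent before it is passed in.

```python
from dataclasses import dataclass


@dataclass
class IntentInputs:
    calls: int                  # in-scope calls for the period
    ai_minutes_per_call: float  # includes pre-transfer minutes on escalated calls
    ai_cost_per_minute: float   # GBP, platform plus telephony
    escalated: int              # calls handed to a human
    transfer_penalty: float     # GBP of extra agent handle time per escalated call
    recontacted: int            # contained calls that came back within 7 days
    operating_labour: float     # GBP share of the operating-model labour line


def cost_per_resolved_call(x: IntentInputs) -> float:
    """Total spend for the intent divided by calls resolved without a human or a re-contact."""
    resolved = x.calls - x.escalated - x.recontacted
    if resolved <= 0:
        raise ValueError("no resolved calls in the period")
    total = (
        x.calls * x.ai_minutes_per_call * x.ai_cost_per_minute
        + x.escalated * x.transfer_penalty
        + x.operating_labour
    )
    return total / resolved


# Hypothetical figures chosen only to make the arithmetic easy to follow.
example = IntentInputs(
    calls=10_000, ai_minutes_per_call=2.0, ai_cost_per_minute=0.10,
    escalated=4_000, transfer_penalty=0.50, recontacted=1_000, operating_labour=2_000.0,
)
cost = cost_per_resolved_call(example)
```

In this example the spend is £2,000 of AI minutes, £2,000 of transfer penalty, and £2,000 of labour across 5,000 resolved calls, which is why the denominator moves the result as much as any cost line does.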

### Dimension 5 — Integration depth

Integration depth is the single biggest predictor of whether a deployment resolves calls or only talks about them. The benchmark is read/write coverage against the systems of record in scope, scored on a four-level scale.

### Dimension 6 — Operating-model maturity

Operating-model maturity is what keeps the deployment useful after launch. The measurement protocol is cycle-time from an identified improvement idea to that improvement running in production, measured across the last 90 days of changes, alongside a check that the Conversation Owner role has a named holder.

Illustrative cycle-time bands: stage-one programmes (engineering-ticket-for-every-change) sit at 6–12 weeks. Stage-two programmes (Conversation Owner with a controlled editor) sit at 5–10 working days. Stage-three programmes (with staging, diff review, and one-click rollback) sit at 1–3 working days.
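A sketch of the cycle-time measurement, assuming each change is logged with the date the idea was identified and the date it went live. The function name is hypothetical, the median is this example's choice of summary statistic, and the result is in calendar days, so it needs converting before comparison with the working-day bands above.

```python
from datetime import date, timedelta
from statistics import median


def median_cycle_days(changes: list[tuple[date, date]], today: date, window_days: int = 90) -> float:
    """Median calendar days from idea to production for changes that went live in the window."""
    cutoff = today - timedelta(days=window_days)
    recent = [(live - idea).days for idea, live in changes if live >= cutoff]
    if not recent:
        raise ValueError("no changes went live in the window")
    return median(recent)


# Hypothetical change log: two recent changes and one outside the 90-day window.
log = [
    (date(2026, 3, 1), date(2026, 3, 8)),
    (date(2026, 4, 1), date(2026, 4, 4)),
    (date(2025, 1, 1), date(2025, 3, 1)),
]
cycle = median_cycle_days(log, today=date(2026, 5, 1))
```

An empty window raises an error on purpose: no changes shipped in 90 days is itself a finding, not a zero.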

### Dimension 7 — Compliance posture

Compliance posture is binary at the gate level and graded across the operating measures. The gate items: a current DPIA refreshed within the last 12 months, an explicit EU AI Act classification per use case, a sub-processor map signed off in the last six months, and a PCI scope diagram dated within the last 12 months where cardholder data is in scope.

Operating measures: time-to-respond on a subject access request involving voice AI data, time-to-disclose a sub-processor change, and the count of off-cycle DPIA refreshes triggered in the last year (zero is usually a warning sign, not a celebration).
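The gate items lend themselves to a dated checklist. The sketch below is a minimal illustration: the gate keys and the `failed_gates` function are hypothetical, and "six months" and "12 months" are approximated as 183 and 365 days.

```python
from datetime import date

# Maximum age in days for each dated artefact (approximating the windows in the text).
GATE_MAX_AGE_DAYS = {"dpia": 365, "sub_processor_map": 183, "pci_scope_diagram": 365}


def failed_gates(
    last_dated: dict[str, date],
    today: date,
    ai_act_classified: bool,
    cardholder_data_in_scope: bool,
) -> list[str]:
    """Return the gate items that fail; an empty list means the gate is passed."""
    failures = []
    if not ai_act_classified:
        failures.append("ai_act_classification")
    for gate, max_age in GATE_MAX_AGE_DAYS.items():
        # The PCI diagram is only a gate item where cardholder data is in scope.
        if gate == "pci_scope_diagram" and not cardholder_data_in_scope:
            continue
        dated = last_dated.get(gate)
        if dated is None or (today - dated).days > max_age:
            failures.append(gate)
    return failures


# Hypothetical artefact dates.
dates = {"dpia": date(2025, 9, 1), "sub_processor_map": date(2025, 6, 1)}
without_pci = failed_gates(dates, date(2026, 3, 1), True, False)
with_pci = failed_gates(dates, date(2026, 3, 1), True, True)
```

A missing artefact fails the gate in the same way as a stale one, which matches the binary reading of the gate.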

### How to use this report

Read each dimension's framework. Adopt the definitions, denominators, and measurement windows verbatim. Run the measurement against your own deployment. Compare your number to the illustrative band and ask what would explain a result outside it.

Do not lift these numbers into your own board pack as if they were primary data — they are not. Do lift the framework, because the framework is what makes the comparison defensible across deployments and across years.

An annual refresh of your own measurements, against this framework, is more valuable than any single benchmark number quoted in a vendor deck.

### Key takeaways

- This is a framework report — numbers are illustrative, definitions are durable.
- Seven dimensions cover containment, autonomous resolution, latency, cost per resolved call, integration depth, operating-model maturity, and compliance posture.
- A gap of 8–22 points between containment and autonomous resolution is normal in 2026; smaller gaps usually mean re-contact is not being counted.
- Integration calls dominate end-to-end latency in most production stacks — not the LLM.
- Operating-model maturity (cycle-time from idea to production) is the dimension most often missing from internal benchmarks.

### FAQs

**Are these numbers from primary research?**

No. They are illustrative bands triangulated from published case studies, regulator filings, practitioner observations, and conversations across roughly 40 enterprise deployments. They are clearly labelled illustrative throughout. The framework — definitions, denominators, measurement windows — is the durable contribution; the numbers are decoration.

**Can I cite this report in a vendor evaluation?**

Cite the framework — the seven dimensions and the measurement protocol — not the specific numbers as if they were your own measurements. The illustrative bands are useful as a sanity-check against your measured results, not as a substitute for measuring.

**How often should we re-benchmark our deployment?**

Quarterly on operational dimensions (containment, latency, cost per resolved call), annually on compliance and integration depth, and whenever a material change to the stack or sub-processor list occurs.

**What is the single dimension most often missing from internal benchmarks?**

Operating-model maturity. Programmes that benchmark containment and latency monthly often have no measurement at all for cycle-time from idea to production — which is the single best predictor of whether the deployment will still be useful in 12 months.

---

## Conversational IVR: defined, compared, and where it fits in 2026

URL: /guides/conversational-ivr-defined-and-compared
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Conversational IVR is a telephony interface that lets a caller speak naturally to a system that maps utterances to pre-defined intents and slots, rather than typing them on a keypad. It is not the same as an autonomous voice agent: it follows a structured workflow rather than dynamic reasoning, and its containment ceiling is correspondingly lower.

### What conversational IVR actually means

Conversational IVR is the telephony layer above touch-tone, below an autonomous voice agent. The caller speaks in their own words; the system uses ASR plus an NLU model to map the utterance to a pre-defined intent and the slots that intent needs. The dialogue follows a structured graph the design team authored — it does not reason its way through new situations.

The label is older than the current capability. 'Natural-language IVR' from the mid-2010s was the same idea executed with the speech and NLU stack of the time, and most of the bad memories enterprises carry forward come from that generation. Modern conversational IVR shares the architecture but runs on streaming ASR, LLM-backed NLU, and sub-second turn-taking — which makes it materially more usable than the systems it replaces.

### The four-rung automation ladder

Voice automation is best read as a ladder of four rungs: DTMF, directed dialogue, conversational IVR, and the voice AI agent. Each rung adds flexibility and lowers the latency budget; each one also expands the integration and governance burden. Conversational IVR is the third rung: a meaningful upgrade on touch-tone, a meaningful step short of an autonomous agent.

### Where conversational IVR fits in 2026 — and where it doesn't

Conversational IVR is the right answer where the workflow is deterministic, the intent set is bounded, and the cost of a non-deterministic response is high. Account-balance lookups, payment-status enquiries, appointment confirmations, claims first-notification, and outage reporting all fit comfortably.

It is the wrong answer where the caller's question is advisory, where the answer depends on synthesising several documents, or where empathy and acknowledgement are part of the resolution. Trying to push conversational IVR into those flows is the single most common cause of programmes that contain at 28% and stall.

### Architecture: what changed since the 2015 stack

The label is unchanged; the implementation is not. A modern conversational IVR shares almost no components with its 2015 ancestor.

- Streaming ASR — partial transcripts arrive while the caller is still speaking, instead of batched after a silence detector fires
- LLM-backed NLU — semantic intent matching replaces brittle keyword or regex rules; mid-utterance corrections are recoverable
- Graph-based dialogue management — flows are authored as graphs with branching and back-off, not finite state machines
- Sub-700ms end-to-end turn budget — the round trip from end-of-user-speech to start-of-system-speech has to clear ~700ms for the interaction to feel natural
- Barge-in by default — callers can interrupt the prompt without losing context, which is table stakes in 2026 but absent in most legacy systems

### The honest containment ceiling

Vendor decks routinely show conversational IVR containment in the 60–80% range. In production at enterprise scale, on a representative call mix, the realistic ceiling is 25–45% on transactional intents and lower on advisory ones. The gap between the deck and reality is almost always intent coverage: the demo handles the head; the production traffic includes the long tail.

Two failure modes account for most of the disappointment. The first is the 'unknown intent' bucket growing past 20% of traffic — the NLU is not at fault, the intent map is incomplete. The second is repeat contact: containment looks good in the IVR but the same caller returns within 24 hours to a human, which means the system contained the call without resolving it.
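Both failure modes can be tracked from ordinary call logs. The sketch below assumes a hypothetical log shape: one intent label per call with unmatched calls labelled `"unknown"`, a map of caller ID to the time their call was contained, and a list of later human-agent contacts. The function names are illustrative.

```python
from datetime import datetime, timedelta


def unknown_intent_share(intents: list[str]) -> float:
    """Share of calls the NLU could not map to an authored intent."""
    return intents.count("unknown") / len(intents)


def contained_without_resolution(
    contained_at: dict[str, datetime],
    agent_contacts: list[tuple[str, datetime]],
    window: timedelta = timedelta(hours=24),
) -> set[str]:
    """Caller IDs whose contained call was followed by a human contact inside the window."""
    return {
        caller
        for caller, at in agent_contacts
        if caller in contained_at and timedelta(0) <= at - contained_at[caller] <= window
    }


# Hypothetical logs.
share = unknown_intent_share(["balance", "unknown", "payment", "unknown", "balance"])
contained = {"c1": datetime(2026, 6, 1, 9, 0), "c2": datetime(2026, 6, 1, 10, 0)}
agents = [("c1", datetime(2026, 6, 1, 15, 0)), ("c2", datetime(2026, 6, 3, 10, 0))]
repeaters = contained_without_resolution(contained, agents)
```

Subtracting the repeat callers from the contained count is what separates calls the system resolved from calls it only ended.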

### When to upgrade to a full voice AI agent

The signal that you have outgrown conversational IVR is not a single metric. It is a pattern across four:

### Procurement: what to actually test in the demo

Vendor demos are choreographed. The decisions that matter happen when the choreography breaks. Insist on testing on your own audio, not the vendor's, and on the following:

- Barge-in on a long prompt — does the system stop cleanly or stutter?
- Noisy line — café noise, traffic, hands-free in a car: how does ASR degrade?
- Accent and code-switching on your actual customer audio, not vendor reference clips
- Slot recovery — caller gives half a postcode, half a date: does the system request the missing half cleanly?
- DTMF fallback — can a caller in a noisy environment switch to keypad without losing context?
- Mid-call topic change — caller starts on billing, switches to a service request: graceful or restart?

### Key takeaways

- Conversational IVR maps spoken intent to a pre-defined graph; it is not autonomous reasoning
- The four-rung ladder is DTMF → directed dialogue → conversational IVR → voice AI agent
- Realistic containment ceiling is 25–45% on transactional intents; long-tail intents are the limit
- Modern stacks require streaming ASR, LLM-backed NLU, and a sub-700ms turn budget
- Demo evaluation must include barge-in, noisy lines, accents, slot recovery, and DTMF fallback

### FAQs

**What is the difference between IVR and conversational IVR?**

Standard IVR uses touch-tone keypad input mapped to a fixed menu tree. Conversational IVR uses ASR and NLU so the caller can speak the request, and supports multi-slot filling in a single turn.

**Is conversational IVR the same as a voice AI agent?**

No. Conversational IVR maps speech to a pre-defined intent and follows an authored graph; a voice AI agent reasons across the available context and tools. The architecture, containment ceiling, and governance footprint differ accordingly.

**What containment rate should I expect from conversational IVR?**

On a representative enterprise call mix, 25–45% on transactional intents is the realistic range. Higher numbers are usually quoted on a subset (e.g. the top three intents) and do not survive contact with the long tail.

**Does conversational IVR require an LLM?**

Not strictly, but in 2026 LLM-backed NLU is the de facto standard. Intent recognition accuracy, mid-utterance correction, and slot recovery all improve materially against pre-LLM models.

**When should we replace conversational IVR with a full voice AI agent?**

When unknown-intent rate clears 20%, re-prompts per call clear 1.5, and recontact-within-24-hours exceeds your baseline. Those signals together mean the architecture, not the configuration, has hit its ceiling.

**Can conversational IVR handle outbound calling?**

Yes, but the binding constraint is rarely the technology — it is the regulatory regime in the region you are calling into. Consent, DNC, and disclosure rules vary sharply.

---

## Voice AI platform pricing models in 2026: the enterprise buyer's guide

URL: /guides/voice-ai-platform-pricing-models-2026
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** Voice AI pricing in 2026 is moving from 'all-in per minute' to unbundled component pricing and outcome-based resolution models. To avoid overspending, model the hidden cost stack and the contract terms — not just the unit rate. The quoted number is rarely the number you pay.

### The five pricing models you'll see in 2026 RFPs

The market has converged on five recognisable structures. Most enterprise quotes are a blend of two of them.

### Per-minute pricing: what's actually included

A headline per-minute number is almost always partially unbundled in enterprise quotes. The published rate typically covers a baseline ASR, baseline TTS, and a default LLM. The line items that get added on top are what surprise procurement late in the cycle.

- Premium voices and language packs — frequently a 20–30% surcharge over the baseline TTS
- Higher-capability LLMs for harder reasoning — surcharged per minute or per token
- Prompt-token volume — long retrieval contexts and large knowledge bases meter against a separate token allowance
- Observability and analytics seats — real-time dashboards, conversation review, and QA tooling priced per user
- Telephony — almost never included in the AI rate; either pass-through-at-cost or a markup
- Recording storage and transcription retention — per-GB monthly and per-call retrieval pricing
- Sandbox / UAT environments — many platforms meter non-production usage separately

### Per-resolution pricing: the definition is the negotiation

Outcome-based pricing aligns the vendor with the business case, but the word 'resolution' is doing a lot of work in the contract. Four definitions dominate the market, each with its own loophole.

### The hidden cost stack

Five line items account for most of the gap between quoted unit rate and realised cost. Together they routinely add 15–25% to the bill.

- Telephony pass-through — PSTN, SIP trunking, and international termination; rarely included in the AI rate
- Recording storage and retention — per-GB monthly plus per-retrieval, multiplied by your compliance retention window
- SSO, SCIM, and audit-log access — frequently behind an 'Enterprise' or 'Pro' tier
- Sandbox, UAT, and dedicated tenant — non-production environments metered separately
- Professional services minimums — implementation, ongoing customer-success retainer, and integration hours

### Token and model pass-through pricing

Some platforms expose the underlying LLM cost directly and add a platform margin; others absorb it into a bundled per-minute rate. The trade-off is predictability versus exposure.

Pass-through pricing usually quotes lower nominal rates and shifts FX and model-provider rate-card risk to the buyer. If the underlying model provider re-prices or changes its tokenisation, your unit economics move with it. Build a quarterly review and a budget buffer; do not assume the rate at signature is the rate at month nine.

### An honest TCO worked example

The same call volume across three pricing structures produces materially different annual bills. The example below uses one million minutes per year, an average call duration of four minutes (250,000 calls), and a 60% resolution rate where relevant.
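
As a rough illustration of the arithmetic, the sketch below applies three structures to the stated volume. The volume, call duration, and resolution rate come from the text; every rate and fee is a hypothetical placeholder, and the choice of per-minute, per-resolution, and platform-plus-usage as the three structures is an assumption.

```python
ANNUAL_MINUTES = 1_000_000  # from the text
AVG_CALL_MINUTES = 4        # from the text
RESOLUTION_RATE = 0.60      # from the text

# Hypothetical rates, chosen only to make the arithmetic visible.
PER_MINUTE_RATE = 0.12       # GBP per minute, all-in
PER_RESOLUTION_PRICE = 0.90  # GBP per resolved call
PLATFORM_FEE = 100_000       # GBP per year
USAGE_RATE = 0.05            # GBP per minute on top of the platform fee


def annual_bills(minutes, avg_call_minutes, resolution_rate):
    """Annual cost under each pricing structure for the same call volume."""
    calls = minutes / avg_call_minutes
    resolved_calls = calls * resolution_rate
    return {
        "per_minute": minutes * PER_MINUTE_RATE,
        "per_resolution": resolved_calls * PER_RESOLUTION_PRICE,
        "platform_plus_usage": PLATFORM_FEE + minutes * USAGE_RATE,
    }


bills = annual_bills(ANNUAL_MINUTES, AVG_CALL_MINUTES, RESOLUTION_RATE)
for model, cost in bills.items():
    print(f"{model:>20}: £{cost:,.0f}")
```

Swap in the quoted rates from each vendor and your own volume to reproduce the comparison; the ranking can change with the inputs.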

### Contract terms that move the number more than the unit rate

Below the headline rate sit the levers that decide what you actually pay. Negotiate these first; negotiate the unit rate last.

- Volume commits and rollovers — annualised pools beat monthly use-it-or-lose-it by a wide margin
- Ramp curves — committed minimums should scale with your deployment phases, not your contract start date
- Most-Favoured-Nation clauses — material in a market where compute costs are falling year-on-year
- Price-protection windows — cap on annual increases, with a clear basis for any pass-through changes
- Burst overage rates — confirm the per-minute rate that applies above your committed pool
- Exit and data portability — extraction of prompts, voice clones, conversation logs, and tuning data, with timelines and fees defined
- Sub-processor change notification — N days' notice and a right to refuse for material changes

### The seven questions that flush out the real number

Send these in the RFP, not in the negotiation. The answers shape the comparison; chasing them late wastes weeks.

### Key takeaways

- Five pricing structures dominate in 2026: per-minute, per-session, per-resolution, per-seat-equivalent, platform-plus-usage
- Unbundled per-minute rates almost always exclude premium voices, higher-capability LLMs, telephony, observability seats, and sandboxes
- Per-resolution pricing's binding term is the definition of 'resolution' — four common definitions, each with a different loophole
- Contract terms (commits, ramps, MFN, exit) typically move TCO by 30–60% versus the quoted unit rate
- Model TCO at your real volume, mix, and retention window — not at the vendor's reference example

### FAQs

**How do voice AI per-minute rates differ from telephony costs?**

The voice AI rate covers ASR, LLM inference, TTS, and platform compute. Telephony (PSTN and SIP trunking) is the carrier cost of the call itself, and is usually billed separately — either passed through at cost or marked up by the AI platform.

**What is a typical enterprise platform fee range?**

For enterprise-grade platforms with SSO, observability, and dedicated environments, annual platform fees commonly run £50,000–£250,000 in 2026, depending on security posture, integration scope, and support tier. The fee replaces some per-minute margin, not all of it.

**Should I prefer per-minute or per-resolution pricing?**

Per-minute is more predictable and almost always cheaper at low containment. Per-resolution becomes attractive at higher containment and on intents where 'resolution' is contractually well defined. Below 40% containment, per-resolution is usually more expensive on the same volume.

**Are LLM token costs included in the per-minute rate?**

In 2026, both bundled and pass-through models exist. Bundled rates absorb the LLM cost into the per-minute number and give predictability; pass-through rates quote lower nominal numbers and expose the buyer to model-provider rate-card changes.

**How important are MFN and price-protection clauses?**

Material. Voice AI compute costs are falling year-on-year; without an MFN or a cap on increases, a multi-year contract locks in today's rate while the market re-prices around it. Insist on annual review with a defined adjustment basis.

**What is the most-overlooked cost in voice AI procurement?**

Recording and transcription retention against the buyer's actual compliance window. A seven-year retention obligation in financial services multiplies the storage line in ways most TCO models understate at signature.
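
The retention effect is easy to underestimate because stored volume accumulates until the oldest recordings age out. The sketch below is a simplified model, assuming flat monthly volume and a flat per-GB monthly price; the storage footprint per minute and the price are hypothetical.

```python
def cumulative_storage_cost(monthly_gb, price_per_gb_month, retention_months, horizon_months):
    """Total storage spend over the horizon, where each month's recordings
    are kept for retention_months before deletion."""
    total = 0.0
    for month in range(1, horizon_months + 1):
        # Stored volume grows month by month until the retention window is full.
        stored_gb = monthly_gb * min(month, retention_months)
        total += stored_gb * price_per_gb_month
    return total


# Hypothetical inputs: one million minutes a year at roughly 1 MB per recorded minute.
monthly_gb = (1_000_000 / 12) * 0.001
price = 0.02  # GBP per GB per month, placeholder

one_year = cumulative_storage_cost(monthly_gb, price, retention_months=12, horizon_months=84)
seven_year = cumulative_storage_cost(monthly_gb, price, retention_months=84, horizon_months=84)
print(f"1-year retention: £{one_year:,.2f}")
print(f"7-year retention: £{seven_year:,.2f}")
print(f"multiplier: {seven_year / one_year:.2f}x")
```

Under these assumptions the seven-year window costs roughly 3.8 times the one-year window over the same seven years, before any per-retrieval charges.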

---

## AI call centre software in 2026: a vendor-neutral buyer's guide

URL: /guides/ai-call-center-software-buyers-guide-2026
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** AI call centre software is not one product category. Conversational IVR, agent-assist, and autonomous voice agents are three different procurements with three different ROI profiles. A vendor-neutral evaluation names which category you are buying first, then scores against integration depth and observability — not feature counts.

### Three categories the marketing pages blur together

Every vendor calls itself 'AI for the contact centre'. Past the homepage, the actual products fall into three distinct categories with different ROI profiles, different integration burdens, and different risk surfaces. Naming the category you are buying is the prerequisite to a defensible shortlist.

### Capability tiers — what 'AI' actually means in each tier

Below the category, the capability tier determines what the system can do without a human in the loop. The cheapest mistake is paying tier-three prices for tier-one capability; the most expensive is the reverse.

### Integration depth is the real moat

Demos compare voice quality and latency because those are the cheapest things to demo. Production deployments live or die on the depth of integration against the systems of record. A platform that cannot do bidirectional writes against the CRM, case-management, and billing systems is a containment ceiling pretending to be a product.

- Read-only against CRM is a demo; bidirectional write-through is a deployment
- Identity verification has to compose with the existing IAM stack, not replace it
- Case-management write-through has to include reason codes, transcript, and disposition — not just 'AI handled this'
- Telephony integration must support warm transfer with context payload, not blind transfer
- Real-time event streaming to the data warehouse is table stakes for any QA or product-analytics use

### Observability and audit — the under-scored axis

Most scorecards over-weight voice quality and under-weight observability. In production, the team that owns the deployment spends more time reading transcripts, diffing prompt changes, and exporting evidence for compliance than tuning prosody. Buy for the operating model, not the demo.

- Full-fidelity transcript and tool-call trace per conversation, queryable by intent and outcome
- Prompt and flow versioning with diff review, staging, and one-click rollback
- Reason-code tagging that the contact-centre team can change without an engineering ticket
- Export of evidence packs for regulators — transcripts, decisions, model versions, prompts at time of call
- SCIM provisioning, SSO, and audit log access — not behind a separate 'Enterprise' SKU

### A vendor-neutral scoring rubric

Score each shortlisted platform against the eight dimensions below on a 1–5 scale, weighted by deployment phase. Capability and integration carry more weight in tier-three buys; observability and contract terms carry more weight at every tier than most shortlists give them.
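
The weighting described above amounts to a weighted mean. The sketch below is a generic version, not the rubric itself: it uses only the four dimensions named in this paragraph, and the weights and scores are hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted mean of 1-5 scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for dimension, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{dimension}: score must be between 1 and 5")
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical weights for a tier-three buy; extend with the remaining dimensions.
weights = {"capability": 3, "integration": 3, "observability": 2, "contract_terms": 2}

platform_a = {"capability": 5, "integration": 2, "observability": 3, "contract_terms": 3}
platform_b = {"capability": 3, "integration": 4, "observability": 4, "contract_terms": 4}

print(round(weighted_score(platform_a, weights), 2))
print(round(weighted_score(platform_b, weights), 2))
```

In this made-up case the platform with the stronger demo capability scores lower overall once integration, observability, and contract terms are weighted in.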

### The seven questions that separate marketing from architecture

Send these in the RFP. Vendors that handle them cleanly belong on the shortlist; vendors that route them to a follow-up call rarely improve in the next round.

### Red flags that should drop a vendor from the shortlist

Some answers are signals on their own. Any of the below should trigger a hard conversation before the next round, not after contract signature.

- Containment benchmarks quoted without naming the intent mix, call sample, or measurement window
- PCI compliance claimed without a documented pause-and-resume DTMF pattern for cardholder data
- 'Enterprise SSO' priced as a separate SKU above the platform fee
- References that are all pilots and PoCs, with no production deployment at comparable scale
- Sub-processor list that omits the underlying LLM provider, or refuses to disclose it
- Operating model where every prompt change requires an engineering ticket

### Key takeaways

- AI call centre software is three categories — conversational IVR, agent-assist, autonomous voice agent — not one
- Capability tier (scripted NLU, retrieval-augmented, tool-using agent) decides what runs without a human
- Integration depth against the systems of record is the real moat, not voice quality or latency
- Observability, prompt versioning, and evidence export carry more weight in production than demo features
- Eight-dimension scoring rubric beats feature-checklist procurement at every deployment tier

### FAQs

**Is AI call centre software the same as a contact centre as a service (CCaaS) platform?**

No. CCaaS is the underlying telephony, routing, and agent-desktop infrastructure. AI call centre software sits on top — either embedded by the CCaaS vendor or integrated by a specialist. Most enterprise deployments keep CCaaS and AI as separate procurements so each can be replaced independently.

**Should we buy AI call centre software from our existing CCaaS vendor?**

Sometimes. The integration is easier and the contract is simpler, but the capability ceiling is usually lower than specialist platforms, particularly at tier three (autonomous agent). For containment on narrow intents the bundled option is often defensible; for broader intent variance, a specialist usually outperforms.

**What is the typical implementation timeline?**

Conversational IVR on a narrow intent set: 8–12 weeks to production. Agent-assist rollout across a contact-centre team: 12–20 weeks. Autonomous voice agent against the systems of record: 16–32 weeks before the first intent is in production, with phased intent expansion thereafter.

**How do we benchmark vendors on latency without running a bake-off?**

Ask for P95 turn-taking latency at your projected concurrent call volume in your nearest region, measured over a seven-day window. Any vendor that can answer in writing belongs on the shortlist; any vendor that can't has not deployed at your scale.
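
If a vendor supplies raw per-turn samples rather than a headline figure, P95 is straightforward to verify. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the sample values are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical turn latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [620] * 18 + [1900, 2400]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(f"mean: {mean_ms:.0f} ms, P95: {p95_ms} ms")
```

The mean hides the tail that callers actually notice, which is why the question asks for P95 rather than an average.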

**What's a realistic containment range to expect across the three categories?**

Conversational IVR on narrow intents: 30–55%. Agent-assist does not contain — its lever is handle-time. Autonomous voice agent across broader intent variance: 20–45% on a representative call sample, higher on curated demo sets. Anything above 60% in a vendor pitch deck is almost always a curated number.

**Do we need a separate AI governance framework for this?**

Yes, if your existing governance does not cover automated decisioning, sub-processor disclosure, model-change notification, and evidence export for regulators. The voice channel is not exempt from the AI-governance work the rest of the organisation is doing.

---

## Voicebots in the enterprise: where they fit, what they cost, and how they fail

URL: /guides/voicebot-enterprise-guide
Published: 2026-06-15 · Updated: 2026-06-15

**Bottom line up front.** A voicebot is the entry tier of voice AI: narrow intents, scripted flows with NLU on the front end, predictable economics. It is cheaper and faster to deploy than an autonomous voice agent and more capable than touch-tone IVR. The trap is assuming it scales into either neighbour — it doesn't.

### Voicebot, voice AI, autonomous agent — the terms are not interchangeable

The market uses 'voicebot', 'voice AI', and 'autonomous voice agent' as if they were synonyms. They are not. Each describes a different capability ceiling, a different deployment risk, and a different ROI profile.

### Use cases where voicebots actually pay back

Voicebots win on narrow, high-volume, structured intents where the cost of a human alternative is high and the cost of a misroute is low. The deployment pattern is identical across industries: small intent set, real integration against one system of record, clear escalation path.

- Appointment scheduling, rescheduling, and cancellation against a real calendar system
- Order, delivery, and shipment status against an order-management system
- Account balance, payment status, and transaction history against a billing system
- Outage and service-status broadcasts with structured intake of address or account
- Password resets and account unlocks composed with the existing identity stack
- Pre-call qualification and warm transfer — the AI doesn't contain, it shortens the human call

### What a voicebot actually costs in 2026

Fully loaded voicebot economics in 2026 sit well below autonomous voice agents and well above touch-tone IVR. The numbers below assume a modern enterprise stack — cloud telephony, retrieval against a curated knowledge source, observability, and a contact-centre operating model.

### The failure modes that show up in month three

Voicebot deployments rarely fail at launch — they fail in month three when the long-tail intents start hitting and the fallback rate creeps up. The same handful of failure modes account for the majority of programme stalls.

### When to upgrade from a voicebot to an autonomous voice agent

A voicebot is the right answer until the intent variance you need to handle outgrows scripted flows. The signals are observable in your own data; don't wait for a vendor to tell you.

- Fallback rate sits above 20% on calls that are in your intended scope
- More than a third of contained calls re-contact within seven days
- Operations is maintaining more than ~30 distinct intent flows and the maintenance burden is the bottleneck
- The roadmap requires multi-turn reasoning across systems of record, not single-intent self-service
- Customers are asking compound questions that span multiple intents in one turn
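
The three quantitative signals in the list above can be checked against your own reporting data. The sketch below is a minimal version with hypothetical field names; the thresholds come from the list, and the two qualitative signals (roadmap and compound questions) are left to human judgement.

```python
from dataclasses import dataclass


@dataclass
class VoicebotStats:
    """Counts for one reporting period (field names are illustrative)."""
    in_scope_calls: int
    fallback_calls: int   # in-scope calls that hit the fallback path
    contained_calls: int
    recontacts_7d: int    # contained calls that re-contacted within seven days
    intent_flows: int     # distinct intent flows under maintenance


def upgrade_signals(s: VoicebotStats) -> dict:
    """Evaluate the quantitative upgrade signals against their thresholds."""
    return {
        "fallback_above_20pct": (
            s.in_scope_calls > 0 and s.fallback_calls / s.in_scope_calls > 0.20
        ),
        "recontact_above_one_third": s.recontacts_7d * 3 > s.contained_calls,
        "more_than_30_flows": s.intent_flows > 30,
    }


# Hypothetical month of data.
stats = VoicebotStats(in_scope_calls=10_000, fallback_calls=2_500,
                      contained_calls=4_000, recontacts_7d=1_000, intent_flows=34)
signals = upgrade_signals(stats)
print(signals)
```

A flow count above 30 only matters if maintenance is genuinely the bottleneck, so treat that flag as a prompt to ask the operations team rather than a verdict.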

### A six-question scorecard for choosing a voicebot

Voicebot procurement is a smaller scorecard than autonomous voice agent procurement, but the same six questions consistently separate the platforms that scale from the platforms that stall.

### Key takeaways

- A voicebot is the entry tier of voice AI — narrow scope, scripted flows with NLU, predictable economics
- Fully loaded 2026 cost is £0.06–£0.18/min plus a £20k–£90k platform fee, before operating-model labour
- Six failure modes account for most month-three stalls; intent creep and fallback erosion lead the list
- Containment caps at 25–50% on the in-scope intent set — beyond that you are buying an autonomous agent
- Voicebot and conversational IVR are the same product category under two names — pick by vendor terminology

### FAQs

**What's the difference between a voicebot and a chatbot?**

Channel and constraints. A chatbot operates in text, where users tolerate longer turns and visible UI affordances. A voicebot operates on the phone, where turn-taking latency under 800ms is table stakes, there is no visual fallback, and barge-in handling determines whether the experience feels human or robotic. The underlying NLU can be shared; the conversational design rarely is.

**Is a voicebot the same thing as a conversational IVR?**

In practice, yes — 'voicebot' is the older term and 'conversational IVR' is the term most vendors prefer in 2026. Both describe a natural-language front end on a scripted intent set with deterministic flows behind it. See the conversational IVR guide for the modern framing.

**Can a voicebot handle PCI cardholder data?**

Only with a pause-and-resume DTMF capture pattern that keeps the digits out of the LLM context window. The voicebot orchestrates the call; a separate, certified capture flow handles the actual card number. Any vendor claiming generic PCI compliance without that pattern is selling marketing.

**What containment rate should we expect from a voicebot?**

25–50% on the intent set the voicebot is scoped for, measured on a representative call sample. The range is wider than for autonomous voice agents because scope discipline is the dominant variable — a tightly-scoped voicebot can hit the high end; a sprawling intent set drags toward the low end.

**How long does it take to deploy a voicebot?**

8–12 weeks from contract to first production intent for a narrow scope against a well-integrated system of record. Add 2–4 weeks per regulated-industry control set, and 4–8 weeks if the integration is new rather than reusing an existing connector.

**Will a voicebot replace our contact-centre agents?**

No, and selling it internally on that basis usually fails. Voicebots remove a slice of structured, repetitive volume and shorten the calls that still reach a human. The economic story is volume deflection plus handle-time reduction, not headcount replacement — that framing also survives a works-council conversation that pure-headcount stories don't.

---

# Glossary

## Voice AI

URL: /glossary/voice-ai

Voice AI is a class of conversational AI that handles spoken telephone interactions end-to-end. It combines speech-to-text, a language model, and text-to-speech with telephony and integration layers so it can listen, understand intent, take action against systems of record, and respond in natural speech.

**How is voice AI different from an IVR?** An IVR follows scripted menus and accepts keypad or constrained voice input. Voice AI understands open-ended speech, holds context across turns, and can call into systems of record to take action — which is what allows it to resolve calls rather than route them.

**How is voice AI different from a chatbot?** Chatbots operate in text, usually asynchronously. Voice AI operates in real-time speech, which adds latency, naturalness, barge-in, and telephony constraints that text chat does not face.

**Is voice AI the same as agentic voice?** Not quite. Agentic voice usually refers to voice AI systems that plan multi-step actions across tools and systems of record; voice AI is the broader category.

---

## Containment rate

URL: /glossary/containment-rate

Containment rate is the share of calls handled end-to-end by an automated system — usually voice AI or an IVR — without escalation to a human agent. Formally: contained calls divided by total in-scope calls, over a defined time window.

**What is a good containment rate?** There is no universal benchmark. Transactional intents commonly reach 60–80% in production; complex enterprise call mixes more often sit in the 25–45% band. Compare against your own baseline, not a vendor average.

**Is containment rate the same as deflection rate?** Often used interchangeably. Where they differ, deflection usually includes calls prevented from reaching the queue at all; containment refers to calls that entered the AI flow and were not escalated.

**How should re-contact affect containment measurement?** Re-contact within a defined window for the same intent should be subtracted from the numerator. Without this adjustment, containment overstates resolution.
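
The definition and the re-contact adjustment above reduce to two small formulas. The sketch below shows them side by side; the function names and the sample counts are hypothetical.

```python
def containment_rate(contained_calls, in_scope_calls):
    """Contained calls divided by total in-scope calls for the window."""
    if in_scope_calls <= 0:
        raise ValueError("in_scope_calls must be positive")
    return contained_calls / in_scope_calls


def adjusted_containment_rate(contained_calls, recontacted_calls, in_scope_calls):
    """Containment with same-intent re-contacts subtracted from the numerator."""
    if recontacted_calls > contained_calls:
        raise ValueError("re-contacts cannot exceed contained calls")
    return containment_rate(contained_calls - recontacted_calls, in_scope_calls)


# Hypothetical window: 1,000 in-scope calls, 450 contained, 90 of those re-contacted.
raw = containment_rate(450, 1_000)
adjusted = adjusted_containment_rate(450, 90, 1_000)
print(f"raw: {raw:.0%}, adjusted: {adjusted:.0%}")
```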

---

## Agentic voice

URL: /glossary/agentic-voice

Agentic voice refers to voice AI systems that plan and execute multi-step actions across tools and systems of record during a single call, rather than answering single-turn questions. The defining property is autonomous tool use under conversational control.

**How is agentic voice different from regular voice AI?** Regular voice AI responds turn by turn; agentic voice plans a sequence of actions, executes them across tools, and adapts when a step fails. The difference is most visible on intents that require three or more system interactions to resolve.

**What capabilities does agentic voice require?** Reliable tool use, durable state within a call, error handling on failed tool calls, and observability into the plan-execute loop. Without these, a system marketed as agentic will fail on the calls it most needs to handle.

**Is agentic voice mature enough for enterprise production?** For well-scoped intents with bounded tool surfaces, yes. For open-ended intents touching many systems, it is still early — most production deployments deliberately constrain the tool surface.

---

## Autonomous resolution rate

URL: /glossary/autonomous-resolution-rate

Autonomous resolution rate is the share of calls fully resolved by an AI system without human involvement and without the customer re-contacting for the same intent within a defined window (typically 7 days). It is a stricter alternative to containment rate.

**How is autonomous resolution rate calculated?** Contained calls minus calls that re-contacted for the same intent within the defined window, divided by total in-scope calls.

**What window should be used for re-contact?** Seven days is the most common internal standard. Some teams use 14 days for intents with longer resolution cycles such as claims or disputes.

**Is autonomous resolution rate the same as first-call resolution?** Related but not identical. First-call resolution traditionally applies to human-handled calls; autonomous resolution rate is its automation equivalent and uses a comparable re-contact window.
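
The calculation described above can be run directly on call records. The sketch below assumes every record is an in-scope call and that callers and intents are already identified; the `Call` structure is hypothetical, and the seven-day default follows the window given in this entry.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Call(NamedTuple):
    caller: str
    intent: str
    started: datetime
    contained: bool  # True when the AI handled the call without escalation


def autonomous_resolution_rate(calls, window=timedelta(days=7)):
    """Contained calls with no same-caller, same-intent re-contact inside
    the window, divided by all in-scope calls."""
    if not calls:
        return 0.0
    resolved = 0
    for call in calls:
        if not call.contained:
            continue
        recontacted = any(
            other.caller == call.caller
            and other.intent == call.intent
            and call.started < other.started <= call.started + window
            for other in calls
        )
        if not recontacted:
            resolved += 1
    return resolved / len(calls)


# Hypothetical records: caller A is contained, then returns two days later.
day1 = datetime(2026, 6, 1, 10, 0)
calls = [
    Call("A", "billing", day1, contained=True),
    Call("A", "billing", day1 + timedelta(days=2), contained=False),
    Call("B", "balance", day1, contained=True),
    Call("C", "outage", day1, contained=False),
]
print(autonomous_resolution_rate(calls))
```

Raw containment on this sample would be 2 of 4; the re-contact by caller A brings the autonomous resolution rate down to 1 of 4.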

---

## Voice AI latency

URL: /glossary/voice-ai-latency

Voice AI latency is the end-to-end delay between the caller finishing speaking and the AI beginning to respond. It combines speech-to-text, language model inference, text-to-speech, and any integration calls on the critical path.

**What is an acceptable voice AI latency?** Under 1.5 seconds end-to-end is the practical target for production voice AI in 2026. Under 1 second is achievable with modern streaming stacks and disciplined integration design.

**What contributes most to voice AI latency?** Integration calls on the critical path, followed by language model inference. Speech-to-text and text-to-speech are usually small contributors when streaming.

**How is voice AI latency measured?** From the end of the caller's utterance (silence detection or end-of-turn) to the first audio frame returned by the AI. Measuring only model inference understates real latency.
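
The measurement boundary matters more than the tooling. The sketch below contrasts the two measurements, assuming per-turn event timestamps in milliseconds; the event names and the sample values are hypothetical.

```python
def turn_latency_ms(events: dict) -> float:
    """End of the caller's utterance to the first audio frame returned."""
    return events["first_audio_frame"] - events["end_of_utterance"]


def inference_only_ms(events: dict) -> float:
    """Model inference alone, which is what is often quoted instead."""
    return events["llm_done"] - events["llm_start"]


# Hypothetical timestamps for one turn, in ms from the end of the utterance.
turn = {
    "end_of_utterance": 0,
    "llm_start": 180,          # after end-of-turn detection and ASR finalisation
    "llm_done": 520,
    "first_audio_frame": 890,  # after an integration call and TTS start-up
}

print(turn_latency_ms(turn), inference_only_ms(turn))
```

In this made-up turn, inference accounts for 340 of 890 ms, so quoting inference alone would report well under half of what the caller experiences.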

---

## IVR replacement

URL: /glossary/ivr-replacement

IVR replacement is the migration from a touch-tone or directed-dialogue IVR to a voice AI system that handles open-ended speech and can take action against systems of record. It is rarely a like-for-like swap — the new system absorbs flows the IVR used to route away.

**Is voice AI always cheaper than IVR?** On a per-minute basis, no — voice AI has variable speech, language model, and TTS costs that IVR does not. The case for replacement rests on resolution rate and customer experience.

**Should IVR replacement be all-at-once or phased?** Phased almost always wins. Migrating one intent or intent cluster at a time allows measurement against the IVR baseline and keeps blast radius low if something regresses.

**What should be kept from the legacy IVR?** The routing and disaster-recovery paths. A voice AI deployment should fall back cleanly to the legacy IVR for any intent it is not configured to handle.

---

## Barge-in

URL: /glossary/barge-in

Barge-in is the ability for a caller to interrupt a voice AI mid-utterance and have the system stop speaking, listen, and respond to the interruption naturally. Without barge-in, the agent has to finish every sentence before the caller can react, which collapses the perceived realism of the interaction.

**Why is barge-in important for voice AI?** Because humans interrupt each other constantly in real conversation. A voice AI that cannot be interrupted feels mechanical within two turns and trains the caller to wait, which inflates handle time.

**What does poor barge-in look like in production?** The system either ignores the interruption entirely, stops but forgets what was being discussed, or stops but cannot recover when the caller's input was not a clean turn boundary.

**Is barge-in always desirable?** Almost always for general conversation. For legally required disclosures or compliance scripts, deployments often disable barge-in for specific segments and enable it everywhere else.

---

## Turn-taking latency

URL: /glossary/turn-taking-latency

Turn-taking latency is the delay between the caller finishing speaking and the voice AI recognising the turn has ended and beginning to respond. It combines end-of-turn detection, speech-to-text finalisation, language model inference, and text-to-speech start time. It is the most-felt component of perceived voice AI quality.

**What is a good turn-taking latency?** Under 1 second is achievable with streaming stacks and disciplined integration design. Under 1.5 seconds is the practical production target. Above 2 seconds, callers notice and start to disengage.

**Is turn-taking latency the same as voice AI latency?** Closely related. Voice AI latency usually refers to the full end-to-end delay; turn-taking latency specifically isolates the gap between turns, which is what the caller actually perceives.

**What contributes most to turn-taking latency?** End-of-turn detection and any tool calls on the critical path. Speech-to-text and text-to-speech are usually small contributors when streaming.

---

## Intent recognition

URL: /glossary/intent-recognition

Intent recognition is the process by which a voice AI identifies what the caller is trying to achieve, mapping open speech to a structured intent the system can act on. Modern LLM-driven voice AI often handles this with prompting rather than a separate classifier, but the function — turning ambiguous speech into a routable intent — is the same.

**How is intent recognition different in LLM-based voice AI?** Older systems used a discrete classifier with a fixed intent list; modern LLM voice AI can interpret intent directly from the conversation, which handles ambiguity better but makes evaluation harder. Most production systems combine both for observability.

**What is intent drift?** When the mix of intents the AI encounters in production diverges from what it was designed for — usually because customers learn what the AI can handle and start asking for things outside scope. Drift is normal; not measuring it is the problem.

**How many intents should a voice AI handle?** Start narrow. Three to seven well-handled intents in production beat thirty intents in a demo. Add intents one at a time, with measured success criteria for each.
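
Intent drift, as described above, can be tracked by comparing the design-time intent mix with what production traffic actually contains. The sketch below uses total variation distance, which is one reasonable metric chosen here for illustration rather than a standard the entry prescribes; the intent names and shares are hypothetical.

```python
from collections import Counter


def intent_mix(intents):
    """Convert a list of intent labels into a share-per-intent mapping."""
    counts = Counter(intents)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute share differences: 0 means identical mixes,
    1 means no overlap at all."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical design-time mix versus a week of production labels.
designed = {"billing": 0.5, "balance": 0.3, "outage": 0.2}
production = intent_mix(
    ["billing"] * 4 + ["balance"] * 2 + ["outage"] * 1 + ["complaint"] * 3
)

print(round(total_variation_distance(designed, production), 3))
```

Computing this on a fixed cadence gives a single number to watch; a rising value is the cue to review which new intents are arriving.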

---

## DTMF fallback

URL: /glossary/dtmf-fallback

DTMF (dual-tone multi-frequency) fallback is the design pattern of capturing sensitive input — card numbers, PINs, account numbers — through keypad tones rather than speech, so the voice AI, recording layer, and underlying model never hear the digits. It is the standard PCI-safe capture pattern for voice AI deployments.

**Is DTMF fallback required for PCI compliance?** Not by name, but a pattern that keeps card data out of the recorded audio and out of the model is required. DTMF capture routed to a PCI-scoped service is the most common implementation.

**When else is DTMF used in voice AI?** As a UX fallback when speech recognition fails repeatedly on the same digit-heavy input, and for any input the caller cannot reasonably speak aloud, such as in noisy or public environments.

**Does DTMF capture work with modern voice AI platforms?** Yes, but support varies. Confirm during evaluation that the platform supports DTMF capture, pause-and-resume recording, and routing of the captured digits to a separately scoped service without the model seeing them.

---

## Voice biometrics

URL: /glossary/voice-biometrics

Voice biometrics is the use of a caller's unique voice characteristics to verify identity. Modern implementations are usually passive — running in the background during the conversation — and combined with knowledge or device factors to meet step-up authentication requirements.

**Is voice biometrics secure enough to replace passwords?** On its own, rarely. Modern deployments use voice biometrics as one factor in a layered design, combined with device signals, knowledge factors, and risk-based step-up. The layered design meets authentication requirements that voice alone would not.

**Can voice biometrics be fooled by synthetic voice?** Synthetic-voice attacks are a known threat, especially against voice-only authentication. Modern platforms add liveness detection, anti-spoofing, and behavioural signals to mitigate; the threat model is real and evolving.

**Does voice biometrics need explicit consent?** In most UK and EU contexts, yes — biometric data is special-category and requires explicit consent, retention limits, and DPIA evidence. Treat consent capture as part of the deployment design, not an afterthought.

---

## Real-time transcription

URL: /glossary/real-time-transcription

Real-time transcription is the streaming conversion of spoken audio to text with low enough latency that downstream systems — voice AI, agent assist, supervisor dashboards, compliance flags — can act on it during the call rather than after it. It is the input layer of every voice AI system.

**Is real-time transcription the same as voice AI?** No. Transcription is the speech-to-text input layer; voice AI uses transcription plus a language model and text-to-speech to hold a conversation. Many contact centres deploy transcription for agent assist and observability without deploying full voice AI.

**What accuracy should I expect from real-time transcription?** Word error rates of 5–15% on clean enterprise audio in the trained language are typical; higher with strong accents, heavy code-switching, or noisy lines. Evaluate on your own recorded audio, not a vendor demo set.
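
To evaluate on your own audio, word error rate can be computed as the word-level edit distance between a human reference transcript and the engine output, divided by the number of reference words. The sketch below is a minimal version: it assumes lower-casing and whitespace tokenisation are adequate normalisation, which real evaluations usually refine. The transcripts are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical human transcript and engine output for the same utterance.
reference = "i would like to change my delivery address please"
hypothesis = "i would like to change the delivery address"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.1%}")  # prints 22.2% (one substitution, one deletion, nine reference words)
```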

**Does real-time transcription introduce compliance risk?** Yes — transcripts are personal data and often contain special-category data. Storage, retention, and access controls for transcripts should match the controls on the underlying call recordings.

---

## Deflection rate

URL: /glossary/deflection-rate

Deflection rate is the share of inbound contacts moved out of the live-agent queue into an automated or asynchronous channel — voice AI, SMS, chat, web self-service, or proactive notification. Gross deflection counts every deflected contact; net deflection subtracts contacts that returned within a defined window for the same intent.
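
A minimal sketch of the gross and net calculation, assuming three counts are available for the period. The field names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeflectionCounts:
    inbound_contacts: int       # every inbound contact in the period
    deflected: int              # contacts moved out of the live-agent queue
    recontacted_in_window: int  # deflected contacts that returned for the same intent


def gross_deflection(c: DeflectionCounts) -> float:
    return c.deflected / c.inbound_contacts


def net_deflection(c: DeflectionCounts) -> float:
    return (c.deflected - c.recontacted_in_window) / c.inbound_contacts


# Hypothetical month: 10,000 contacts, 6,000 deflected, 900 back within the window.
month = DeflectionCounts(inbound_contacts=10_000, deflected=6_000, recontacted_in_window=900)
gross = gross_deflection(month)
net = net_deflection(month)
print(f"gross {gross:.0%}, net {net:.0%}")  # prints gross 60%, net 51%
```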

**Is deflection rate the same as containment rate?** Closely related. Deflection often includes contacts prevented from reaching the queue at all; containment usually refers specifically to calls that entered the AI flow and were not escalated.

**What window should be used for re-contact?** Seven days is the most common standard; 14 days for intents with longer natural resolution cycles such as claims or disputes.

**What is a good deflection rate?** Net 50–75% on transactional intents, 25–45% on mixed, 10–25% on complex. The blended figure depends entirely on intent mix.

---

## First call resolution

URL: /glossary/first-call-resolution

First call resolution (FCR) is the share of customer contacts fully resolved in the first interaction without a follow-up for the same issue within a defined window — typically 7 days. It is the human-channel ancestor of autonomous resolution rate.

**How is FCR different from autonomous resolution rate?** FCR applies to any first interaction (typically human); autonomous resolution rate is its automation equivalent. Both subtract re-contact within a defined window.

**What window should be used?** Seven days for most intents; 14 days for claims or disputes with longer cycles.

**Is FCR worth measuring if we have AHT?** Yes. AHT and FCR usually trade off against each other: squeezing handle time tends to lower first call resolution, which is exactly why both must be measured to spot perverse incentives.

---

## Average handle time

URL: /glossary/average-handle-time

Average handle time (AHT) is the mean total time an agent spends per contact — talk time plus hold time plus after-call work — typically reported in seconds. It is the dominant productivity metric in voice contact centres and the most-gamed.
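
As a worked sketch of the definition, the example below computes AHT from per-contact timings and shows what happens when after-call work is left out. The three contacts are invented.

```python
from statistics import mean

# Hypothetical contacts: (talk, hold, after-call work), all in seconds.
contacts = [(240, 30, 60), (180, 0, 45), (420, 90, 120)]


def average_handle_time(contacts: list[tuple[int, int, int]]) -> float:
    """Mean of talk + hold + after-call work per contact, in seconds."""
    return mean(talk + hold + acw for talk, hold, acw in contacts)


aht = average_handle_time(contacts)
aht_without_acw = mean(talk + hold for talk, hold, _ in contacts)
print(aht, aht_without_acw)  # prints 395 320
```

In this sample, leaving out after-call work understates handle time by 75 seconds per contact.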

**What is a good AHT?** There is no universal target — it depends entirely on intent mix. Compare against your own baseline by intent, never against a blended industry figure.

**Does voice AI reduce AHT?** On contained calls it removes agent time entirely. On escalated calls it usually reduces agent AHT by 30–90 seconds when the AI captures intent and identity before transfer.

**Why is AHT often the wrong target?** Squeezing AHT without watching re-contact pushes problems into a second call. The right joint target is cost per resolved contact, not cost per contact.

---

## After-call work

URL: /glossary/after-call-work

After-call work (ACW) is the time an agent spends completing a contact after the caller has disconnected — notes, case updates, transfers, follow-ups, and any required compliance logging. It is the most under-measured contributor to AHT.

**Is after-call work part of AHT?** Yes — AHT formally includes talk, hold, and ACW. Reports that exclude ACW understate the real cost of a contact.

**Can voice AI replace after-call work?** It can reduce it substantially via auto-summarisation, case pre-fill, and structured transcript handoff. Eliminating it usually requires deeper system-of-record integration.

**How is ACW measured?** Time between call disconnect and the agent marking themselves available. Measured cleanly in modern contact-centre platforms; harder in legacy stacks.

---

## Escalation rate

URL: /glossary/escalation-rate

Escalation rate is the share of calls handled by an automated system that hand off to a human agent before resolution. It is the complement of containment rate: escalation rate equals 100% minus containment rate. Escalation reasons, captured per call, are the single richest input to a voice-AI operating model.

**Is escalation rate the same as transfer rate?** Effectively yes for voice AI. Some platforms distinguish transfers (caller chose) from escalations (system chose); for measurement they are usually combined.

**What is a healthy escalation rate?** Depends entirely on intent mix and on whether escalations resolve cleanly. A 70% escalation rate on complex intents with high post-transfer resolution beats 30% with high re-contact.

**Should we tag escalation reasons?** Yes — escalation reasons drive every meaningful improvement to a voice AI after launch. Untagged escalations leave the operating-model team blind.

---

## End-of-turn detection

URL: /glossary/end-of-turn-detection

End-of-turn detection is the mechanism by which a voice AI decides the caller has finished speaking and it should begin to respond. It combines voice activity detection, semantic completion signals, and timing heuristics. It is usually the largest single contributor to turn-taking latency.

**Why is end-of-turn detection hard?** Because callers pause mid-sentence to think, and silence does not reliably signal a finished turn. Pure timeout-based detection trades off interruption against responsiveness.
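
The trade-off can be seen in a toy model of pure timeout-based detection. This sketch assumes a voice activity detector has already labelled each 20 ms frame as speech or silence. The frame length, function name, and example call are illustrative, and production detectors add semantic signals on top.

```python
from typing import Optional

FRAME_MS = 20  # assumed frame length of the upstream voice activity detector


def end_of_turn_ms(frames: list[bool], silence_timeout_ms: int) -> Optional[int]:
    """Time (ms) at which a pure silence timeout declares end of turn, or None."""
    silent_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_ms += FRAME_MS
            if silent_ms >= silence_timeout_ms:
                return (index + 1) * FRAME_MS
    return None


# Hypothetical caller: speaks 1,000 ms, pauses 500 ms to think,
# speaks another 600 ms (finishing at 2,100 ms), then goes quiet.
frames = [True] * 50 + [False] * 25 + [True] * 30 + [False] * 50

fast = end_of_turn_ms(frames, silence_timeout_ms=300)
slow = end_of_turn_ms(frames, silence_timeout_ms=700)
print(fast, slow)  # prints 1300 2800
```

With a 300 ms timeout the detector fires at 1,300 ms, in the middle of the caller's pause, so the agent interrupts. With 700 ms it waits out the pause but responds 700 ms after the caller actually finishes.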

**Can end-of-turn detection be tuned per intent?** Yes, and it should be. Disclosure scripts tolerate longer pauses; transactional intents benefit from faster turn-taking.

**How much latency does end-of-turn detection add?** Typically 200–800 ms depending on configuration. It is usually the largest single component of turn-taking latency.

---

## Hallucination rate

URL: /glossary/hallucination-rate

Hallucination rate is the share of voice AI utterances that contain a confident statement unsupported by retrieved evidence or current system state — a confidently wrong answer. It is measured per turn or per call and is the most consequential safety metric in regulated deployments.

**How is hallucination rate measured?** By sampling calls and rating each turn against retrieved evidence or system state. Automated graders are useful for triage but human review remains the gold standard for regulated content.
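
Per-turn and per-call rates can diverge sharply on the same sample, so it helps to compute both. The sketch below assumes each sampled turn has already been rated by a reviewer as supported or unsupported. The data structure and the tiny sample are hypothetical and far smaller than a real review set.

```python
def hallucination_rates(calls: list[list[bool]]) -> tuple[float, float]:
    """Return (per-turn rate, per-call rate).

    calls: one list per sampled call; each entry is True when the reviewer
    rated that turn as unsupported by retrieved evidence or system state.
    """
    total_turns = sum(len(turns) for turns in calls)
    if total_turns == 0:
        raise ValueError("no rated turns in the sample")
    flagged_turns = sum(sum(turns) for turns in calls)
    flagged_calls = sum(1 for turns in calls if any(turns))
    return flagged_turns / total_turns, flagged_calls / len(calls)


# Hypothetical sample: four calls of five rated turns each.
sample = [
    [False, False, False, False, False],
    [False, True, False, False, False],
    [False, False, False, False, False],
    [False, False, True, True, False],
]
per_turn, per_call = hallucination_rates(sample)
print(f"{per_turn:.0%} of turns, {per_call:.0%} of calls")  # prints 15% of turns, 50% of calls
```

State which of the two is being reported whenever a hallucination figure is quoted.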

**What hallucination rate is acceptable?** Depends on consequence. Below 0.1% on high-stakes intents (payments, claims, advice) is typical for production; consumer Q&A often tolerates higher rates with clearer guardrails.

**Does retrieval eliminate hallucination?** It reduces hallucination substantially but does not eliminate it — retrieved evidence can be misinterpreted. Combine retrieval with strict refusal patterns and tool-grounded answers.

---

## Voice AI orchestration

URL: /glossary/voice-ai-orchestration

Voice AI orchestration is the layer that coordinates speech-to-text, language-model inference, text-to-speech, tool calls into systems of record, telephony events, and fallback paths into a single coherent call flow. It is the integration substrate that distinguishes a demo from a production-grade deployment.

**Is orchestration the same as the language model?** No. The model handles understanding and response generation; orchestration handles everything around it — speech I/O, tool routing, state, telephony, and fallback.

**What does poor orchestration look like in production?** Long pauses while tool calls run, lost context after barge-in, no graceful degradation when an integration fails, and opaque failure modes for the operating-model team.

**Should orchestration be built or bought?** Buy when the platform's orchestration meets the integration and observability needs; build when it does not and the volume justifies the investment. See the build vs buy comparison for the decision matrix.

---

## SIP trunking

URL: /glossary/sip-trunking

SIP trunking is the delivery of voice calls between an enterprise and a telephony provider over an IP-based signalling protocol (Session Initiation Protocol). It is the substrate every voice AI deployment rides on, and the layer at which residency, recording, and DTMF capture decisions are made.

**Do I need SIP trunking for voice AI?** Effectively yes for enterprise inbound deployments. Even cloud contact-centre platforms terminate on SIP somewhere in the path.

**Does SIP affect voice AI latency?** Yes — codec, region, and trunk configuration each add 20–150 ms. Worth measuring end-to-end during evaluation.

**Is SIP trunking secure?** When configured with TLS for signalling and SRTP for media, yes. Many legacy deployments still run unencrypted SIP — a common audit finding.

---

## Voice cloning

URL: /glossary/voice-cloning

Voice cloning is the synthesis of a custom voice — typically based on samples from a brand actor, voice talent, or executive — for use as the voice AI's text-to-speech output. It combines voice identity (timbre, accent) with prosody control (pace, intonation, emotion).

**Does voice cloning meaningfully improve CSAT?** Rarely. Stock high-quality voices score within margin of error in most controlled tests; cloning is usually a brand decision rather than a CX one.

**What consent is required for voice cloning?** Explicit, written, scoped consent from the voice talent — including allowable contexts and duration of use. UK/EU contexts often add biometric-data treatment under GDPR.

**Can voice cloning be misused?** Yes — voice cloning is the supply side of synthetic-voice fraud. Enterprise programmes should pair cloning with anti-impersonation controls and clear internal use boundaries.

---

## Prompt injection (voice)

URL: /glossary/prompt-injection-voice

Prompt injection in voice AI is a spoken or transcribed attempt to override the agent's instructions, exfiltrate data, or escalate privilege through manipulated dialogue. It is the voice-channel equivalent of the text prompt-injection attack surface and is harder to detect because audio carries fewer attacker fingerprints.

**How is voice prompt injection different from text?** The mechanics are identical once audio becomes text. The difference is that audio carries fewer adversarial signals — no URLs, encoded payloads, or markdown — making automated detection harder.

**What is the highest-impact mitigation?** Strict tool-call scoping: the agent should never have access to system actions it would not need on a legitimate version of the call.
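
A minimal sketch of tool-call scoping, assuming the orchestration layer knows the current intent and checks every requested tool against an allow-list before executing it. The intent names, tool names, and function are hypothetical. The point is that the check lives outside the model, so spoken instructions cannot widen it.

```python
# Hypothetical allow-list: the tools a legitimate call for each intent would need.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "balance_enquiry": {"get_balance"},
    "card_block": {"get_cards", "block_card"},
}


def is_permitted(intent: str, tool: str) -> bool:
    """True only when the tool is explicitly in scope for the intent (deny by default)."""
    return tool in ALLOWED_TOOLS.get(intent, set())


# A caller on a balance enquiry persuades the model to request a card block:
# the orchestration layer refuses, whatever the model was talked into.
print(is_permitted("balance_enquiry", "get_balance"))  # prints True
print(is_permitted("balance_enquiry", "block_card"))   # prints False
print(is_permitted("unknown_intent", "get_balance"))   # prints False
```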

**Should we red-team voice AI?** Yes. Adversarial-dialogue red-teaming is a maturing practice and is increasingly expected in regulated-industry risk reviews.

---

## Conversational design

URL: /glossary/conversational-design

Conversational design is the discipline of shaping voice and chat AI dialogue — turn structure, persona, error recovery, confirmation patterns, escalation language — so the system produces measurable CX outcomes rather than merely accurate responses. It sits between product design, linguistics, and CX operations.

**Is conversational design a job title?** Increasingly yes. The role usually sits in CX operations or product and is sometimes filled by people with linguistics backgrounds. Mature deployments name a conversation owner explicitly.

**What is the most common design mistake?** Optimising for happy-path elegance and ignoring error recovery. In real conversations, most of the value is won or lost on the unhappy paths.

**Does conversational design replace prompt engineering?** No — they are complementary. Prompt engineering is one of the tools conversational design uses; the discipline is broader.

---

## LLM guardrails

URL: /glossary/llm-guardrails

LLM guardrails are the policy and runtime controls that constrain what a language model can say, do, and disclose during a conversation. They include topic restrictions, refusal patterns, tool-call scoping, output validators, and the safety layer that catches violations before they reach the caller.

**Are LLM guardrails the same as system prompts?** Overlapping but not identical. System prompts are one layer; runtime validators, refusal patterns, and tool-call scoping sit alongside them.

**What happens when a guardrail fires?** Best practice is a graceful refusal, optionally a route to a human, and a logged event for the operating-model team to review.
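
The refuse, route, and log pattern can be sketched as an output validator that runs on the draft reply before it reaches text-to-speech. The pattern list, refusal wording, and data structures below are hypothetical. Routing to a human is optional in practice, but this sketch always sets it.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy: phrasing the agent must never deliver to a caller.
BLOCKED_PATTERNS = {
    "financial_advice": re.compile(r"\byou should (buy|sell|invest)\b", re.IGNORECASE),
}
REFUSAL = "I can't advise on that, but I can connect you to a colleague who can help."


@dataclass
class GuardrailResult:
    reply: str            # what the caller actually hears
    fired: Optional[str]  # name of the guardrail that fired, if any
    route_to_human: bool


def apply_guardrails(draft_reply: str, event_log: list) -> GuardrailResult:
    """Validate a draft reply; on a violation, refuse, route, and log the event."""
    for name, pattern in BLOCKED_PATTERNS.items():
        if pattern.search(draft_reply):
            event_log.append({"guardrail": name, "draft_reply": draft_reply})
            return GuardrailResult(REFUSAL, name, True)
    return GuardrailResult(draft_reply, None, False)


event_log: list = []
blocked = apply_guardrails("Given your balance, you should invest in the bond fund.", event_log)
allowed = apply_guardrails("Your balance is available in the app.", event_log)
print(blocked.fired, allowed.fired, len(event_log))  # prints financial_advice None 1
```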

**Can guardrails block legitimate calls?** Yes — over-strict guardrails are a common cause of over-escalation. Tune against measured false-positive rates, not assumptions.

---

## Automated-system disclosure

URL: /glossary/automated-system-disclosure

Automated-system disclosure is the obligation to tell a caller they are interacting with an automated system rather than a human. It is required or expected in most major regulatory regimes and is distinct from recording consent — combining the two is the most common audit finding in early voice-AI deployments.

**Is automated-system disclosure legally required?** It depends on jurisdiction. The EU AI Act requires it for many use cases; multiple US states require it; UK ICO guidance recommends it. Treat it as required by default.

**What is the right wording?** Plain and early: 'You're speaking with an automated assistant.' Do not bury it in a script or conflate it with recording consent.

**Does disclosure hurt CSAT?** Properly worded, no. Hidden automation that becomes obvious later hurts CSAT more than upfront disclosure does.

---

## Voice AI evaluation

URL: /glossary/voice-ai-evaluation

Voice AI evaluation is the structured process of comparing voice AI platforms or deployments against measurable production criteria — integration depth, latency, observability, operating-model fit, safety, control surface, voice quality, telephony reach, and commercial model — rather than demo quality.

**How long should a voice AI evaluation take?** Seven to ten weeks for a defensible enterprise evaluation: two weeks to build the call sample and integration test, four to six weeks running, and one to two weeks to analyse.

**What is the most under-weighted evaluation criterion?** Observability — what the platform lets the operating-model team see after launch. It is the single largest predictor of post-launch improvement.

**Can demos be useful?** As a baseline, yes. As a basis for procurement, no — demos predict almost nothing about production behaviour on your call mix.

---

## Voice AI ROI

URL: /glossary/voice-ai-roi

Voice AI ROI is the measured return on a voice AI programme, expressed defensibly as cost per resolved call against the pre-AI baseline. It includes the operating-model cost (conversation owner, platform owner, observability tooling) and subtracts re-contact within a defined window for the same intent.

**How is voice AI ROI calculated honestly?** Cost per resolved call against the pre-AI baseline, including operating-model cost and subtracting 7-day re-contact for the same intent.
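
A worked sketch of the calculation, with every figure invented for illustration. It contrasts the defensible number with the naive one that leaves out operating-model cost and ignores re-contact. Parameter names are assumptions, and the result still has to be compared against your own pre-AI baseline.

```python
def cost_per_resolved_call(
    platform_cost: float,
    telephony_cost: float,
    operating_model_cost: float,
    contained_calls: int,
    recontacts_in_window: int,
) -> float:
    """Total programme cost divided by contained calls that did not re-contact."""
    resolved = contained_calls - recontacts_in_window
    if resolved <= 0:
        raise ValueError("no resolved calls in the period")
    return (platform_cost + telephony_cost + operating_model_cost) / resolved


# Hypothetical month, in any single currency.
honest = cost_per_resolved_call(
    platform_cost=30_000,
    telephony_cost=6_000,
    operating_model_cost=24_000,  # conversation owner, platform owner, observability tooling
    contained_calls=25_000,
    recontacts_in_window=5_000,   # same-intent re-contact within 7 days
)
naive = (30_000 + 6_000) / 25_000  # no operating-model cost, no re-contact adjustment
print(f"{honest:.2f} vs {naive:.2f}")  # prints 3.00 vs 1.44
```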

**What is the most common overstatement?** Assuming demo-rate containment in production. The defensible measured rate is usually 15–30 percentage points lower.

**Should we model per-resolution pricing?** Always, as one scenario. Per-resolution transfers containment risk to the vendor and is often the most defensible commercial structure for early deployments.

---

# Notes

## Why containment rate is the wrong KPI to put on a dashboard

URL: /notes/containment-rate-wrong-kpi · Published: 2026-06-15

Containment rate is the metric every voice AI deployment reports because it is the easiest one to define. It is also the metric most likely to be quietly inflated, mis-defined, or read in isolation by a steering committee that does not realise how little it is being told.

The honest version of the metric is autonomous resolution rate: contained calls, minus calls that re-contacted for the same intent within a defined window, divided by total in-scope calls. It is harder to measure, smaller in value, and much closer to what the customer would say if you asked them whether the AI fixed their problem.
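
Written out as a formula, using the terms from the definition above:

```latex
\[
\text{autonomous resolution rate}
  = \frac{\text{contained calls} - \text{same-intent re-contacts within the window}}
         {\text{total in-scope calls}}
\]
```

As an illustration with invented numbers: 1,000 in-scope calls, 600 contained, and 90 of those re-contacting within the window gives a containment rate of 60% and an autonomous resolution rate of (600 − 90) / 1,000 = 51%.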

A dashboard that shows containment without autonomous resolution rate next to it is a dashboard that rewards the wrong behaviour. Engineers optimising for containment will push the boundary of "contained" outwards until it includes calls that should have escalated; the re-contact line then quietly absorbs the cost a week later.

If you can only put one number in front of the executive sponsor, make it autonomous resolution rate. If you can put two, add CSAT on escalated calls — the early-warning indicator that scope has been pushed too far.

---

## The integration tax nobody prices in

URL: /notes/integration-tax-nobody-prices-in · Published: 2026-06-15

Most voice AI business cases price the platform, the per-minute cost, and the implementation services. Almost none price the integration depth required to make the platform useful — which is the line item that decides whether the deployment can write a record, schedule a callback, or take a payment without an engineer in the loop.

The pattern is consistent. A pilot is scoped against a read-only API surface because read-only is what the vendor can stand up quickly. The pilot demonstrates that the AI can answer balance and status questions. Stakeholders are pleased. The production gate exposes that resolving a real intent requires write access against a system of record that has authentication, idempotency, and audit requirements the pilot never tested.

By the time that surfaces, the budget has been spent on the platform, not on the integration work. The programme either delays for another quarter to do the integration properly, or it ships a read-only deployment that is technically a voice AI but functionally a self-service lookup.

Price the integration depth from week one. Treat the API surface, the identity layer, the write-side idempotency, and the observability into integration failures as primary line items. The platform is the cheap part of a voice AI programme; the integration is where the resolution comes from.

---

## Latency budgets and why 800 ms breaks the illusion

URL: /notes/latency-budget-and-why-800ms-breaks-the-illusion · Published: 2026-06-15

The most common voice AI complaint after launch is not that the agent gets things wrong — it is that the agent feels off. Almost always, the agent feels off because turn-taking latency has crept past the threshold where conversational realism holds.

Roughly 800 milliseconds is the gap human speakers tolerate between turns without noticing. Past that, the listener registers a pause; past 1.5 seconds, the listener starts to fill the silence; past 2 seconds, the listener concludes the system has not understood and begins again. Each of those behaviours inflates handle time and inflates the perceived error rate even when the model is right.

Most deployments spend the budget on integration calls placed on the critical path. A tool call to a CRM that takes 600 milliseconds eats most of the budget on its own. Moving that call off the critical path — speculating, caching, prefetching, or running it in parallel with TTS — is the single highest-leverage latency optimisation in most stacks.
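
The arithmetic can be sketched as a simple budget: add up the stages that sit on the critical path of a turn and compare the total with the 800 ms threshold. Only the 600 ms CRM call comes from the example above; the other stage latencies are assumptions chosen for illustration.

```python
from dataclasses import dataclass

TURN_BUDGET_MS = 800


@dataclass
class Stage:
    name: str
    latency_ms: int
    on_critical_path: bool = True


def turn_latency_ms(stages: list[Stage]) -> int:
    """Sum the latency of every stage the caller has to wait for."""
    return sum(s.latency_ms for s in stages if s.on_critical_path)


def build_stages(crm_on_critical_path: bool) -> list[Stage]:
    return [
        Stage("end-of-turn detection", 300),     # assumed
        Stage("model first token", 250),         # assumed
        Stage("CRM lookup", 600, crm_on_critical_path),
        Stage("text-to-speech first audio", 150),  # assumed
    ]


blocking = turn_latency_ms(build_stages(crm_on_critical_path=True))
prefetched = turn_latency_ms(build_stages(crm_on_critical_path=False))
print(blocking, prefetched, prefetched <= TURN_BUDGET_MS)  # prints 1300 700 True
```

This models the best case, where the lookup has fully completed before the turn needs it. A call that only partly overlaps with other stages recovers less.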

Streaming end-to-end matters too. Streaming speech-to-text into a streaming language model into streaming text-to-speech compresses the budget because the system can start speaking before the model has finished. Non-streaming stacks rarely get under 1.2 seconds in production. Streaming stacks routinely get under 800 milliseconds.

---