Voice AI security and compliance: the enterprise buyer's checklist
- Procurement / IT-Sec
- VP / COO
- CX directors
Voice AI security is not a model problem — it is a data-flow problem. The questions that decide whether a deployment is approvable concern where audio, transcripts, and PII travel; what the model provider retains; how recording consent is captured; and whether the deployment survives a regulator's data-flow diagram.
The four data flows that decide approval
Most security reviews collapse into four flows. Get clean answers on each, in writing, before procurement closes.
- Audio in transit — codec, encryption, routing path, geographic transit
- Audio at rest — recording storage, retention window, encryption at rest, deletion guarantee
- Transcript and prompt — where it is stored, who can see it, whether it is used for model training
- PII in tool calls — what is sent to systems of record, what is masked or tokenised before reaching the model
The compliance regimes that show up most often
Different regimes care about different parts of the stack; a single "compliance" answer rarely covers them all.
- PCI DSS — applies the moment card data enters the audio stream; usually requires pause-and-resume or DTMF capture so the model never hears the digits
- HIPAA — applies to PHI in healthcare contexts; requires BAAs across every data-handling vendor including model providers and recording storage
- GDPR / UK GDPR — lawful basis, recording consent, data subject rights, transfer mechanisms outside the UK/EEA, DPIA artefacts
- FCA / financial-services rules — call recording retention, vulnerable-customer handling, fair treatment evidence
- Sector-specific — telco lawful intercept, insurance complaint logging, public-sector accessibility duties
Data residency — the question that breaks most pilots
Many voice AI platforms advertise multi-region deployment but route inference, fine-tuning, or evaluation through a single region. UK and EU buyers should ask, in writing: for every call, where are speech-to-text, language-model inference, text-to-speech, and recording storage each physically processed and stored? "Available in EU" is not the same as "runs in EU end-to-end."
Recording consent and the voice AI exception
A voice AI announcement is not automatically a recording disclosure. Many jurisdictions require both — that the caller is informed they are speaking with an automated system and, separately, that the call is recorded. Conflating the two is the most common consent failure surfaced by post-deployment audits.
Model-provider data handling
The single highest-leverage clause to negotiate is whether the underlying model provider retains, logs, or trains on the audio, transcripts, or tool-call payloads. Default settings on hosted model APIs frequently allow some form of retention; enterprise tenancies typically do not. The voice AI platform usually controls this — but only if the buyer asks.
The questions that catch over-claims
Three questions consistently expose marketing gaps:
- Show me a data-flow diagram for one complete call, including every third party.
- Show me where PCI-relevant data is masked, and what proves it.
- Show me the retention and deletion policy for audio, transcript, and tool-call logs, each separately.
If any of those answers is verbal-only, the deployment is not approvable.
The compliance gates that decide vendor shortlists in BFSI and healthcare
Three gate criteria reliably decide which vendors make it past a regulated buyer's first round. They are not weighted dimensions; failing any one removes the vendor from consideration.
- Data residency that matches the regulatory jurisdiction, demonstrable per call
- Recording and consent handling that survives an audit, including jurisdictional variation
- DPIA (Data Protection Impact Assessment) and DPA (Data Processing Agreement) support that the procurement and legal teams can sign without exception
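The difference between a gate and a weighted dimension can be made concrete with a short sketch. The gate names and vendor answers below are hypothetical placeholders, not a standard scoring scheme; the point is only that a single failed or unanswered gate removes the vendor, however strong the other answers are.

```python
# Hypothetical gate names mirroring the three criteria above.
GATES = ("residency_per_call", "consent_audit_ready", "dpia_dpa_signable")


def shortlist(vendors: dict[str, dict[str, bool]]) -> list[str]:
    """Return vendors that pass every gate; a missing answer counts as a fail."""
    return [
        name
        for name, answers in vendors.items()
        if all(answers.get(gate, False) for gate in GATES)
    ]


# Illustrative data only.
vendors = {
    "vendor_a": {"residency_per_call": True, "consent_audit_ready": True, "dpia_dpa_signable": True},
    "vendor_b": {"residency_per_call": True, "consent_audit_ready": False, "dpia_dpa_signable": True},
    "vendor_c": {"residency_per_call": True, "consent_audit_ready": True},  # DPA answer missing
}
print(shortlist(vendors))
```

Treating an unanswered gate as a failure mirrors the earlier rule that a verbal-only answer is not an answer.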
PII handling on the voice path — the patterns that work
Voice AI handles more sensitive data per turn than almost any other enterprise system. Three patterns recur in defensible implementations.
- Tokenisation at the speech-to-text boundary — sensitive fields are extracted and replaced with tokens before the transcript hits the language model
- Redaction at the storage boundary — transcripts retained for review have sensitive fields removed at write time, not retrospectively
- Just-in-time decryption for the language model — sensitive data is decrypted in the context of a single turn and not retained in conversation state
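As a rough illustration of the first pattern, the sketch below replaces long digit runs in a transcript with opaque tokens before the text would reach a language model. Everything here is an assumption for illustration: the `TokenVault` class is an in-memory stand-in for a real access-controlled vault, the regular expression is a deliberately simple detector, and it assumes the speech-to-text layer has already normalised spoken digits into numerals.

```python
import re
import secrets

# Hypothetical detector: 13 to 19 digits, optionally separated by spaces or hyphens.
SENSITIVE_DIGITS = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


class TokenVault:
    """In-memory stand-in for a real, access-controlled token vault."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def tokenise(self, value: str) -> str:
        token = f"<PAN_{secrets.token_hex(4)}>"  # random, not derived from the value
        self._store[token] = value
        return token

    def detokenise(self, token: str) -> str:
        return self._store[token]


def tokenise_transcript(text: str, vault: TokenVault) -> str:
    """Swap each sensitive digit run for a token before the text leaves the STT boundary."""
    return SENSITIVE_DIGITS.sub(lambda m: vault.tokenise(m.group(0)), text)


vault = TokenVault()
safe_text = tokenise_transcript("my number is 4111 1111 1111 1111 thanks", vault)
print(safe_text)  # the digit run is replaced by a <PAN_...> token
```

Only a downstream system of record with vault access would call `detokenise`; the model sees the token and nothing else.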
PCI on the voice path
Payment card data on a voice call is the most-regulated piece of data the system will handle. The defensible pattern is a dedicated capture surface — DTMF-based card entry routed through a separate, PCI-scoped path — rather than allowing card numbers into the speech-to-text stream at all. Some platforms now support voice-based card capture with full PCI scope reduction; verifying the QSA-attested implementation in detail is non-negotiable before going live.
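A minimal sketch of the pause-and-resume idea, under stated assumptions: `CallAudioGate` and its method names are hypothetical, audio frames are plain bytes, and the returned digit string stands in for a hand-off to a separate PCI-scoped service. While card capture is active, no audio frame is forwarded to speech-to-text and keypad digits are collected on a separate path.

```python
from dataclasses import dataclass, field


@dataclass
class CallAudioGate:
    """Hypothetical gate between the telephony leg and the speech-to-text stream."""

    capturing_card: bool = False
    forwarded_to_stt: list[bytes] = field(default_factory=list)
    _pci_digits: list[str] = field(default_factory=list)

    def start_card_capture(self) -> None:
        self.capturing_card = True

    def end_card_capture(self) -> str:
        """Resume normal audio flow and return the digits for the PCI-scoped path."""
        self.capturing_card = False
        digits = "".join(self._pci_digits)
        self._pci_digits.clear()  # nothing is kept in the gate after hand-off
        return digits

    def on_audio_frame(self, frame: bytes) -> None:
        # Paused: frames are dropped, so the model never hears this window.
        if not self.capturing_card:
            self.forwarded_to_stt.append(frame)

    def on_dtmf(self, digit: str) -> None:
        # In this sketch, keypad input outside a capture window is ignored.
        if self.capturing_card:
            self._pci_digits.append(digit)


gate = CallAudioGate()
gate.on_audio_frame(b"greeting")
gate.start_card_capture()
gate.on_audio_frame(b"caller speaking during capture")
gate.on_dtmf("4")
gate.on_dtmf("2")
card_digits = gate.end_card_capture()
gate.on_audio_frame(b"goodbye")
print(gate.forwarded_to_stt)
```

The same gate should apply to call recording, otherwise the digits stay out of the model but land in storage.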
Recording, consent, and jurisdictional variation
Call recording is the operational backbone of voice AI: without it, the conversation owner has nothing to review and the regulator has nothing to audit. The implementation needs to respect three constraints simultaneously: jurisdictional consent requirements (one-party in some jurisdictions, all-party, often called two-party, in others), the right to erasure under GDPR and UK GDPR, and the retention requirements set by sector regulators. A single global recording rule rarely satisfies all three.
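One way to see why a single global rule falls short is to model the three constraints per jurisdiction. The sketch below is illustrative only: `REGION_A` and `REGION_B` are placeholders, the day counts are invented, and the rule that a retention minimum defers an erasure request is a simplifying assumption that legal counsel would need to confirm for each regime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingPolicy:
    all_party_consent: bool   # True if every party must consent to recording
    min_retention_days: int   # hypothetical sector-regulator minimum
    max_retention_days: int   # hypothetical data-minimisation ceiling


# Placeholder values, not statements of law.
POLICIES = {
    "REGION_A": RecordingPolicy(all_party_consent=False, min_retention_days=0, max_retention_days=90),
    "REGION_B": RecordingPolicy(all_party_consent=True, min_retention_days=365, max_retention_days=730),
}


def may_record(policy: RecordingPolicy, caller_consented: bool) -> bool:
    """Recording is allowed if the caller consented or consent from all parties is not required."""
    return caller_consented or not policy.all_party_consent


def deletion_due(policy: RecordingPolicy, age_days: int, erasure_requested: bool) -> bool:
    """Delete at the ceiling, or earlier on request once any retention minimum has passed."""
    if age_days >= policy.max_retention_days:
        return True
    return erasure_requested and age_days >= policy.min_retention_days


print(may_record(POLICIES["REGION_B"], caller_consented=False))
print(deletion_due(POLICIES["REGION_B"], age_days=100, erasure_requested=True))
```

The same erasure request yields different outcomes in the two placeholder regions, which is the practical argument for resolving policy per call rather than per platform.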
Model and prompt governance
Two governance questions catch the gaps that operational security reviews usually miss. First, who can change a system prompt and what audit trail does that change leave? Second, what happens when the underlying model is updated by the vendor — is there a known evaluation gate, or does the change ship silently? Both questions are increasingly relevant as model providers roll out updates that change behaviour materially. A platform without answers to both is a platform that will, eventually, surface an unexplained behaviour change in production.
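Both questions come down to whether changes leave tamper-evident evidence. As a sketch, assuming a hypothetical `PromptAuditLog` held by the platform, each prompt or model-version change could be appended to a hash-chained log so that any later edit to history is detectable. The field names are illustrative, and a production system would also need access control and external anchoring of the chain.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64


class PromptAuditLog:
    """Append-only, hash-chained record of prompt and model-version changes."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash every field except the stored hash itself, in a stable key order.
        payload = {k: v for k, v in entry.items() if k != "entry_hash"}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def record_change(self, author: str, prompt_text: str, model_version: str) -> dict:
        entry = {
            "author": author,
            "changed_at": datetime.now(timezone.utc).isoformat(),
            "prompt_sha256": hashlib.sha256(prompt_text.encode()).hexdigest(),
            "model_version": model_version,
            "prev_hash": self.entries[-1]["entry_hash"] if self.entries else GENESIS,
        }
        entry["entry_hash"] = self._digest(entry)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev_hash"] != prev or entry["entry_hash"] != self._digest(entry):
                return False
            prev = entry["entry_hash"]
        return True


log = PromptAuditLog()
log.record_change("alice", "You are a helpful billing agent.", "model-v1")
log.record_change("bob", "You are a helpful billing agent. Never read card numbers aloud.", "model-v1")
print(log.verify())
```

Recording the model version alongside the prompt hash means a vendor-side model update shows up as its own entry rather than as an unexplained behaviour change.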
Vendor due diligence questions worth asking
Six questions consistently separate vendors that will pass a regulated buyer's procurement review from those that will not.
- Where exactly is data processed and stored, per call leg, per region?
- Which sub-processors are in the data path, and what is the change-notification process?
- How is the underlying language model isolated from training on customer data?
- What is the SOC 2 / ISO 27001 / PCI scope, and does the attestation cover the specific deployment topology being sold?
- What is the breach-notification SLA and the historical record of incidents?
- What is the exit plan — how is data returned and how is it provably destroyed at contract end?
DPA non-negotiables — the clauses worth holding the line on
Most DPA negotiations conclude on the same handful of clauses. The table below names the ones that materially change risk exposure rather than legal hygiene, with the defensible customer position next to the standard vendor opener.
| Clause | Default vendor position | Defensible customer position |
|---|---|---|
| Sub-processor change | 30 days, no objection right | 30 days + objection right + termination-for-convenience + transition support |
| Model training on customer data | Permitted unless opted out | Prohibited unless opted in |
| Audit rights | Annual, vendor-coordinated, no on-site | Annual on-site OR independent auditor report; for-cause audit on incident |
| Breach notification | From disclosure decision, 72h | From confirmation, 24h first notice, 72h detail |
| Data export on termination | Standard format, no SLA | Named format, 30-day SLA, proof of destruction within 90 days |
| Cross-border transfer mechanism | SCCs by reference | SCCs attached, current version, plus transfer impact assessment |
| Liability cap on data incident | 12 months fees, mutual | Super-cap (2–3x) for data breach and security incidents |
Per-call-leg residency — the diagram every regulated buyer needs
Headline residency is a marketing claim. Per-call-leg residency is a procurement artefact. The diagram should show, for a single live call, the legal entity and jurisdiction processing each of: PSTN ingress, capture, ASR, retrieval, LLM inference, TTS, egress, recording storage, transcript storage, derived analytics.
The most common gap: ASR or LLM inference routed to a US-based provider while the rest of the stack sits in-region. A headline "in-region" claim and an out-of-region inference leg can both be true at once, and the routing is usually disclosed if you ask the right question, but only the per-leg diagram surfaces it cleanly. Make it a contractual obligation that the diagram is kept current at all times, with a change-notification SLA that mirrors the one for sub-processors.
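If the vendor supplies the per-leg diagram as structured data, the check can be automated. The sketch below assumes a hypothetical mapping of call leg to processing jurisdiction and a buyer-defined set of allowed jurisdictions; the leg names follow the list above, and an undisclosed leg is treated as a gap rather than given the benefit of the doubt.

```python
# Call legs taken from the per-leg diagram described above.
REQUIRED_LEGS = [
    "pstn_ingress", "capture", "asr", "retrieval", "llm_inference",
    "tts", "egress", "recording_storage", "transcript_storage", "derived_analytics",
]


def residency_gaps(call_legs: dict[str, str], allowed: set[str]) -> list[str]:
    """List every leg that is undisclosed or processed outside the allowed jurisdictions."""
    gaps = []
    for leg in REQUIRED_LEGS:
        jurisdiction = call_legs.get(leg)
        if jurisdiction is None:
            gaps.append(f"{leg}: undisclosed")
        elif jurisdiction not in allowed:
            gaps.append(f"{leg}: {jurisdiction}")
    return gaps


# Illustrative vendor answer: everything in-region except inference, analytics missing.
vendor_answer = {leg: "UK" for leg in REQUIRED_LEGS if leg != "derived_analytics"}
vendor_answer["llm_inference"] = "US"

print(residency_gaps(vendor_answer, allowed={"UK", "EEA"}))
```

Running the same check whenever the vendor notifies a change turns the change-notification SLA into something that can be verified rather than trusted.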
Key takeaways
- Voice AI security is a data-flow problem, not a model problem.
- Four flows decide approval: audio in transit, audio at rest, transcript/prompt handling, PII in tool calls.
- PCI usually requires pause-and-resume or DTMF capture so the model never hears card digits.
- Data residency claims are routinely overstated — get a per-component, per-call written data-flow.
- Automated-system disclosure is not the same as recording consent; many jurisdictions require both.
Frequently asked questions
- Is voice AI PCI compliant?
- The platform is not — the deployment is. PCI compliance depends on whether card data ever enters the audio stream the model hears. The standard pattern is pause-and-resume or DTMF capture, with the digits routed to a PCI-scoped service the model never sees.
- Is voice AI GDPR compliant?
- GDPR compliance depends on lawful basis, transfer mechanism, retention, and consent — none of which the platform decides on its own. Treat "GDPR-compliant" as a starting position, then walk through the data-flow per use case.
- Where is voice AI data processed?
- It varies by platform and by call. Ask for a written, per-component data-flow: speech-to-text, language-model inference, text-to-speech, recording. "Available in your region" is not the same as "runs in your region."
- Does the model provider retain my call data?
- By default, often yes — many hosted model APIs retain inputs for abuse monitoring or evaluation. Enterprise tenancies typically allow zero retention, but only if the voice AI platform passes the right flags. Confirm in writing.
- What does a defensible recording-consent script look like?
- Two distinct disclosures: that the caller is interacting with an automated system, and that the call is being recorded — with the lawful basis, retention, and opt-out path. Conflating the two is the most common audit finding.
- Do I need a DPIA for a voice AI deployment?
- In UK/EU contexts, almost always yes. Voice AI involves automated processing of personal data, often at scale, often touching special-category data. A DPIA is the cheapest insurance you will buy on the programme.
Terms used in this guide
- Voice AI: software that answers the phone, understands what the caller wants, and takes action, not just a smarter IVR.
- IVR replacement: swaps menus and keypad input for natural conversation and actual resolution.
- DTMF fallback: uses the keypad to capture digits the model is not allowed to hear.
- Voice biometrics: confirms who the caller is by how they speak.
- Real-time transcription: streaming speech-to-text fast enough to act on mid-call.
Lewis Crook: 20 years in enterprise technology, from FTSE 100 voice deployments to over a million AI-handled minutes a month across Asia-Pacific. Buyer, builder, and now working with CX leaders on enterprise voice AI. Writes The Voice AI Brief.