
Voice AI QA rubric: a call-review template the operating model can actually run

  • Heads of Ops
  • CX directors
By Lewis Crook
Bottom line up front

A call-review rubric is the cheapest mechanism to make a weekly operating cadence compound. Without it, the deployment improves on whichever dimension the loudest reviewer raises. With it, improvement is directional and visible to a steering committee.

The eight QA dimensions

Score each sampled call on the dimensions below, using only 1, 3, or 5. A dimension is either applicable to the call or marked N/A; do not score it 3 by default.

Voice AI call-review rubric
| Dimension | Score 1 | Score 3 | Score 5 |
| --- | --- | --- | --- |
| Intent capture | Wrong intent or no intent | Right intent on second attempt | Right intent on first turn |
| Identity / authentication | No verification, or verification failed | Verification completed with friction | Verification clean, appropriate strength for the action |
| Resolution completeness | No action taken or wrong action | Partial action, customer told what to do next | Action completed end-to-end without escalation |
| Conversational quality | Robotic, interruptive, repetitive | Functional, occasional disfluency mishandled | Natural prosody, graceful barge-in, clean recovery |
| Compliance | Disclosure or consent missed | Disclosure present, consent ambiguous | All required disclosures clean, consent captured per policy |
| Escalation hygiene | Blind transfer, no context | Context partially passed | Full context to the human, named reason, no caller re-explanation |
| Vulnerable-customer routing | Signal missed | Signal detected, routing slow | Signal detected, routed within policy, evidence logged |
| Data handling | PII spoken back unnecessarily, retained in transcript | PII handled correctly, some redaction gaps | PII handled per policy end-to-end, audit clean |
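One way to make the "1 / 3 / 5 or N/A, never a default 3" rule enforceable is to validate each scorecard before it is aggregated. The sketch below is illustrative only: the dimension keys, the use of `None` for N/A, and the function names are assumptions, not part of the rubric itself.

```python
from statistics import mean
from typing import Optional

# Hypothetical machine-readable keys for the eight rubric dimensions.
DIMENSIONS = (
    "intent_capture",
    "identity_authentication",
    "resolution_completeness",
    "conversational_quality",
    "compliance",
    "escalation_hygiene",
    "vulnerable_customer_routing",
    "data_handling",
)
ALLOWED_SCORES = {1, 3, 5}


def validate_scorecard(scorecard: dict[str, Optional[int]]) -> None:
    """Raise ValueError unless every dimension is scored 1, 3, 5, or None (N/A)."""
    missing = [d for d in DIMENSIONS if d not in scorecard]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    for dim in DIMENSIONS:
        score = scorecard[dim]
        if score is not None and score not in ALLOWED_SCORES:
            raise ValueError(f"{dim}: expected 1, 3, 5 or None, got {score!r}")


def call_average(scorecard: dict[str, Optional[int]]) -> Optional[float]:
    """Mean over applicable dimensions; N/A is excluded rather than counted as 3."""
    validate_scorecard(scorecard)
    applicable = [scorecard[d] for d in DIMENSIONS if scorecard[d] is not None]
    return mean(applicable) if applicable else None
```

Forcing reviewers to state N/A explicitly (here, `None`) keeps a dimension that did not apply from quietly pulling the call average toward the middle.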

Sampling strategy that produces a defensible weekly score

A useful sample is 20 to 40 calls per week, stratified — not random and not vendor-curated. Vendor-curated samples drift toward calls that score well; random samples under-represent the failure modes that matter.

  1. Half the sample from contained calls, half from escalated calls. Failure modes hide in each pool differently.
  2. Stratify by intent class (transactional, mixed, complex) so no class is missed.
  3. Include three to five vulnerable-customer-flagged calls every week; if there are none flagged, that is its own finding.
  4. Score independently per reviewer before reconciling. Disagreement at reconciliation is the signal: record where reviewers differed, not just the reconciled score.
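The sampling rules above can be sketched as a small selection routine. This is a minimal illustration, not a prescribed implementation: the call record fields (`id`, `outcome`, `intent_class`, `vulnerable_flag`), the default sample size of 30, and the quota of three flagged calls are all assumptions chosen to match the ranges in the text.

```python
import random
from collections import defaultdict


def stratified_sample(calls, sample_size=30, vulnerable_quota=3, seed=None):
    """Pick a weekly review sample; returns (sample, findings).

    calls: list of dicts with hypothetical keys "id", "outcome"
    ("contained" or "escalated"), "intent_class", and "vulnerable_flag".
    """
    rng = random.Random(seed)
    findings = []

    # Rule 3: reserve vulnerable-customer-flagged calls first.
    flagged = [c for c in calls if c["vulnerable_flag"]]
    if not flagged:
        findings.append("no vulnerable-customer-flagged calls this week")
    picked = rng.sample(flagged, min(vulnerable_quota, len(flagged)))
    picked_ids = {c["id"] for c in picked}

    # Rule 1: half the sample from each outcome pool.
    for outcome in ("contained", "escalated"):
        already = sum(1 for c in picked if c["outcome"] == outcome)
        target = sample_size // 2 - already

        # Rule 2: bucket the pool by intent class, shuffled within each bucket.
        buckets = defaultdict(list)
        for c in calls:
            if c["outcome"] == outcome and c["id"] not in picked_ids:
                buckets[c["intent_class"]].append(c)
        for bucket in buckets.values():
            rng.shuffle(bucket)

        # Round-robin across intent classes so no class is skipped.
        while target > 0 and any(buckets.values()):
            for intent_class in list(buckets):
                if target > 0 and buckets[intent_class]:
                    call = buckets[intent_class].pop()
                    picked.append(call)
                    picked_ids.add(call["id"])
                    target -= 1
        if target > 0:
            findings.append(f"{outcome} pool short by {target} calls")

    return picked, findings
```

Returning the findings next to the sample keeps gaps visible, such as a week with no flagged calls or a pool too small to fill its half, instead of silently backfilling from the other pool.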

How the rubric feeds the operating cadence

The rubric does not live in a spreadsheet nobody opens. It feeds the weekly review directly: the lowest-scoring dimension becomes the top intent / guardrail change for the coming week; the highest-variance dimension becomes the calibration topic for the reviewer team.

  • Weekly: aggregate scores per dimension, named lowest-scoring dimension, named highest-variance dimension
  • Monthly: trend lines per dimension, calibration session for any dimension where reviewer variance exceeds one point on average
  • Quarterly: rubric review — add a dimension, retire one, rewrite a level if scoring guidance has drifted
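The weekly roll-up can be computed directly from the independent, pre-reconciliation scores. The sketch below assumes each score is stored as a `(call_id, reviewer, dimension, score)` row with `None` for N/A, and it interprets "variance" as the population variance of reviewer scores on the same call, averaged across calls. Both the row layout and that definition are assumptions; the guide does not fix a formula.

```python
from collections import defaultdict
from statistics import mean, pvariance


def weekly_summary(rows):
    """Aggregate one week of scores.

    rows: iterable of (call_id, reviewer, dimension, score); score None = N/A.
    """
    by_dim = defaultdict(list)                            # dimension -> all scores
    by_dim_call = defaultdict(lambda: defaultdict(list))  # dimension -> call -> scores
    for call_id, _reviewer, dimension, score in rows:
        if score is None:
            continue  # N/A never counts toward a mean or a variance
        by_dim[dimension].append(score)
        by_dim_call[dimension][call_id].append(score)

    means = {dim: mean(scores) for dim, scores in by_dim.items()}

    # Reviewer variance: spread between reviewers on the same call,
    # averaged over every call scored by at least two reviewers.
    variances = {}
    for dim, per_call in by_dim_call.items():
        spreads = [pvariance(s) for s in per_call.values() if len(s) >= 2]
        if spreads:
            variances[dim] = mean(spreads)

    return {
        "means": means,
        "reviewer_variance": variances,
        "lowest_scoring": min(means, key=means.get) if means else None,
        "highest_variance": max(variances, key=variances.get) if variances else None,
    }
```

Measuring variance within each call, rather than across all calls, separates reviewer disagreement from genuine differences in call quality, which is the distinction the calibration topic depends on.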

Calibration — keeping reviewer scores honest

Two reviewers can score the same call three points apart. A calibration session every month — pick five calls, everyone scores independently, reconcile in a room — is the cheapest way to keep the rubric honest. Without it, the rubric becomes a measure of reviewer identity, not call quality.
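To give the reconciliation session an agenda, it can help to list the specific call and dimension pairs where independent scores diverged most. The following is a minimal sketch under stated assumptions: the nested `scores` layout, the function name, and the default `min_gap` of 2 (one step on the 1 / 3 / 5 scale) are illustrative choices, not something the guide prescribes.

```python
def calibration_agenda(scores, min_gap=2):
    """List the disagreements worth discussing, widest gap first.

    scores: {call_id: {reviewer: {dimension: score or None}}}, None = N/A.
    Returns a list of (call_id, dimension, gap, {reviewer: score}) tuples.
    """
    agenda = []
    for call_id, reviewers in scores.items():
        dimensions = {dim for card in reviewers.values() for dim in card}
        for dim in sorted(dimensions):
            # Keep only reviewers who actually scored this dimension.
            given = {
                reviewer: card[dim]
                for reviewer, card in reviewers.items()
                if card.get(dim) is not None
            }
            if len(given) < 2:
                continue  # nothing to compare
            gap = max(given.values()) - min(given.values())
            if gap >= min_gap:
                agenda.append((call_id, dim, gap, given))
    agenda.sort(key=lambda item: item[2], reverse=True)
    return agenda
```

This sketch ignores the case where one reviewer scores a dimension and another marks it N/A; a team may reasonably want to treat that as a disagreement too.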

Do this on Monday

Pick five calls from the last seven days. Have three reviewers score them independently against the eight dimensions above. The first calibration session is the resulting conversation.

Key takeaways
  • Score eight dimensions, 1 / 3 / 5, with explicit guidance per level — no defaulting to 3.
  • Sample 20–40 calls weekly, stratified across contained and escalated, never vendor-curated.
  • The lowest-scoring dimension becomes next week's top change; the highest-variance dimension is the next calibration topic.
  • Calibrate reviewers monthly or the rubric measures reviewer identity, not call quality.
  • The rubric belongs to the customer's operating model, not the vendor's.

Frequently asked questions

How many calls should we review per week?
Twenty to forty, stratified across contained and escalated calls and across intent classes. Fewer than 20 and the variance dominates the signal; more than 40 and reviewer fatigue degrades scoring quality.
Should the vendor do the QA?
No. The QA function belongs to the customer's operating model. The vendor can contribute calls and attend reviews, but cannot own the rubric or the score.
How is this rubric different from human-agent QA?
It carries the same conversational and compliance dimensions but adds escalation hygiene, intent capture, and data handling — the dimensions where AI deployments fail in ways human agents do not.

Terms used in this guide

  • Voice AI: software that answers the phone, understands what the caller wants, and takes action; it is not just a smarter IVR.
  • Containment rate: the percentage of calls the automation finished on its own.
Last reviewed: 2026-06-15. This guide is updated when production patterns shift; see the corrections page to flag anything that no longer matches reality.
About the author
Lewis Crook
Practitioner writer on enterprise voice AI

Lewis Crook has 20 years in enterprise technology, from FTSE 100 voice deployments to over a million AI-handled minutes a month across Asia-Pacific. Buyer, builder, and now working with CX leaders on enterprise voice AI. Writes The Voice AI Brief.
