Voice AI QA rubric: a call-review template the operating model can actually run
- Heads of Ops
- CX directors
A call-review rubric is the cheapest mechanism to make a weekly operating cadence compound. Without it, the deployment improves on whichever dimension the loudest reviewer raises. With it, improvement is directional and visible to a steering committee.
The eight QA dimensions
Score each sampled call on the dimensions below as 1, 3, or 5. If a dimension does not apply to a call, mark it N/A and leave it out of the averages; do not score it 3 by default.
| Dimension | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Intent capture | Wrong intent or no intent | Right intent on second attempt | Right intent on first turn |
| Identity / authentication | No verification, or verification failed | Verification completed with friction | Verification clean, appropriate strength for the action |
| Resolution completeness | No action taken or wrong action | Partial action, customer told what to do next | Action completed end-to-end without escalation |
| Conversational quality | Robotic, interruptive, repetitive | Functional, occasional disfluency mishandled | Natural prosody, graceful barge-in, clean recovery |
| Compliance | Disclosure or consent missed | Disclosure present, consent ambiguous | All required disclosures clean, consent captured per policy |
| Escalation hygiene | Blind transfer, no context | Context partially passed | Full context to the human, named reason, no caller re-explanation |
| Vulnerable-customer routing | Signal missed | Signal detected, routing slow | Signal detected, routed within policy, evidence logged |
| Data handling | PII spoken back unnecessarily, retained in transcript | PII handled correctly, some redaction gaps | PII handled per policy end-to-end, audit clean |
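One way to hold the table above to its own rules is to make a review record that refuses anything other than 1, 3, 5, or N/A. The following is a minimal sketch, not a prescribed schema: the `CallReview` class, the snake_case dimension keys, and the use of `None` for N/A are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical keys for the eight dimensions in the table above.
DIMENSIONS = (
    "intent_capture",
    "identity_authentication",
    "resolution_completeness",
    "conversational_quality",
    "compliance",
    "escalation_hygiene",
    "vulnerable_customer_routing",
    "data_handling",
)
VALID_SCORES = {1, 3, 5}


@dataclass
class CallReview:
    call_id: str
    reviewer: str
    # Assumption: None means N/A. There is deliberately no default score.
    scores: Dict[str, Optional[int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.scores)
        unknown = set(self.scores) - set(DIMENSIONS)
        if missing or unknown:
            raise ValueError(
                f"missing={sorted(missing)} unknown={sorted(unknown)}"
            )
        for dimension, score in self.scores.items():
            if score is not None and score not in VALID_SCORES:
                raise ValueError(f"{dimension}: {score!r} is not 1, 3, 5 or N/A")

    def applicable_mean(self) -> Optional[float]:
        """Average over applicable dimensions only; N/A never counts as 3."""
        values = [s for s in self.scores.values() if s is not None]
        return sum(values) / len(values) if values else None
```

Requiring every dimension to be present, even if only as N/A, forces the reviewer to make an explicit decision about applicability rather than skipping a row.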
Sampling strategy that produces a defensible weekly score
A useful sample is 20 to 40 calls per week, stratified — not random and not vendor-curated. Vendor-curated samples drift toward calls that score well; random samples under-represent the failure modes that matter.
- Half the sample from contained calls, half from escalated calls. Failure modes hide in each pool differently.
- Stratify by intent class (transactional, mixed, complex) so no class is missed.
- Include three to five vulnerable-customer-flagged calls every week; if there are none flagged, that is its own finding.
- Have each reviewer score independently before reconciling. Disagreement at reconciliation is the signal: record the pre-reconciliation spread per dimension rather than overwriting it with the agreed score.
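The sampling rules above can be sketched as a small function. This is an illustration under stated assumptions, not a reference implementation: the record fields (`call_id`, `outcome`, `intent_class`, `vulnerable_flag`), the function name, and the choice to count vulnerable-flagged picks toward their own outcome's half are all hypothetical.

```python
import random
from collections import defaultdict


def draw_weekly_sample(calls, size=30, vulnerable_quota=3, seed=None):
    """Return (sample, findings) for one week's review.

    Assumes each call is a dict with hypothetical keys: call_id,
    outcome ("contained" or "escalated"), intent_class, vulnerable_flag.
    """
    rng = random.Random(seed)
    findings = []

    # Vulnerable-customer-flagged calls first; an empty pool is a finding.
    flagged = [c for c in calls if c["vulnerable_flag"]]
    if not flagged:
        findings.append("No vulnerable-customer-flagged calls this week")
    sample = rng.sample(flagged, min(vulnerable_quota, len(flagged)))
    picked_ids = {c["call_id"] for c in sample}

    # Half contained, half escalated; flagged picks count toward their half.
    for outcome in ("contained", "escalated"):
        target = size // 2 - sum(1 for c in sample if c["outcome"] == outcome)
        by_class = defaultdict(list)
        for c in calls:
            if c["outcome"] == outcome and c["call_id"] not in picked_ids:
                by_class[c["intent_class"]].append(c)
        classes = sorted(by_class)
        for intent in classes:
            rng.shuffle(by_class[intent])

        # Round-robin across intent classes so no class is missed.
        while target > 0 and any(by_class[i] for i in classes):
            for intent in classes:
                if target > 0 and by_class[intent]:
                    sample.append(by_class[intent].pop())
                    target -= 1
        if target > 0:
            findings.append(f"{outcome} pool short by {target} calls")

    return sample, findings
```

Returning the findings alongside the sample keeps gaps, such as a week with no flagged calls, in front of the review instead of silently shrinking the sample.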
How the rubric feeds the operating cadence
The rubric does not live in a spreadsheet nobody opens. It feeds the weekly review directly: the lowest-scoring dimension becomes the top intent / guardrail change for the coming week; the highest-variance dimension becomes the calibration topic for the reviewer team.
- Weekly: aggregate scores per dimension, named lowest-scoring dimension, named highest-variance dimension
- Monthly: trend lines per dimension, calibration session for any dimension where reviewer variance exceeds one point on average
- Quarterly: rubric review — add a dimension, retire one, rewrite a level if scoring guidance has drifted
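A minimal sketch of the weekly aggregation follows. The guide does not define how reviewer variance is measured, so this example assumes one simple reading: for each call and dimension, take the gap between the highest and lowest reviewer score, then average that gap across calls. The function name, the tuple layout, and the `variance_threshold` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def weekly_summary(reviews, variance_threshold=1.0):
    """Summarise independent (pre-reconciliation) scores for one week.

    reviews: iterable of (call_id, reviewer, dimension, score) tuples,
    where score is 1, 3, 5, or None for N/A.
    """
    by_dimension = defaultdict(list)
    by_cell = defaultdict(list)  # (dimension, call_id) -> reviewer scores
    for call_id, _reviewer, dimension, score in reviews:
        if score is None:  # N/A is excluded, never treated as 3
            continue
        by_dimension[dimension].append(score)
        by_cell[(dimension, call_id)].append(score)

    averages = {d: mean(s) for d, s in by_dimension.items()}

    # Assumed disagreement measure: mean of (max - min) per call.
    spreads = defaultdict(list)
    for (dimension, _call_id), scores in by_cell.items():
        if len(scores) >= 2:
            spreads[dimension].append(max(scores) - min(scores))
    disagreement = {d: mean(v) for d, v in spreads.items()}

    return {
        "averages": averages,
        "lowest_scoring": min(averages, key=averages.get) if averages else None,
        "highest_variance": (
            max(disagreement, key=disagreement.get) if disagreement else None
        ),
        "needs_calibration": sorted(
            d for d, v in disagreement.items() if v > variance_threshold
        ),
    }
```

Under these assumptions, `lowest_scoring` names next week's top change, `highest_variance` names the calibration topic, and `needs_calibration` lists the dimensions that cross the one-point threshold used in the monthly review.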
Calibration — keeping reviewer scores honest
Two reviewers can score the same dimension of the same call two or even four points apart. A monthly calibration session (pick five calls, everyone scores independently, reconcile in a room) is the cheapest way to keep the rubric honest. Without it, the rubric becomes a measure of reviewer identity, not call quality.
Pick five calls from the last seven days. Have three reviewers score them independently against the eight dimensions above. The first calibration session is the resulting conversation.
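To turn that first session into an agenda, the independent scores can be compared before anyone enters the room. The sketch below is one possible approach, with assumptions labeled: the input layout, the function name, and the use of the group median as the reference point for each reviewer's offset are illustrative choices, not something the rubric specifies.

```python
from collections import defaultdict
from statistics import mean, median


def calibration_report(scores):
    """Build a discussion agenda and a per-reviewer offset.

    scores: dict mapping (call_id, dimension) -> {reviewer: score},
    with N/A entries omitted.
    """
    agenda = []
    offsets = defaultdict(list)

    for (call_id, dimension), by_reviewer in scores.items():
        values = list(by_reviewer.values())
        if len(values) < 2:
            continue  # nothing to compare
        spread = max(values) - min(values)
        if spread > 0:
            agenda.append((spread, call_id, dimension))
        # Assumption: the group median stands in for the "agreed" score.
        reference = median(values)
        for reviewer, score in by_reviewer.items():
            offsets[reviewer].append(score - reference)

    # Largest disagreements first, then stable ordering by call and dimension.
    agenda.sort(key=lambda item: (-item[0], item[1], item[2]))
    reviewer_offset = {r: mean(v) for r, v in offsets.items()}
    return agenda, reviewer_offset
```

A reviewer whose offset sits consistently above or below zero is scoring more leniently or more harshly than the group, which is the "reviewer identity" effect calibration is meant to remove.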
- Score eight dimensions, 1 / 3 / 5, with explicit guidance per level — no defaulting to 3.
- Sample 20–40 calls weekly, stratified across contained and escalated, never vendor-curated.
- The lowest-scoring dimension becomes next week's top change; the highest-variance dimension is the next calibration topic.
- Calibrate reviewers monthly or the rubric measures reviewer identity, not call quality.
- The rubric belongs to the customer's operating model, not the vendor's.
Frequently asked questions
- How many calls should we review per week?
  Twenty to forty, stratified across contained and escalated calls and across intent classes. Fewer than 20 and the variance dominates the signal; more than 40 and reviewer fatigue degrades scoring quality.
- Should the vendor do the QA?
  No. The QA function belongs to the customer's operating model. The vendor can contribute calls and attend reviews, but cannot own the rubric or the score.
- How is this rubric different from human-agent QA?
  It carries the same conversational and compliance dimensions but adds escalation hygiene, intent capture, and data handling: the dimensions where AI deployments fail in ways human agents do not.
Terms used in this guide
- Voice AI: software that answers the phone, understands what the caller wants, and takes action; not just a smarter IVR.
- Containment rate: the percentage of calls the automation finished on its own.
Lewis Crook: 20 years in enterprise technology, from FTSE 100 voice deployments to over a million AI-handled minutes a month across Asia-Pacific. Buyer, builder, and now working with CX leaders on enterprise voice AI. Writes The Voice AI Brief.