Voice AI QA rubric: a call-review template the operating model can actually run

Heads of Ops
CX directors

By Lewis Crook. Published June 15, 2026

Bottom line up front

A call-review rubric is the cheapest mechanism to make a weekly operating cadence compound. Without it, the deployment improves on whichever dimension the loudest reviewer raises. With it, improvement is directional and visible to a steering committee.

The eight QA dimensions

Score each sampled call on the eight dimensions below, using only the values 1, 3, or 5. If a dimension does not apply to the call, mark it N/A; do not score it 3 by default.

Voice AI call-review rubric

Dimension	Score 1	Score 3	Score 5
Intent capture	Wrong intent or no intent	Right intent on second attempt	Right intent on first turn
Identity / authentication	No verification, or verification failed	Verification completed with friction	Verification clean, appropriate strength for the action
Resolution completeness	No action taken or wrong action	Partial action, customer told what to do next	Action completed end-to-end without escalation
Conversational quality	Robotic, interruptive, repetitive	Functional, occasional disfluency mishandled	Natural prosody, graceful barge-in, clean recovery
Compliance	Disclosure or consent missed	Disclosure present, consent ambiguous	All required disclosures clean, consent captured per policy
Escalation hygiene	Blind transfer, no context	Context partially passed	Full context to the human, named reason, no caller re-explanation
Vulnerable-customer routing	Signal missed	Signal detected, routing slow	Signal detected, routed within policy, evidence logged
Data handling	PII spoken back unnecessarily, retained in transcript	PII handled correctly, some redaction gaps	PII handled per policy end-to-end, audit clean
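One way to make the "1 / 3 / 5 or N/A" rule enforceable is to validate each scorecard before it enters the weekly numbers. The sketch below is illustrative only: the dimension keys, the use of `None` for N/A, and the function names are assumptions made for this example, not part of any particular QA tool.

```python
from typing import Dict, Optional

# Hypothetical keys for the eight rubric dimensions in the table above.
DIMENSIONS = (
    "intent_capture",
    "identity_authentication",
    "resolution_completeness",
    "conversational_quality",
    "compliance",
    "escalation_hygiene",
    "vulnerable_customer_routing",
    "data_handling",
)
ALLOWED_SCORES = (1, 3, 5)

Scorecard = Dict[str, Optional[int]]  # None stands for N/A (assumption)


def validate_scorecard(scorecard: Scorecard) -> None:
    """Raise ValueError unless every dimension is scored 1, 3, 5, or None (N/A)."""
    missing = [d for d in DIMENSIONS if d not in scorecard]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dimension in DIMENSIONS:
        score = scorecard[dimension]
        if score is not None and score not in ALLOWED_SCORES:
            raise ValueError(
                f"{dimension}: expected 1, 3, 5 or None (N/A), got {score!r}"
            )


def applicable_scores(scorecard: Scorecard) -> Dict[str, int]:
    """Return only the dimensions that applied to the call (N/A dropped)."""
    validate_scorecard(scorecard)
    return {d: scorecard[d] for d in DIMENSIONS if scorecard[d] is not None}
```

Keeping N/A as a distinct value, rather than a silent 3, means a dimension that rarely applies cannot drift toward the middle of the scale in the aggregate.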

Sampling strategy that produces a defensible weekly score

A useful sample is 20 to 40 calls per week, stratified — not random and not vendor-curated. Vendor-curated samples drift toward calls that score well; random samples under-represent the failure modes that matter.

Half the sample from contained calls, half from escalated calls. Failure modes hide in each pool differently.
Stratify by intent class (transactional, mixed, complex) so no class is missed.
Include three to five vulnerable-customer-flagged calls every week; if there are none flagged, that is its own finding.
Score independently per reviewer before reconciling. Disagreement at reconciliation is the signal — score it.
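The sampling rules above can be sketched as a small selection routine. This is a minimal illustration under stated assumptions: each call is a dict with hypothetical fields `call_id`, `outcome` ("contained" or "escalated"), `intent_class`, and `vulnerable_flag`; the defaults of 30 calls and 4 vulnerable-flagged calls are just values inside the ranges given above; and vulnerable-flagged calls are counted toward their own outcome pool, which is one possible reading of the rules.

```python
import random
from collections import defaultdict


def draw_weekly_sample(calls, sample_size=30, vulnerable_quota=4, seed=None):
    """Pick a stratified weekly review sample from a list of call dicts."""
    rng = random.Random(seed)

    # 1. Vulnerable-customer-flagged calls go in first, up to the quota.
    vulnerable = [c for c in calls if c["vulnerable_flag"]]
    sample = rng.sample(vulnerable, min(vulnerable_quota, len(vulnerable)))
    chosen_ids = {c["call_id"] for c in sample}

    # 2. Half the sample per outcome pool; calls already picked count
    #    toward their own pool.
    half = sample_size // 2
    targets = {"contained": half, "escalated": sample_size - half}
    for call in sample:
        targets[call["outcome"]] = max(0, targets[call["outcome"]] - 1)

    # 3. Within each pool, rotate through intent classes so none is missed.
    for outcome, target in targets.items():
        by_intent = defaultdict(list)
        for call in calls:
            if call["outcome"] == outcome and call["call_id"] not in chosen_ids:
                by_intent[call["intent_class"]].append(call)
        for group in by_intent.values():
            rng.shuffle(group)
        while target > 0 and any(by_intent.values()):
            for intent_class in sorted(by_intent):
                if target > 0 and by_intent[intent_class]:
                    sample.append(by_intent[intent_class].pop())
                    target -= 1
    return sample
```

If the week has fewer flagged calls than the quota, the routine simply returns fewer of them; a reviewer lead would still need to record that shortfall as a finding, as the list above says.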

How the rubric feeds the operating cadence

The rubric does not live in a spreadsheet nobody opens. It feeds the weekly review directly: the lowest-scoring dimension becomes the top intent / guardrail change for the coming week; the highest-variance dimension becomes the calibration topic for the reviewer team.

Weekly: aggregate scores per dimension, named lowest-scoring dimension, named highest-variance dimension
Monthly: trend lines per dimension, plus a calibration session for any dimension where reviewers' scores on the same call differ by more than one point on average
Quarterly: rubric review — add a dimension, retire one, rewrite a level if scoring guidance has drifted
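The weekly roll-up can be sketched in a few lines. The guide does not define how "variance" is measured, so this example assumes one simple reading: for each dimension, take the gap between the highest and lowest reviewer score on each call, then average those gaps. The record layout `(call_id, reviewer, dimension, score)` and the function name are likewise assumptions for illustration, with `None` standing for N/A.

```python
from collections import defaultdict
from statistics import mean


def weekly_summary(records):
    """Summarise one week of scores given (call_id, reviewer, dimension, score)."""
    scores_by_dim = defaultdict(list)
    scores_by_dim_call = defaultdict(list)
    for call_id, _reviewer, dimension, score in records:
        if score is None:  # N/A: excluded from every aggregate
            continue
        scores_by_dim[dimension].append(score)
        scores_by_dim_call[(dimension, call_id)].append(score)

    mean_by_dim = {d: mean(s) for d, s in scores_by_dim.items()}

    # Reviewer gap per call = highest score minus lowest score on that call.
    gaps_by_dim = defaultdict(list)
    for (dimension, _call_id), scores in scores_by_dim_call.items():
        if len(scores) >= 2:
            gaps_by_dim[dimension].append(max(scores) - min(scores))
    gap_by_dim = {d: mean(g) for d, g in gaps_by_dim.items()}

    return {
        "mean_by_dimension": mean_by_dim,
        "lowest_scoring": min(mean_by_dim, key=mean_by_dim.get) if mean_by_dim else None,
        "highest_variance": max(gap_by_dim, key=gap_by_dim.get) if gap_by_dim else None,
        # Dimensions whose average reviewer gap exceeds one point.
        "needs_calibration": sorted(d for d, g in gap_by_dim.items() if g > 1),
    }
```

The two named outputs map directly onto the weekly review agenda: `lowest_scoring` is the candidate for next week's top change, and `highest_variance` is the candidate calibration topic.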

Calibration — keeping reviewer scores honest

Two reviewers can score the same call three points apart. A calibration session every month — pick five calls, everyone scores independently, reconcile in a room — is the cheapest way to keep the rubric honest. Without it, the rubric becomes a measure of reviewer identity, not call quality.
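A calibration session is easier to run with a rough picture of who scores high and who scores low. The sketch below assumes the same hypothetical record layout `(call_id, reviewer, dimension, score)` with `None` for N/A, and measures each reviewer's average signed distance from the panel mean on the calls they scored together. It is one illustrative measure of "reviewer identity", not a prescribed method.

```python
from collections import defaultdict
from statistics import mean


def reviewer_offsets(records):
    """Average signed gap between each reviewer's score and the panel mean."""
    # (call_id, dimension) -> {reviewer: score}
    panel = defaultdict(dict)
    for call_id, reviewer, dimension, score in records:
        if score is not None:  # skip N/A
            panel[(call_id, dimension)][reviewer] = score

    deviations = defaultdict(list)
    for scores in panel.values():
        if len(scores) < 2:  # nothing to compare against
            continue
        panel_mean = mean(list(scores.values()))
        for reviewer, score in scores.items():
            deviations[reviewer].append(score - panel_mean)

    # Positive offset: scores higher than the panel; negative: lower.
    return {reviewer: mean(d) for reviewer, d in deviations.items()}
```

An offset near zero for every reviewer suggests the rubric guidance is being read the same way; a persistent positive or negative offset is a prompt for discussion in the room, not a verdict on the reviewer.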

Do this on Monday

Pick five calls from the last seven days. Have three reviewers score them independently against the eight dimensions above. The conversation that follows is your first calibration session.

Key takeaways

Score eight dimensions, 1 / 3 / 5, with explicit guidance per level — no defaulting to 3.
Sample 20–40 calls weekly, stratified across contained and escalated, never vendor-curated.
The lowest-scoring dimension becomes next week's top change; the highest-variance dimension is the next calibration topic.
Calibrate reviewers monthly or the rubric measures reviewer identity, not call quality.
The rubric belongs to the customer's operating model, not the vendor's.

Frequently asked questions

How many calls should we review per week?

Twenty to forty, stratified across contained and escalated calls and across intent classes. Below 20, variance dominates the signal; above 40, reviewer fatigue degrades scoring quality.

Should the vendor do the QA?

No. The QA function belongs to the customer's operating model. The vendor can contribute calls and attend reviews, but cannot own the rubric or the score.

How is this rubric different from human-agent QA?

It carries the same conversational and compliance dimensions but adds escalation hygiene, intent capture, and data handling: the dimensions where AI deployments fail in ways human agents do not.

Terms used in this guide

Voice AI: software that answers the phone, understands what the caller wants, and takes action; not just a smarter IVR.
Containment rate: the percentage of calls the automation finished on its own.

Last reviewed: 2026-06-15. This guide is updated when production patterns shift; see the corrections page to flag anything that no longer matches reality.

About the author

Lewis Crook

Practitioner writer on enterprise voice AI

Lewis Crook has 20 years in enterprise technology, from FTSE 100 voice deployments to over a million AI-handled minutes a month across Asia-Pacific. Buyer, builder, and now working with CX leaders on enterprise voice AI. Writes The Voice AI Brief.
