Voice AI evaluation matrix: a vendor-neutral nine-dimension scorecard

Equal weighting on all nine dimensions produces a scorecard nobody trusts. This matrix concentrates 60% of the score on the four dimensions that predict post-launch survival, with explicit scoring criteria at the 1, 3, and 5 levels and the question that catches over-claims on each.

Dimension	Weight	Score 1	Score 3	Score 5	Question that catches over-claims
Integration depth	20%	Read-only against generic connectors; write requires custom engineering.	Read and write against the systems of record under evaluation, with documented auth patterns.	Read, write, idempotent retries, graceful failure handling, full audit, demonstrated against your actual identity provider.	Show a call that wrote to our system of record. Now show what happened when that write failed.
Operating-model fit	15%	Every prompt or intent change requires engineering.	Conversation owner can change intents and guardrails through a controlled editor; engineering still owns deploys.	Conversation owner ships changes in under an hour with diff review, staging, and one-click rollback. No engineering ticket needed.	Have a non-engineer change an intent live and roll it back in the demo.
Observability	15%	Aggregate dashboards; per-call inspection requires logs and engineering.	Per-call view with transcript, intent labels, and tool calls accessible to the conversation owner.	Per-call view including latency budget per step, tool responses, escalation reasons, drift signals, and one-click export to the audit trail.	Open a call from yesterday and tell me, in 30 seconds, why it escalated.
Latency	10%	End-to-end turn latency above 2 seconds under realistic load.	End-to-end turn latency 1.2–1.8 seconds under realistic load.	End-to-end turn latency under 1.0 second under realistic load, with streaming TTS and disciplined critical-path integration design.	Run 20 parallel sessions and show the p95 turn-taking latency, not the mean.
Control surface	10%	Single editor with no diff review, staging, or rollback.	Controlled editor with diff review and staging; rollback documented.	Versioned configuration, diff review, staging environments, one-click rollback, and audit logs that name the change author.	Show me the audit log for the last ten changes, by author and revert path.
Safety and compliance	10%	Generic SOC 2 attestation; data residency unclear; consent handling generic.	Documented residency, redaction, consent variations, and DPIA support.	Per-call leg residency proof, sub-processor change-notification SLA, model isolation guarantee, and an exit-and-destruction plan that survives legal review.	Send me the DPIA template and the data-flow diagram by call leg.
Voice quality	8%	Robotic prosody, no barge-in, poor recovery on disfluency.	Natural prosody, basic barge-in, acceptable recovery on most disfluencies.	Natural prosody, graceful barge-in without losing turn context, accent-robust ASR, and tunable persona.	Run the demo with three regional accents and an interrupting caller.
Telephony reach	7%	Limited SIP, single-tenancy carrier; contact-centre platform integrations are roadmap.	SIP plus one or two major contact-centre platforms in production.	SIP, all major contact-centre platforms, carrier-agnostic, with documented HA and disaster-recovery patterns.	Walk me through a carrier outage scenario and how the deployment survives it.
Commercial model	5%	Per-minute pricing with no resolution alignment; surprise overage clauses.	Per-minute with volume bands; clear minute-floor on escalated calls.	Per-resolved-call pricing with a transparent resolution definition, or a platform fee with predictable variable lines and documented behaviour at 10x volume.	Model cost at 0.5x and 2x my modelled containment — what changes?
Total	100%
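
The weights above combine into a single figure per vendor as a weighted average on the same 1 to 5 scale. A minimal sketch of that arithmetic follows; the dictionary keys, the function name `weighted_score`, and the `example_vendor` scores are illustrative assumptions, not part of the matrix itself.

```python
# Weights from the matrix, in whole percent. The key names are hypothetical
# identifiers chosen for this sketch.
WEIGHTS = {
    "integration_depth": 20,
    "operating_model_fit": 15,
    "observability": 15,
    "latency": 10,
    "control_surface": 10,
    "safety_and_compliance": 10,
    "voice_quality": 8,
    "telephony_reach": 7,
    "commercial_model": 5,
}


def weighted_score(scores):
    """Return the weighted total on the same 1-5 scale as the inputs."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be scored between 1 and 5")
    # Integer percent weights keep the sum exact until the final division.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Invented scores for a hypothetical vendor, for illustration only.
example_vendor = {
    "integration_depth": 5,
    "operating_model_fit": 3,
    "observability": 3,
    "latency": 3,
    "control_surface": 3,
    "safety_and_compliance": 3,
    "voice_quality": 5,
    "telephony_reach": 3,
    "commercial_model": 1,
}

print(weighted_score(example_vendor))  # 3.46
```

Because voice quality carries only 8%, the top score on that dimension moves the total by less than half a point; the same score on integration depth moves it by a full point.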

How to use this matrix

Score each candidate platform independently per dimension, evidence-first.
Reconcile only after all scores are in; document evidence, not just numbers.
Pre-commit a tie-breaker — usually integration depth or operating-model fit.
Treat safety and compliance as a gate criterion in regulated industries: a failure removes the vendor from consideration regardless of total score.
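
The gate and tie-breaker rules above can be sketched as a filter followed by a sort. The matrix does not define what counts as a gate failure, so the threshold `gate_min=3` below is an assumption, as are the function name `shortlist`, the record layout, and the vendor data.

```python
def shortlist(candidates, gate="safety_and_compliance", gate_min=3,
              tie_breaker="integration_depth"):
    """Drop vendors that fail the gate, then rank the remainder.

    Each candidate is a dict with "name", "total" (weighted score) and
    "scores" (per-dimension scores). gate_min is an assumed threshold.
    """
    # Gate first: a failing vendor is removed regardless of its total.
    passed = [c for c in candidates if c["scores"][gate] >= gate_min]
    # Rank by total, using the pre-committed tie-breaker dimension on ties.
    return sorted(
        passed,
        key=lambda c: (c["total"], c["scores"][tie_breaker]),
        reverse=True,
    )


# Invented vendors for illustration; only the two dimensions used are shown.
vendors = [
    {"name": "A", "total": 3.90,
     "scores": {"safety_and_compliance": 2, "integration_depth": 5}},
    {"name": "B", "total": 3.50,
     "scores": {"safety_and_compliance": 4, "integration_depth": 3}},
    {"name": "C", "total": 3.50,
     "scores": {"safety_and_compliance": 3, "integration_depth": 5}},
]

print([v["name"] for v in shortlist(vendors)])  # ['C', 'B']
```

Vendor A has the highest total but is removed by the gate; B and C tie on total, and C ranks first on integration depth. Totals are compared as given here, so rounding them to a fixed precision before ranking is advisable if they come from floating-point arithmetic.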

What to avoid

Equal weighting across all nine dimensions — produces a scorecard nobody trusts.
Voice quality above 10% — the axis is real but rapidly converging across the market.
Scoring "integrations" by counting logos rather than measuring read/write depth.
Deferring the operating-model question to implementation — by then, the choice is locked in.

A note on this matrix

Vendor-neutral by design. No platform is named, ranked, or recommended. The same matrix is in active use by enterprise procurement teams in financial services, insurance, and telco; the weightings here reflect what consistently predicts post-launch survival.
