Framework
Voice AI evaluation matrix: a vendor-neutral nine-dimension scorecard
Equal weighting on all nine dimensions produces a scorecard nobody trusts. This matrix puts 60% of the score on the first four dimensions in the table, the ones that predict post-launch survival. Each dimension has explicit scoring criteria at the 1, 3, and 5 levels and a question that catches over-claims.
| Dimension | Weight | Score 1 | Score 3 | Score 5 | Question that catches over-claims |
|---|---|---|---|---|---|
| Integration depth | 20% | Read-only against generic connectors; write requires custom engineering. | Read and write against the systems of record under evaluation, with documented auth patterns. | Read, write, idempotent retries, graceful failure handling, full audit, demonstrated against your actual identity provider. | Show a call that wrote to our system of record. Now show what happened when that write failed. |
| Operating-model fit | 15% | Every prompt or intent change requires engineering. | Conversation owner can change intents and guardrails through a controlled editor; engineering still owns deploys. | Conversation owner ships changes in under an hour with diff review, staging, and one-click rollback. No engineering ticket needed. | Have a non-engineer change an intent live and roll it back in the demo. |
| Observability | 15% | Aggregate dashboards; per-call inspection requires logs and engineering. | Per-call view with transcript, intent labels, and tool calls accessible to the conversation owner. | Per-call view including latency budget per step, tool responses, escalation reasons, drift signals, and one-click export to the audit trail. | Open a call from yesterday and tell me, in 30 seconds, why it escalated. |
| Latency | 10% | End-to-end turn latency above 2 seconds under realistic load. | End-to-end turn latency 1.2–1.8 seconds under realistic load. | End-to-end turn latency under 1.0 second under realistic load, with streaming TTS and disciplined critical-path integration design. | Run 20 parallel sessions and show the p95 turn-taking latency, not the mean. |
| Control surface | 10% | Single editor with no diff review, staging, or rollback. | Controlled editor with diff review and staging; rollback documented. | Versioned configuration, diff review, staging environments, one-click rollback, and audit logs that name the change author. | Show me the audit log for the last ten changes, by author and revert path. |
| Safety and compliance | 10% | Generic SOC 2 attestation; data residency unclear; consent handling generic. | Documented residency, redaction, consent variations, and DPIA support. | Per-call leg residency proof, sub-processor change-notification SLA, model isolation guarantee, and an exit-and-destruction plan that survives legal review. | Send me the DPIA template and the data-flow diagram by call leg. |
| Voice quality | 8% | Robotic prosody, no barge-in, poor recovery on disfluency. | Natural prosody, basic barge-in, acceptable recovery on most disfluencies. | Natural prosody, graceful barge-in without losing turn context, accent-robust ASR, and tunable persona. | Run the demo with three regional accents and an interrupting caller. |
| Telephony reach | 7% | Limited SIP, single-tenancy carrier; contact-centre platform integrations are roadmap. | SIP plus one or two major contact-centre platforms in production. | SIP, all major contact-centre platforms, carrier-agnostic, with documented HA and disaster-recovery patterns. | Walk me through a carrier outage scenario and how the deployment survives it. |
| Commercial model | 5% | Per-minute pricing with no resolution alignment; surprise overage clauses. | Per-minute with volume bands; clear minute-floor on escalated calls. | Per-resolved-call pricing with a transparent resolution definition, or a platform fee with predictable variable lines and documented behaviour at 10x volume. | Model cost at 0.5x and 2x my modelled containment — what changes? |
| Total | 100% | | | | |
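The matrix gives weights and a 1 to 5 scale but does not spell out how they combine. A minimal sketch, assuming a plain weighted average that stays on the same 1 to 5 scale; the dictionary keys, the function name, and the example scores are illustrative, not part of the matrix.

```python
# Weights in percent, copied from the matrix above. Key names are illustrative.
WEIGHTS = {
    "integration_depth": 20,
    "operating_model_fit": 15,
    "observability": 15,
    "latency": 10,
    "control_surface": 10,
    "safety_and_compliance": 10,
    "voice_quality": 8,
    "telephony_reach": 7,
    "commercial_model": 5,
}


def weighted_score(scores: dict) -> float:
    """Combine per-dimension 1-5 scores into one weighted average (assumed method)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the nine dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name}: score {value} is outside the 1-5 scale")
    total_weight = sum(WEIGHTS.values())  # 100 for the weights above
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight


# Made-up scores for a hypothetical vendor: 3 everywhere, except strong
# integration depth and a weak commercial model.
example_scores = {name: 3 for name in WEIGHTS}
example_scores["integration_depth"] = 5
example_scores["commercial_model"] = 1
example_total = weighted_score(example_scores)
```

Keeping the result on the 1 to 5 scale makes it easy to compare a total against the written criteria for each level; multiplying by 20 gives a score out of 100 if that reads better in a procurement deck.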
How to use this matrix
- Score each candidate platform independently per dimension, evidence-first.
- Reconcile only after all scores are in; document evidence, not just numbers.
- Pre-commit a tie-breaker — usually integration depth or operating-model fit.
- Treat safety and compliance as a gate criterion in regulated industries: a failure removes the vendor from consideration regardless of total score.
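The gate and the tie-breaker from the steps above can be sketched as a small ranking routine. The `Candidate` type, the gate threshold of 3, and the tie-breaker order are assumptions for illustration; the matrix itself does not define what counts as a safety and compliance "failure", so agree that threshold before scoring starts.

```python
from dataclasses import dataclass

GATE_DIMENSION = "safety_and_compliance"
GATE_MINIMUM = 3  # assumption: a score below 3 counts as a failure
TIE_BREAKERS = ("integration_depth", "operating_model_fit")  # pre-committed order


@dataclass
class Candidate:
    name: str
    total: float   # weighted total, computed separately
    scores: dict   # dimension name -> reconciled 1-5 score


def shortlist(candidates, regulated=True):
    """Drop gate failures (regulated industries only), then rank by total.

    Ties on the total are broken by the pre-committed dimensions, in order.
    """
    eligible = [
        c for c in candidates
        if not regulated or c.scores[GATE_DIMENSION] >= GATE_MINIMUM
    ]
    return sorted(
        eligible,
        key=lambda c: (c.total, *(c.scores[d] for d in TIE_BREAKERS)),
        reverse=True,
    )
```

With this shape, a vendor with the highest total but a safety and compliance score of 2 never reaches the ranking in a regulated evaluation, which is the behaviour the gate criterion describes.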
What to avoid
- Equal weighting across all nine dimensions — produces a scorecard nobody trusts.
- Voice quality above 10% — the axis is real but rapidly converging across the market.
- Scoring "integrations" by counting logos rather than measuring read/write depth.
- Deferring the operating-model question to implementation — by then, the choice is locked in.
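Teams that adapt the weights can check the first two warnings mechanically before any scoring begins. A minimal sketch, assuming weights are held as whole-number percentages keyed by dimension name; the function name, the key `voice_quality`, and the message wording are hypothetical.

```python
def check_weights(weights, voice_key="voice_quality", voice_cap=10):
    """Return a list of problems found; an empty list means the weights pass."""
    problems = []
    total = sum(weights.values())
    if total != 100:
        problems.append(f"weights sum to {total}, expected 100")
    if len(set(weights.values())) == 1:
        problems.append("all dimensions are weighted equally")
    if weights.get(voice_key, 0) > voice_cap:
        problems.append(f"{voice_key} is weighted above {voice_cap}%")
    return problems
```

The remaining two warnings, counting logos and deferring the operating-model question, are process issues that a script cannot catch.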
A note on this matrix
Vendor-neutral by design. No platform is named, ranked, or recommended. The same matrix is in active use by enterprise procurement teams in financial services, insurance, and telco; the weightings here reflect what consistently predicts post-launch survival.