86.7% pass rate over 30 test conversations. We run a fixed suite against the live Anthropic API and publish the numbers — pass or fail.
Demo data — first live run pending
Pass rate
86.7%
Target: 85.0%
Avg score
82.0%
Partial credit across all checks. Target: 80.0%
Hallucinations blocked
12
Pre-send guard refuses invented order numbers before they reach a customer.
Last run 2026-05-23
Each point is one full eval-run against the live model. The dotted line marks our public pass-rate commitment.
Same eval-run, sliced two ways. Categories tell us which kinds of conversation still need prompt-engineering work; languages tell us where our tone or terminology drifts.
Lower pass-rate buckets are the next targets for prompt tuning.
We prioritize prompt work on languages with weaker scores until they catch up.
Inventing an order number is the worst failure mode for a support AI: the customer trusts a fake reference, and the merchant looks incompetent. Our pre-send guard refuses any reply that contains one, so it never reaches the customer.
Cumulative
57 attempts blocked
Across every eval-run we've published, this many would-be hallucinations were caught before send.
Coverage
6 runs reviewed
Every snapshot on this page tests the guard against adversarial prompts.
Reliability
100% block rate
The guard is deterministic — it does not depend on the model behaving well.
Source-of-truth: every reply is regex-scanned against the matched order's order_name; any unmatched #1234-shape mention triggers a refusal. See `src/lib/ai-eval.ts#detectOrderHallucination`.
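As a rough illustration of the scan described above, here is a minimal sketch. It is not the code in `src/lib/ai-eval.ts`: the result shape, the exact regex (`#` followed by three or more digits), and the handling of a missing order are all assumptions made for the example.

```typescript
// Hypothetical sketch only; the real detectOrderHallucination may differ.
interface HallucinationResult {
  blocked: boolean;     // true when the reply must not be sent
  unmatched: string[];  // order-shaped mentions that are not the matched order
}

// Assumed "#1234 shape": a hash sign followed by three or more digits.
const ORDER_SHAPE = /#\d{3,}/g;

function detectOrderHallucination(
  reply: string,
  orderName: string | null, // e.g. "#1234", or null when no order was matched
): HallucinationResult {
  const mentions = reply.match(ORDER_SHAPE) ?? [];
  // Keep each mention that differs from the matched order, de-duplicated.
  const unmatched = mentions.filter(
    (m, i, all) => m !== orderName && all.indexOf(m) === i,
  );
  return { blocked: unmatched.length > 0, unmatched };
}
```

Under these assumptions, a reply mentioning only `#1234` for an order named `#1234` is allowed, while a reply mentioning `#9999` is blocked. Because the decision is a plain string comparison, it does not depend on the model's behavior.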
OrderWise is the only Shopify support app that publishes its eval scores. Here is the suite that drives every number on this page.
30+ curated cases across English, German, French, and Spanish. Each fixture mirrors a real customer-support scenario — order tracking, refunds, missing context, multi-turn negotiations, and adversarial prompts that try to trick the model into making up an order number.
We do not score archived replies. Every run hits the production Anthropic Claude API with the same prompts the merchant inbox uses, so the numbers reflect what real customers actually receive.
Each case lists what the reply must do: language, order-number references, required tool calls, forbidden phrases, word limits. Checks are deterministic; each check either passes or it doesn't.
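To make the idea of deterministic checks concrete, here is a minimal sketch of how a case's expectations could be evaluated and turned into a weighted score. The field names, the weights, and the rule that a case passes only when every check passes are assumptions for illustration, not the repo's actual schema. The language check is left out because it would need a language detector.

```typescript
// Hypothetical fixture and reply shapes; names and weights are assumptions.
interface EvalCase {
  mustMentionOrder?: string;    // e.g. "#1234"
  requiredToolCalls?: string[]; // tool names the reply must have invoked
  forbiddenPhrases?: string[];  // matched case-insensitively
  maxWords?: number;
}

interface ModelReply {
  text: string;
  toolCalls: string[];
}

interface CheckResult {
  name: string;
  passed: boolean;
  weight: number;
}

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

function runChecks(c: EvalCase, r: ModelReply): CheckResult[] {
  const results: CheckResult[] = [];
  const lower = r.text.toLowerCase();
  if (c.mustMentionOrder !== undefined) {
    const passed = r.text.indexOf(c.mustMentionOrder) !== -1;
    results.push({ name: "order-reference", passed, weight: 2 });
  }
  if (c.requiredToolCalls !== undefined) {
    const passed = c.requiredToolCalls.every((t) => r.toolCalls.indexOf(t) !== -1);
    results.push({ name: "tool-calls", passed, weight: 2 });
  }
  if (c.forbiddenPhrases !== undefined) {
    const passed = c.forbiddenPhrases.every((p) => lower.indexOf(p.toLowerCase()) === -1);
    results.push({ name: "forbidden-phrases", passed, weight: 1 });
  }
  if (c.maxWords !== undefined) {
    results.push({ name: "word-limit", passed: countWords(r.text) <= c.maxWords, weight: 1 });
  }
  return results;
}

// Assumed scoring: a case passes only if every check passes;
// the score is the weighted fraction of checks that passed.
function scoreCase(results: CheckResult[]): { passed: boolean; score: number } {
  let total = 0;
  let earned = 0;
  for (const r of results) {
    total += r.weight;
    if (r.passed) earned += r.weight;
  }
  return {
    passed: results.every((r) => r.passed),
    score: total === 0 ? 1 : earned / total,
  };
}
```

In this sketch the pass/fail flag would feed a pass rate, and the weighted score would feed a partial-credit average. Both are pure functions of the reply, so the same reply always gets the same result.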
Before any reply ships to a customer, we scan it for order-number-shaped strings that don't correspond to the matched order. If the model invents "#9999", we refuse to send. This page tracks how many would-be hallucinations the guard caught.
100% transparent
Every fixture, expectation, and weight lives in the public repo. Run the suite yourself with `pnpm eval:ai` against your own Anthropic key.
Try OrderWise free for 14 days. No credit card. Cancel anytime.