Verified daily

AI quality you can measure

86.7% pass rate over 30 test conversations. We run a fixed suite against the live Anthropic API and publish the numbers — pass or fail.

Demo data — first live run pending

Pass rate

86.7%

Target: 85.0%

Avg score

82.0%

Partial credit across all checks. Target: 80.0%

Hallucinations blocked

12

Pre-send guard refuses invented order numbers before they reach a customer.

Last run 2026-05-23

Pass rate, last runs

Each point is one full eval-run against the live model. The dotted line marks our public pass-rate commitment.

Chart: pass rate per run, 0% to 100%, target line at 85.0%.

2026-04-25: 73.3% (22/30)
2026-05-02: 76.7% (23/30)
2026-05-09: 80.0% (24/30)
2026-05-16: 80.0% (24/30)
2026-05-20: 83.3% (25/30)
2026-05-23: 86.7% (26/30)

Where the model wins, where it struggles

Same eval-run, sliced two ways. Categories tell us which kinds of conversation still need prompt-engineering work; languages tell us where our tone or terminology drifts.

By scenario category

Lower pass-rate buckets are the next targets for prompt tuning.

Basics: 100.0% (10/10)
FAQ: 100.0% (5/5)
Order tracking: 80.0% (4/5)
Refunds: 80.0% (4/5)
Multilingual: 60.0% (3/5)

By language

We prioritize prompt work on languages with weaker scores until they catch up.

French: 100.0% (3/3)
English: 90.9% (10/11)
German: 84.6% (11/13)
Spanish: 66.7% (2/3)
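Slicing one eval-run two ways amounts to grouping the same per-case results by a different key. The sketch below is illustrative only: `CaseOutcome`, `Bucket`, and `passRateBy` are hypothetical names, not taken from the OrderWise codebase.

```typescript
// Hypothetical per-case result; field names are assumptions for illustration.
interface CaseOutcome {
  category: string;
  language: string;
  passed: boolean;
}

interface Bucket {
  pass: number;
  total: number;
  rate: number; // pass / total, in the range 0..1
}

// Group outcomes by the chosen key and compute a pass rate per bucket.
function passRateBy(
  outcomes: CaseOutcome[],
  key: "category" | "language"
): Record<string, Bucket> {
  const buckets: Record<string, Bucket> = {};
  for (const o of outcomes) {
    const b = buckets[o[key]] ?? { pass: 0, total: 0, rate: 0 };
    b.total += 1;
    if (o.passed) b.pass += 1;
    b.rate = b.pass / b.total;
    buckets[o[key]] = b;
  }
  return buckets;
}
```

Calling `passRateBy(outcomes, "category")` and `passRateBy(outcomes, "language")` on the same array yields the two views, so the bucket totals always sum to the same number of cases.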

Hallucination defence

Inventing an order number is the worst failure mode for a support AI: the customer trusts a fake reference, and the merchant looks incompetent. Our pre-send guard blocks any reply that contains an unmatched order number before it can be sent.

Cumulative

57 attempts blocked

Across every eval-run we've published, this many would-be hallucinations were caught before send.

Coverage

6 runs reviewed

Every snapshot on this page tests the guard against adversarial prompts.

Reliability

100% block rate

The guard is deterministic — it does not depend on the model behaving well.

Source of truth: every reply is regex-scanned against the matched order's `order_name`; any unmatched #1234-shaped mention triggers a refusal. See `src/lib/ai-eval.ts#detectOrderHallucination`.
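A minimal sketch of how such a guard might work, for readers who want the idea without opening the repo. This is not the actual `detectOrderHallucination` implementation: the function name `findUnmatchedOrderMentions`, the `GuardResult` shape, and the assumed order-number pattern (`#` followed by three or more digits) are all illustrative assumptions.

```typescript
// Hypothetical result shape; the real guard may return something different.
interface GuardResult {
  blocked: boolean;
  unmatched: string[];
}

// Assumed pattern: "#" followed by at least three digits, e.g. "#1234".
const ORDER_SHAPE = /#\d{3,}/g;

// Collect every order-number-shaped mention in the reply and flag any that
// is not exactly the matched order's name. If no order is matched (null),
// every mention counts as unmatched.
function findUnmatchedOrderMentions(
  reply: string,
  orderName: string | null
): GuardResult {
  const mentions = reply.match(ORDER_SHAPE) ?? [];
  const unmatched = mentions.filter((m) => m !== orderName);
  return { blocked: unmatched.length > 0, unmatched };
}
```

Because the check is plain string matching on the finished reply, its outcome depends only on the text and the matched order, not on how the model produced that text.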

How we measure AI quality

OrderWise is the only Shopify support app that publishes its eval scores. Here is the suite that drives every number on this page.

  1. Fixed test conversations

    30+ curated cases across English, German, French, and Spanish. Each fixture mirrors a real customer-support scenario — order tracking, refunds, missing context, multi-turn negotiations, and adversarial prompts that try to trick the model into making up an order number.

  2. Live model, not a transcript

    We do not score archived replies. Every run hits the production Anthropic Claude API with the same prompts the merchant inbox uses, so the numbers reflect what real customers actually receive.

  3. Declared expectations, not vibes

    Each case lists what the reply must do — language, order-number references, required tool calls, forbidden phrases, word limits. Checks are deterministic; a run either passes or it doesn't.

  4. Pre-send hallucination guard

    Before any reply ships to a customer, we scan it for order-number-shaped strings that differ from the order matched to the conversation. If the model invents "#9999", we refuse to send. This page tracks how many would-be hallucinations the guard caught.
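The declared expectations in step 3 could be modeled roughly as weighted, deterministic checks. The sketch below is an assumption about the general shape, not the suite's actual code: `Expectation`, `CaseResult`, and `scoreReply` are hypothetical names, and the rules that a case passes only when every check passes and that the score is the weighted share of passing checks are assumed here to illustrate how a pass rate and a partial-credit score can coexist.

```typescript
// Hypothetical expectation: a named, weighted, deterministic predicate.
interface Expectation {
  name: string;
  weight: number;
  check: (reply: string) => boolean;
}

interface CaseResult {
  passed: boolean;  // assumed rule: true only if every check passes
  score: number;    // assumed rule: earned weight / total weight
  failed: string[]; // names of the checks that failed
}

function scoreReply(reply: string, expectations: Expectation[]): CaseResult {
  let earned = 0;
  let total = 0;
  const failed: string[] = [];
  for (const e of expectations) {
    total += e.weight;
    if (e.check(reply)) {
      earned += e.weight;
    } else {
      failed.push(e.name);
    }
  }
  // Assumption: a case with no expectations scores 1 (nothing to fail).
  const score = total === 0 ? 1 : earned / total;
  return { passed: failed.length === 0, score, failed };
}

// Example expectations, invented for illustration.
const exampleExpectations: Expectation[] = [
  { name: "references order", weight: 2, check: (r) => /#1234\b/.test(r) },
  {
    name: "word limit",
    weight: 1,
    check: (r) => r.split(/\s+/).filter(Boolean).length <= 50,
  },
  { name: "no forbidden phrase", weight: 1, check: (r) => !/as an ai/i.test(r) },
];
```

Under these assumptions, a reply that cites the right order and stays under the word limit but opens with "As an AI" would fail the case while still earning 3 of 4 weight points.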

100% transparent

Every fixture, expectation, and weight lives in the public repo. Run the suite yourself with `pnpm eval:ai` against your own Anthropic key.

Stop guessing whether your support AI is actually good

Try OrderWise free for 14 days. No credit card. Cancel anytime.