7 min read · by Senni

Why we publish our AI eval scores (and most vendors won't)

Tags: ai quality, evaluation, trust, shopify

A while back I watched a competitor's chatbot tell a customer their refund had been "processed and will appear in 3-5 business days." No refund had been issued. There was no refund. The model had pattern-matched its way to a sentence that sounded like every refund email it had ever seen, and it said it with total confidence to a real person who had paid real money.

That is the failure mode that should keep any store owner up at night. A chatbot inventing a tracking number, promising a delivery date that doesn't exist, or confirming a refund that was never created isn't a typo. It's the AI lying to your customer in your name, and you find out when the chargeback arrives.

So when an AI support vendor tells you "we resolve 60% of tickets," my first question is: prove it. My second is: what happens in the other 40%, and how often does the AI make something up?

The resolution-rate number is almost always unfalsifiable

Pick any AI support app in the Shopify App Store and you'll find a headline percentage. 60% resolution. 70%. "Up to 80%." These numbers share a useful property for the vendor: you cannot check them.

"Resolution" is undefined. Did the customer get a correct answer, or did they just stop replying? A ticket where the AI says something wrong and the frustrated customer gives up looks identical, in the dashboard, to a ticket the AI actually solved. Both get counted as "deflected." The vendor picks the definition that flatters the number, and there's no fixture, no test set, no log you can audit.

I'm not accusing anyone of fraud. I'm saying the metric is structurally meaningless. A claim you can't test isn't a quality signal. It's marketing copy with a percent sign.

When the AI is going to talk to your customers about their orders and their money, "trust me, it's good AI" is not an acceptable answer. You'd never accept it from a payment processor. You shouldn't accept it from the thing writing replies under your brand.

What we actually measure

OrderWise runs a fixed eval. There's a file of test conversations — real customer messages paired with a known order context — and every one of them declares exactly what a correct reply has to do. We run our live model against that set and score each reply against its declared checks.

The conversations don't change between runs, which is the whole point. If the score moves, it's because the model or the prompt changed, not because we cherry-picked easier questions this week. A few of the dimensions we score:

  • Order context used. If the fixture supplies a matched order, the reply has to actually reference it — the order number, its status, the real shipping data. A reply that ignores the order it was handed and answers with a generic "shipping takes 3-5 days" fails, even if it reads nicely.
  • Correct language. A German customer gets a German reply. A French one gets French. We classify the language of the output and fail the case if it drifts back to English, because "multi-language" that quietly falls back to English on edge cases isn't multi-language.
  • No fabricated order data. This is the strict one. If the reply invents an order number, a tracking number, or a name that wasn't in the supplied context, the case fails outright. There is no partial credit for a confident lie.

That last check isn't only a grading rule. The same logic runs in production as a pre-send guard: before a reply reaches a customer, it's checked for invented order references, and if it trips the guard, it doesn't get sent. The eval measures something the live system also enforces.
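To make the idea of "a fixture that declares its own checks" concrete, here is a minimal sketch. It is an illustration, not the actual OrderWise code: the `Fixture` shape, the `#1042`-style order-number format, and the tiny stopword-based `guess_language` helper are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical fixture shape; the real fixture format is not shown in this post.
@dataclass(frozen=True)
class Fixture:
    message: str              # the customer's message
    order: Optional[dict]     # known order context, or None if no order matched
    expected_language: str    # language the reply must be written in, e.g. "de"

# Assumed order-number format for illustration: "#" followed by 4+ digits.
ORDER_REF = re.compile(r"#\d{4,}")

# Tiny stopword lists standing in for a real language classifier.
STOPWORDS = {
    "en": {"the", "your", "is", "and", "has"},
    "de": {"die", "ihre", "ist", "und", "wurde"},
    "fr": {"le", "votre", "est", "et", "été"},
}

def guess_language(text):
    """Return the language with the most stopword hits, or None if there are none."""
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    hits = {lang: len(words & stop) for lang, stop in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def score_reply(fixture, reply):
    """Run the three checks described above and return one boolean per check."""
    known = {fixture.order["number"]} if fixture.order else set()
    mentioned = set(ORDER_REF.findall(reply))
    return {
        # The reply must reference the order it was handed, if there was one.
        "order_context_used": fixture.order is None or fixture.order["number"] in reply,
        # The reply must be in the language the fixture expects.
        "correct_language": guess_language(reply) == fixture.expected_language,
        # Any order reference in the reply must come from the supplied context.
        "no_fabricated_order_data": mentioned <= known,
    }

fixture = Fixture(
    message="Wo ist meine Bestellung?",
    order={"number": "#1042", "status": "shipped"},
    expected_language="de",
)
print(score_reply(fixture, "Ihre Bestellung #1042 wurde versandt und ist unterwegs."))
print(score_reply(fixture, "Your order #9999 has shipped and is on the way."))
```

The first reply passes all three checks. The second fails all three: it ignores the supplied order, answers in English, and cites an order number that was never in the context. Because the fixture and its checks are fixed data, rerunning the same file against a new prompt gives a directly comparable result.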

The hallucination guard, concretely

The phrase "hallucination guard" gets thrown around loosely, so here's what it means in our case, mechanically.

The AI is only allowed to state order facts that came from a real lookup against the store's Shopify data. The order number, the carrier, the tracking link, the ETA — all of it has to trace back to an actual API result. If the model tries to produce an order reference that wasn't in the data it was given, the guard catches it before send and the reply is blocked rather than delivered.

The public target for this one is 100%. Not "low." Not "industry-leading." Every hallucination attempt in the eval has to be blocked, because the cost of one getting through — a customer told a refund happened when it didn't — is so much higher than the cost of the AI saying "let me check that for you." A wrong answer about someone's money is worse than a slow one. We grade accordingly.
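A pre-send guard of this kind can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production guard: the reference formats (`#1042`-style order numbers, tracking numbers as runs of 10 or more uppercase letters and digits) and the names `pre_send_guard` and `deliver` are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed formats, for illustration only.
ORDER_REF = re.compile(r"#\d{4,}")                # e.g. "#1042"
TRACKING_REF = re.compile(r"\b[A-Z0-9]{10,}\b")   # e.g. "1Z999AA10123456784"

# Fallback text sent instead of a blocked draft.
HANDOFF = "I don't have that information, let me get a human."

@dataclass(frozen=True)
class GuardResult:
    send: bool              # True if the draft may go to the customer
    unverified: frozenset   # references in the draft that no lookup returned

def pre_send_guard(draft, lookup_facts):
    """Block the draft if it cites an order or tracking reference not in lookup_facts."""
    cited = set(ORDER_REF.findall(draft)) | set(TRACKING_REF.findall(draft))
    unverified = frozenset(cited - set(lookup_facts))
    return GuardResult(send=not unverified, unverified=unverified)

def deliver(draft, lookup_facts):
    """Return the draft if every cited reference is verified, else the handoff text."""
    result = pre_send_guard(draft, lookup_facts)
    return draft if result.send else HANDOFF

# Facts that came back from a (hypothetical) order lookup.
facts = {"#1042", "1Z999AA10123456784"}
print(deliver("Order #1042 shipped with tracking 1Z999AA10123456784.", facts))
print(deliver("Your refund for order #7781 has been processed.", facts))
```

The second draft cites an order that no lookup returned, so the customer gets the handoff message instead. Note what this sketch does not do: it only checks identifiers it can pattern-match, so a claim like "your refund has been processed" with no order reference attached would need a separate check.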

Why I put the scores on a public page

We publish the results on our live quality report: the pass rate across the fixture set, the average score, how many hallucination attempts the guard blocked, and a breakdown by language and category. It's not a screenshot from a good week. It updates as we run the eval, and the targets are stated up front: an 85% pass rate, a 0.8 average score, 100% of hallucinations blocked.

I do this for a slightly self-interested reason: publishing the number forces us to keep it honest. Once a score is on a page a merchant can read, you can't quietly let it slide. A regression in the prompt that drops the pass rate becomes visible, to me and to anyone evaluating us. That pressure is the point.

It also flips the burden of proof. Instead of asking you to believe a tagline, I'm handing you the test and the result and inviting you to be skeptical. If a competitor's "70% resolution" is real, they can publish the fixtures and the scoring the same way. The fact that almost none of them do tells you something about how well those numbers would survive contact with a public page.
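The three stated targets (85% pass rate, 0.8 average score, 100% of hallucination attempts blocked) are simple enough to check mechanically after every run. The sketch below shows one way to do that. The `CaseResult` shape and the assumption that each case gets a score between 0 and 1 are mine, for illustration; the post does not specify how per-case scores are computed.

```python
from dataclasses import dataclass

# Targets as stated in the post.
TARGET_PASS_RATE = 0.85
TARGET_AVG_SCORE = 0.8
TARGET_BLOCK_RATE = 1.0

# Hypothetical per-case result; the field names are assumptions.
@dataclass(frozen=True)
class CaseResult:
    passed: bool
    score: float                  # assumed to be in the range 0.0 to 1.0
    hallucination_attempt: bool   # did the model try to invent order data?
    blocked: bool                 # did the guard stop that attempt?

def summarize(results):
    """Aggregate per-case results into the three headline numbers."""
    if not results:
        raise ValueError("no eval results to summarize")
    attempts = [r for r in results if r.hallucination_attempt]
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "avg_score": sum(r.score for r in results) / len(results),
        # With no attempts in the run, nothing got through, so report 1.0.
        "block_rate": sum(r.blocked for r in attempts) / len(attempts) if attempts else 1.0,
    }

def meets_targets(summary):
    """True only if all three published targets are met."""
    return (
        summary["pass_rate"] >= TARGET_PASS_RATE
        and summary["avg_score"] >= TARGET_AVG_SCORE
        and summary["block_rate"] >= TARGET_BLOCK_RATE
    )

run = (
    [CaseResult(True, 0.9, False, False)] * 16
    + [CaseResult(True, 0.9, True, True)] * 2
    + [CaseResult(False, 0.5, False, False)] * 2
)
summary = summarize(run)
print(summary, meets_targets(summary))
```

Treating the block rate as a hard 1.0 threshold rather than an average is the design choice that matters here: a single unblocked fabrication fails the whole run, no matter how good the other numbers look.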

The honest limitation: an eval is a floor, not a guarantee

I want to be straight about what this does and doesn't promise, because overclaiming here would defeat the entire point of the exercise.

The eval is a fixed set of conversations. Real customers will always find phrasings and edge cases the fixtures don't cover. A high pass rate means the AI reliably does the right thing on the situations we've encoded — it does not mean it will be perfect on a message no one anticipated. The language classifier is a heuristic and will occasionally shrug on very short text. The fixture set grows as we see new failure modes, but it's always behind reality by some margin.

So the eval is a floor, not a guarantee. It's evidence that we haven't shipped a regression on the things that matter most, and that the hallucination guard is doing its job on every case we can think to throw at it. It is emphatically not a promise that the AI is flawless. Anyone who tells you their support AI never makes a mistake is doing the exact thing I started this post complaining about.

What the floor buys you is the thing that actually matters: when our AI doesn't know, it's built to say so rather than invent. The guard makes "I don't have that information, let me get a human" the failure mode instead of a fabricated tracking number. For a store owner, a slow handoff is an annoyance. A confident lie about a customer's money is a refund, a bad review, and a trust problem you didn't choose.

How to use any of this when you're shopping

If you're comparing support AI for your store, the test is simple. Ask the vendor for their eval. Not their resolution rate: the fixtures, the scoring dimensions, and the live numbers. Ask specifically what happens when the model doesn't have the data: does it block and hand off, or does it guess? Most of this is covered on our pricing page and answered plainly in the FAQ. Where it isn't, ask me directly.

If a vendor can't show you a test, they don't have one. They have a number.

You can see ours, updated as we run it, on the quality report. Read it skeptically. That's what it's there for.
