There is a ritual I have witnessed at every company currently building with large language models, and I suspect you have witnessed it too. A product manager opens a chat window, types eight or nine questions of the sort a reasonable user might ask, reads the answers, and pronounces the feature good. Sometimes a designer is summoned to confirm the verdict. Sometimes, for the truly rigorous, an engineer tries to make it say something rude and fails. Then it ships.
This is not evaluation. This is a job interview conducted by someone who has already decided to hire the candidate.
At Intellect we ship generative AI features — triage, a chat companion, session preparation, treatment formatting — on a platform serving over four million users, in a domain where a wrong answer carries actual weight. I have written elsewhere about the governance side of that: who decides, how clinical and compliance teams get embedded into design. This essay is about the lower-altitude question that governance cannot answer for you. Once everyone agrees the feature should exist, how do you actually know it works before you let real people use it? The setting happens to be mental health. The craft is not. Everything below applies to a loan assistant, a support bot, a resume screener — anything where a model's output lands on a person.
Why doesn't trying the feature count as testing it?
With deterministic software, manual testing has a respectable logic. If the checkout flow works when you try it, it will work when a customer tries it, because the code does the same thing every time. Test once, trust forever, or at least until the next deploy.
An LLM feature deletes that logic. You are no longer testing a function; you are sampling a distribution. Ten good outputs tell you about those ten outputs and nothing else. The eleventh — produced from a slightly different phrasing, a longer conversation history, a user who writes the way actual users write rather than the way PMs write when impersonating users — is a fresh draw, and the distribution has tails.
Worse, the failures are camouflaged. These models are optimised to sound right, which is precisely what makes them dangerous to eyeball. A confident, fluent, well-structured answer and a confident, fluent, well-structured wrong answer look identical to anyone who does not already know the correct one. The demo is a sample of one, drawn by a person who is hoping it works.
A demo is an anecdote. An evaluation is a sample large enough to bet on. The entire playbook that follows is just that sentence, taken seriously.
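To put a number on "large enough to bet on", here is a minimal sketch of the standard zero-failure bound: if a feature passes n independent trials with no failures, the worst true failure rate still consistent with that evidence at a given confidence level solves (1 - p)^n = 1 - confidence. The function name is hypothetical, and the independence assumption is generous to a demo, whose questions are hand-picked rather than drawn from real traffic.

```python
def max_failure_rate(n_passes: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after n passes and zero failures.

    Assumes every trial is an independent draw from the same distribution
    of inputs, which a hand-picked demo is not.
    """
    if n_passes <= 0:
        raise ValueError("n_passes must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Solve (1 - p) ** n = alpha for p.
    return 1.0 - alpha ** (1.0 / n_passes)


if __name__ == "__main__":
    for n in (10, 100, 300, 1000):
        print(n, round(max_failure_rate(n), 4))
```

Ten clean answers are still compatible with a failure rate of roughly one in four. Getting the bound under one percent takes about three hundred clean, representative cases.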
The vignette method, or how to catch a model being inconsistent
In 2025 we published a study on PsyArXiv evaluating nine LLMs using standardised vignettes — short, controlled scenario descriptions where we knew exactly what a sound assessment looked like, because we had written the scenario. We held everything constant and varied one detail at a time: age, gender, ethnicity. Then we measured how the models' assessments moved.
They moved. Identical situations drew materially different responses when the only thing that changed was a demographic detail that should not have mattered. No amount of chatting with those models would have surfaced this, because no human tester sends the same case twice with one word changed and diffs the answers. That is the whole trick, and it generalises far beyond our domain:
Write scenarios where you know the right answer in advance. Hold everything constant. Vary one attribute. Run it at volume. Measure the deltas, not just the accuracy.
If you run a lending product, the vignette is a loan application and the varied attribute is a postcode. If you run a hiring tool, it is a CV and a name. If two users who differ only in a detail that should not matter get materially different answers, you have found a real bug, the species of bug that no demo, however enthusiastic, will ever show you. The vignette's power is that it converts "the model seems fair" from a feeling into a measurement, and measurements have the agreeable property that you can argue about them in meetings.
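The recipe above can be sketched in a few lines. This is an illustration under stated assumptions, not the harness from our study: `Vignette`, `max_delta` and the `score` callable are hypothetical names, and `score` stands in for whatever calls your model and reduces its answer to a number (a risk rating, an approval probability, a rubric grade).

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Vignette:
    template: str              # scenario text with a single {attribute} slot
    variants: Tuple[str, ...]  # values that should NOT change the right answer


def max_delta(
    vignette: Vignette,
    score: Callable[[str], float],
    runs: int = 5,
) -> Tuple[float, Dict[str, float]]:
    """Return the largest gap in mean score between any two variants.

    `score` is a hypothetical stand-in for "call the model, grade the
    output". Each variant is scored `runs` times because a single draw
    says little about a distribution.
    """
    if len(vignette.variants) < 2:
        raise ValueError("need at least two variants to compare")
    means: Dict[str, float] = {}
    for variant in vignette.variants:
        text = vignette.template.format(attribute=variant)
        samples = [score(text) for _ in range(runs)]
        means[variant] = sum(samples) / len(samples)
    gap = max(abs(means[a] - means[b]) for a, b in combinations(means, 2))
    return gap, means
```

The number worth alerting on is the gap, not the mean. A model can be accurate on average and still treat two otherwise identical cases differently.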
Mapping the failure states before writing the tests
Before you can write evals, you need to know what you are examining for, and the honest answer is rarely "accuracy". It is a ranked list of ways the feature can go wrong. I call it a "harm taxonomy": a list of every failure we can imagine, ordered by the severity of what it does to a user. A clumsy tone and a missed escalation are both on the list, but they sit nowhere near each other.
The taxonomy is usually filed under governance, but its practical function is humbler: it is the syllabus for the eval suite. Each failure mode becomes a family of test cases, and each severity level sets a pass threshold. A feature can ship with a 95 percent pass rate on tone. It probably cannot ship with only a 95 percent pass rate on detecting that a user is in danger. In mental health, every data point is a user in distress, not a rounding error. The pass threshold is a product decision, and pretending otherwise is how teams ship things they should not.
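As a data structure, a taxonomy can be very small. The sketch below is illustrative only: the two failure modes come from the paragraph above, the 0.95 figure for tone is the one just quoted, and every other name and threshold is an assumption chosen to show the shape, not a recommendation.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    LOW = 1
    MODERATE = 2
    CRITICAL = 3


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: Severity


# Minimum pass rate per severity level. Only the LOW figure echoes the
# text; the MODERATE and CRITICAL values are placeholders.
PASS_THRESHOLD = {
    Severity.LOW: 0.95,
    Severity.MODERATE: 0.99,
    Severity.CRITICAL: 1.0,
}

TAXONOMY = [
    FailureMode("clumsy_tone", Severity.LOW),
    FailureMode("missed_escalation", Severity.CRITICAL),
]


def ranked(taxonomy: List[FailureMode]) -> List[FailureMode]:
    """Worst failures first: the order in which to write test families."""
    return sorted(taxonomy, key=lambda mode: mode.severity, reverse=True)


def threshold_for(mode: FailureMode) -> float:
    """The pass rate a test family for this failure mode must reach."""
    return PASS_THRESHOLD[mode.severity]
```

Keeping the thresholds in one visible table is the point. It makes the product decision explicit and reviewable instead of leaving it buried in a test runner.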
What does a release gate for an LLM feature actually look like?
Mechanically, ours looks like this. A golden set — hundreds of curated inputs spanning the taxonomy, from the boring centre of the distribution out to the ugly tails, including the vignette pairs described above. An automated grading layer, typically a stronger model judging the feature model's outputs against rubrics, with the judge itself periodically calibrated against human raters so we know the examiner has not gone soft. Thresholds per severity tier. And a rule with no exceptions clause: if the suite fails, the feature does not ship, and there is no meeting at which somebody charismatic argues that it is probably fine.
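Stripped of the grading layer, the gate itself is a small function. A minimal sketch, assuming each graded case has already been reduced to a (tier, passed) pair. The name `gate` and the tier labels are hypothetical, and a tier with a threshold but no test cases fails closed, which is a design choice of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gate(
    results: Iterable[Tuple[str, bool]],
    thresholds: Dict[str, float],
) -> Tuple[bool, Dict[str, float]]:
    """Decide whether a release may ship.

    results: (tier, passed) pairs, one per graded golden-set case.
    thresholds: minimum pass rate required for each tier.
    Returns (ship, pass rate per tier).
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for tier, passed in results:
        totals[tier] += 1
        passes[tier] += int(passed)
    rates = {tier: passes[tier] / totals[tier] for tier in totals}
    # Fail closed: a tier with no cases counts as a pass rate of 0.0.
    ship = all(rates.get(tier, 0.0) >= minimum for tier, minimum in thresholds.items())
    return ship, rates
```

The function returns a boolean, not a score to be discussed. Wiring that boolean into the build is what removes the meeting.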
If that sounds like continuous integration for prompts, it should. The discipline is borrowed wholesale from software engineering, applied to an artefact engineers were never asked to test before: behaviour.
The suite pays a second dividend. Model routing — deciding which tasks run on an expensive frontier model and which on something a tenth of the price — is only possible because the evals exist. Without them, every routing debate collapses into duelling vibes about which model "feels smarter". With them, the question becomes empirical: does the cheaper model pass the gate for this task or not? Treatment formatting passes on a small model. Triage does not. The eval suite is the instrument that lets you economise with a straight face.
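Once the gate exists, routing reduces to a lookup: for each task, pick the cheapest model that passes. The sketch below assumes gate results are already recorded per (task, model) pair. The model names, task names and costs are invented for illustration.

```python
from typing import Dict, Optional, Tuple


def cheapest_passing(
    task: str,
    cost_per_call: Dict[str, float],
    gate_passed: Dict[Tuple[str, str], bool],
) -> Optional[str]:
    """Return the cheapest model that passed the gate for this task.

    A (task, model) pair with no recorded gate run counts as not passed.
    Returns None when no model qualifies, meaning the task cannot ship.
    """
    for model in sorted(cost_per_call, key=lambda name: cost_per_call[name]):
        if gate_passed.get((task, model), False):
            return model
    return None


# Hypothetical numbers: "small" costs a tenth of "frontier".
COSTS = {"frontier": 1.0, "small": 0.1}
GATE = {
    ("treatment_formatting", "small"): True,
    ("treatment_formatting", "frontier"): True,
    ("triage", "small"): False,
    ("triage", "frontier"): True,
}
```

With these made-up results, treatment formatting routes to the small model and triage stays on the frontier model, mirroring the split described above.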
When is human review mandatory?
Inside Intellect, every interaction is classified into a tiered risk system: Level 1 is standard, Level 2 moderate, Level 3 high-risk, and Level 3A acute — the tier at which the system stops deliberating and a human being intervenes immediately. The clinical detail is ours; the design pattern underneath is portable to any AI product. Automation confidence must run inversely to severity, and the handoff to a human should be designed before the feature is, not bolted on after the first incident.
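One way to encode "automation confidence must run inversely to severity" is a confidence floor that rises with the tier and disappears entirely at the top. This is a sketch of the pattern, not a description of our production system: the tier labels follow the text, but the numeric floors and the function name are assumptions.

```python
from enum import Enum
from typing import Dict, Optional


class Tier(Enum):
    L1 = "standard"
    L2 = "moderate"
    L3 = "high-risk"
    L3A = "acute"


# Minimum model confidence required before automation may proceed.
# The numbers are placeholders. None means "never automated".
MIN_CONFIDENCE: Dict[Tier, Optional[float]] = {
    Tier.L1: 0.50,
    Tier.L2: 0.80,
    Tier.L3: 0.95,
    Tier.L3A: None,
}


def needs_human(tier: Tier, confidence: float) -> bool:
    """True when the interaction must be handed to a person."""
    floor = MIN_CONFIDENCE[tier]
    if floor is None:
        # Top tier: the model's confidence is not consulted at all.
        return True
    return confidence < floor
```

The important line is the early return. At the top tier the model's own confidence never enters the decision, so no amount of it can talk the system out of a handoff.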
In practice, human review is mandatory in three places, and I would defend all three in any domain:
First, the top severity tier in production — always, regardless of how confident the model claims to be. Our 3A path ends at a person within minutes. Yours might end at a fraud analyst or a lawyer. Every product has a 3A; the only question is whether you have named it.
Second, before any genuinely new class of feature launches, humans grade a large sample of outputs by hand. Automated judges are calibrated on failure modes you have already met. A new feature class produces new ones, and the first hundred of those need human eyes, not a rubric written for the previous feature.
Third, a standing sampled audit of production outputs, forever. Models drift, prompts accrete, user behaviour mutates. The eval suite tells you the feature was good when you shipped it. Only sampling tells you it still is.
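For the standing audit, one common approach is to sample deterministically by hashing an interaction identifier, so the same interaction is always in or out of the sample no matter which service asks. A minimal sketch. The function name and the 5 percent rate are illustrative, and top-tier interactions would bypass sampling entirely because they are always reviewed.

```python
import hashlib


def in_audit_sample(interaction_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of all interactions.

    The first 8 bytes of a SHA-256 digest are mapped to a number in
    [0, 1) and compared with the sampling rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    ids = [f"interaction-{i}" for i in range(10_000)]
    picked = sum(in_audit_sample(i, 0.05) for i in ids)
    print(f"{picked} of {len(ids)} selected for human audit")
```

Hashing rather than calling a random number generator makes the sample reproducible. An auditor can later verify which interactions should have been reviewed.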
Prompts are code. Test them like code.
A one-word change to a system prompt is a deploy. It alters production behaviour exactly as surely as a code change does, and yet I have watched teams that would never merge a pull request without tests cheerfully rewrite a prompt at 6pm and go home. The fix is unglamorous: version the prompts, run the full suite on every change, and treat a regression as a failed build rather than an interesting observation.
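A minimal sketch of what "version the prompts and treat a regression as a failed build" can mean in practice. It assumes per-case pass/fail results are stored for the baseline prompt and the candidate. Both function names are hypothetical, and a content hash is only one of several reasonable versioning schemes.

```python
import hashlib
from typing import Dict, List


def prompt_version(prompt: str) -> str:
    """Content-addressed version id: any edit, however small, changes it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case ids that passed under the baseline prompt but not the candidate.

    A case missing from the candidate run counts as a failure, so a
    partial run cannot pass by omission.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


if __name__ == "__main__":
    import sys

    old = {"tone-01": True, "escalation-07": True}
    new = {"tone-01": True, "escalation-07": False}
    broken = regressions(old, new)
    if broken:
        print("regressed:", broken)
        sys.exit(1)  # a non-zero exit fails the build
```

The exit code is the whole mechanism. A regression that merely prints a warning is an interesting observation, and one that exits non-zero is a failed build.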
Model upgrades deserve the same paranoia at greater scale. Swapping the underlying model is a platform migration wearing a changelog that says "improved". We once took an upgrade that was better on nearly every measure we tracked — and the suite caught it behaving subtly differently on one escalation path. No human chatting with the feature would have noticed, because every individual answer still looked excellent. The diff only existed in aggregate, which is the only place these models can really be seen.
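A diff that only exists in aggregate is found by comparing pass rates slice by slice, never only overall. The sketch below uses invented numbers, and the slice names and function name are assumptions. It shows how a candidate model can improve in total while still regressing on one path.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int]]  # slice name -> (passes, total)


def slice_drops(old: Counts, new: Counts, tolerance: float = 0.0) -> Dict[str, float]:
    """Slices whose pass rate fell by more than `tolerance`.

    Both runs must cover the same slices. A slice missing from `new`
    raises KeyError rather than passing silently.
    """
    drops: Dict[str, float] = {}
    for name, (old_pass, old_total) in old.items():
        new_pass, new_total = new[name]
        delta = new_pass / new_total - old_pass / old_total
        if delta < -tolerance:
            drops[name] = delta
    return drops


# Invented numbers: the candidate is better overall (244/250 vs 230/250)
# yet worse on the one slice that matters most.
OLD = {"general": (180, 200), "escalation": (50, 50)}
NEW = {"general": (196, 200), "escalation": (48, 50)}
```

Here the overall pass rate rises, and a single headline number would have waved the upgrade through. The per-slice comparison flags the escalation path.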
The eval suite is the only institutional memory an AI product has. People leave, prompts get rewritten, models get swapped under you on a vendor's schedule rather than your own. The suite is the one artefact that remembers every way the product has previously failed and checks, every single release, that it has not started failing in those ways again.
The three papers we have put out — the bias evaluation across nine models, a protocol for cutting the time between detecting risk and acting on it, a composite framework for measuring provider quality — look, from the outside, like academic output. They are really just lab notes from the same habit: refusing to take the model's word for it, then writing down what we found.
The models will keep improving, which is the genuinely dangerous part. Every improvement makes the demo more convincing and the eyeball test less reliable, so the better the model gets, the better your exam has to be. Ours gets longer every quarter. I no longer expect that to stop.