What is an AI eval?

An eval, or evaluation, is a systematic test of whether an AI system produces correct outputs, measured against known ground truth rather than judged by impression. It replaces 'does this sound right?' with 'is this right, and how often is it wrong?', converting a subjective feeling about a model into a number you can act on.

Why isn't a good demo enough to trust an AI?

A demo shows the best case, delivered confidently, while hiding the failure rate. Language models optimise for fluency, so a smooth, articulate answer is exactly what they produce whether or not it is correct. Judging AI by how it sounds rewards the very quality that disguises its errors. Real trust requires measuring how and how often it fails.

What is the problem with using an LLM to judge another LLM?

LLM-as-judge can assess fluency and coherence but cannot establish external truth—the judging model has no more access to the real fact than the model it grades. For anything checkable, the judge must be a computation, retrieval, or observation that provides genuine ground truth, not another model's confident opinion.

How does grounding relate to evaluation?

You can only evaluate what you can trace. In a computation-first system every factual claim resolves back to a real computed value, so each claim is checkable and the whole system becomes measurable. Grounding facts in computation delivers both honesty and the ability to verify it; leaving claims ungrounded makes them impossible to test.

Evals, Not Vibes: What It Means to Actually Measure What AI Knows

“We have built the most powerful language machines in history and we mostly assess them by how they make us feel. A good demo is not a measurement. It is a mood.”

The Vibes Economy

Watch how people actually decide whether an AI is good. They read a few answers, notice that it is fluent, articulate, confident—and they trust it. The judgment is aesthetic. It is a vibe.

This is understandable and completely inadequate. Fluency is precisely the thing a language model optimises for, so judging it by fluency is like judging a liar by how smoothly they speak. The smoother, the more trusted; the more trusted, the more dangerous when wrong. I made this argument about hallucination: a system that is rarely wrong, and wrong in beautiful prose, wins a trust it has not earned.

A demo shows you the best case, delivered confidently. An evaluation shows you the failure rate, delivered honestly. We keep choosing the demo and calling it evidence.

The entire consumer AI experience is tuned to produce the vibe and hide the failure rate. That is not a conspiracy; it is just what happens when the incentive is engagement and the interface is a confident sentence.

What an Eval Actually Is

The antidote is old and unglamorous: evaluation.

Eval (evaluation): In AI, an eval is a systematic test of whether a system produces correct outputs, measured against known ground truth rather than judged by impression. It replaces “does this sound right?” with “is this right, and how often is it wrong?”, turning a feeling into a number you can act on.

An eval is the discipline of checking. It asks the questions the vibe suppresses: Against what truth are we comparing? How often does the system fail? Where does it fail? Is it failing more or less than yesterday? This is the difference between engineering and theatre—between a system you can trust with something real and a system that merely performs trustworthiness.
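Those four questions map fairly directly onto code. What follows is a minimal sketch, not a production harness: `Case`, `run_eval`, and `toy_system` are hypothetical names invented for illustration, correctness is assumed to mean an exact string match, and yesterday's failure rate is a made-up figure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One eval case: a prompt, its known-correct answer, and a category tag."""
    prompt: str
    expected: str
    category: str


def run_eval(system, cases):
    """Run every case through `system`; report how often and where it fails."""
    failures = [c for c in cases if system(c.prompt) != c.expected]
    return {
        "failure_rate": len(failures) / len(cases),
        "failures_by_category": dict(Counter(c.category for c in failures)),
    }


def toy_system(prompt):
    """Hypothetical stand-in for the system under test (one wrong answer, one gap)."""
    answers = {"2 + 2": "4", "10 / 4": "2.5", "capital of France": "Lyon"}
    return answers.get(prompt, "")


# Against what truth are we comparing? The `expected` field of each case.
cases = [
    Case("2 + 2", "4", "arithmetic"),
    Case("10 / 4", "2.5", "arithmetic"),
    Case("capital of France", "Paris", "geography"),
    Case("capital of Japan", "Tokyo", "geography"),
]

report = run_eval(toy_system, cases)

# Is it failing more or less than yesterday? (Assumed figure from a previous run.)
yesterday_failure_rate = 0.25
regressed = report["failure_rate"] > yesterday_failure_rate
```

Here the toy system fails half the cases, all of them in geography, and has regressed against the assumed earlier run. None of that is visible from reading a single fluent answer.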

The reason this matters is not academic. As AI moves from novelty to infrastructure—into decisions about health, money, law, and how people understand themselves—“it seemed right” stops being an acceptable standard. You would not accept it from a bridge. We should not accept it from a system people consult about their lives.

Grounding Is What Makes Evaluation Possible

Here is the connection that took me a while to see clearly, and it changed how I build. You can only evaluate what you can trace.

If an AI produces a free-floating claim—an assertion connected to nothing—there is nothing to check it against. But in a computation-first system, every factual claim traces back to a real computation. The AI says a planet is weak; that resolves to an actual strength score. It says a period begins in a certain year; that resolves to a dated calculation. Because each claim is anchored, each claim is checkable—which means the whole system is evaluable.
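One way to picture this is a claim that carries a pointer to the computed value behind it. The sketch below is illustrative only: `Claim`, `check_claim`, the keys in `computed`, and every number are hypothetical, and they do not describe any particular engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A statement plus an optional pointer to the computation behind it."""
    text: str
    source_key: Optional[str] = None      # None marks a free-floating claim
    stated_value: Optional[float] = None  # the number the claim asserts


def check_claim(claim, computed, tolerance=1e-9):
    """Return 'verified', 'contradicted', or 'uncheckable' for one claim."""
    if (
        claim.source_key is None
        or claim.stated_value is None
        or claim.source_key not in computed
    ):
        return "uncheckable"  # nothing to trace, so nothing to test
    if abs(computed[claim.source_key] - claim.stated_value) <= tolerance:
        return "verified"
    return "contradicted"


# Hypothetical outputs of a deterministic engine (values invented for illustration).
computed = {"planet_strength": 0.31, "period_start_year": 2027.0}

claims = [
    Claim("The planet is weak, with a strength score of 0.31.", "planet_strength", 0.31),
    Claim("The period begins in 2026.", "period_start_year", 2026.0),
    Claim("This will be a meaningful year."),  # anchored to nothing
]

results = [check_claim(c, computed) for c in claims]
```

The first claim is verified, the second is caught as wrong, and the third cannot be graded at all. That third outcome is the point: an unanchored claim is not wrong, it is untestable.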

Grounding and evaluation are two sides of one coin. Ground your facts in computation, and you get honesty and the ability to measure it. Leave them floating, and you get fluency you can neither trust nor test.

The Trap of Judging AI With AI

The industry’s fashionable shortcut is “LLM-as-judge”, using one language model to grade another. It is useful for scale, and I use it carefully, but it carries an obvious hazard that the vibe economy loves to ignore: you are using a guesser to grade a guesser. The judging model can assess fluency and coherence, but it has no more access to the real fact than the model it grades.

This is why, for anything checkable, the judge has to be the deterministic engine, not a second model’s impression. The eval’s authority comes from the ground truth behind it, and ground truth is computed, retrieved, or observed—never merely asserted, however eloquently.
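As a small illustration of a computed judge, consider a date calculation. This is a sketch under assumptions: `days_between` and `deterministic_judge` are hypothetical names, and the "model answers" are invented. The judge never asks whether an answer sounds plausible; it recomputes the value and compares.

```python
from datetime import date


def days_between(start, end):
    """Ground truth by computation: the exact number of days from start to end."""
    return (end - start).days


def deterministic_judge(model_answer, start, end):
    """Grade an answer by recomputing it, not by asking a second model."""
    return model_answer == days_between(start, end)


start, end = date(2024, 1, 1), date(2024, 3, 1)

# A plausible-sounding answer that forgets 2024 is a leap year.
fluent_but_wrong = deterministic_judge(59, start, end)

# The computed answer: 31 days of January plus 29 days of February.
correct = deterministic_judge(60, start, end)
```

An answer of 59 days reads as perfectly reasonable, and a second model grading on impression might well accept it. The computation rejects it without needing an opinion.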

Why I Named It Eternal Evals

All of this is packed into the name. I tell the founder story in The Machine That Refused to Guess, but the name itself was the thesis.

Evals, because I wanted a system defined by measurement rather than mood—every claim traceable, every number auditable, the whole thing built to be checked rather than admired. Eternal, because the questions it engages are the old ones: timing, character, meaning, the shape of a life. Holding the eternal questions to an honest, evaluable standard, instead of surrendering them to whoever writes the most confident sentence, felt like the entire point.

Evaluating Yourself

There is a reflexive turn here that I cannot resist, because it is really the deepest layer. The same discipline that separates good AI from confident nonsense also separates self-knowledge from self-flattery.

Much of what passes for insight—about ourselves, our futures, our charts—is a comfortable vibe we do not check. A good horoscope, like a good chatbot, is optimised to feel true. The harder, better path is the evaluated one: grounding what you believe about yourself in something real, and being willing to find out you were wrong. That is what I explored in asking whether AI can be conscious—the value was never in the machine’s certainty, but in the honesty of the inquiry.

An eval, in the end, is just intellectual honesty with a number attached. We could use a great deal more of it—in our machines, and in ourselves. You can judge the engine at eternalevals.com the way it asks to be judged: not by whether it sounds wise, but by whether it is true.
