Evals, Not Vibes: What It Means to Actually Measure What AI Knows

We judge AI by how it feels, not whether it is right. Here is what an evaluation really is, why grounding makes it possible, and why I named my system Eternal Evals.

“We have built the most powerful language machines in history and we mostly assess them by how they make us feel. A good demo is not a measurement. It is a mood.”


The Vibes Economy

Watch how people actually decide whether an AI is good. They read a few answers, notice that it is fluent, articulate, confident—and they trust it. The judgment is aesthetic. It is a vibe.

This is understandable and completely inadequate. Fluency is precisely the thing a language model optimises for, so judging it by fluency is like judging a liar by how smoothly they speak. The smoother, the more trusted; the more trusted, the more dangerous when wrong. I made this argument about hallucination: a system that is rarely wrong, but wrong in beautiful prose, wins a trust it has not earned.

A demo shows you the best case, delivered confidently. An evaluation shows you the failure rate, delivered honestly. We keep choosing the demo and calling it evidence.

The entire consumer AI experience is tuned to produce the vibe and hide the failure rate. That is not a conspiracy; it is just what happens when the incentive is engagement and the interface is a confident sentence.

What an Eval Actually Is

The antidote is old and unglamorous: evaluation.

Eval (evaluation) technology

In AI, an eval is a systematic test of whether a system produces correct outputs, measured against known ground truth rather than judged by impression. It replaces “does this sound right?” with “is this right, and how often is it wrong?”—turning a feeling into a number you can act on.

An eval is the discipline of checking. It asks the questions the vibe suppresses: Against what truth are we comparing? How often does the system fail? Where does it fail? Is it failing more or less than yesterday? This is the difference between engineering and theatre—between a system you can trust with something real and a system that merely performs trustworthiness.
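To make those four questions concrete, here is a minimal sketch of an eval harness in Python. It is an illustration, not the Eternal Evals implementation: the names EvalCase, run_eval, and toy_system, the sample cases, and the stored "yesterday" figure are all assumptions invented for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One test item: a prompt, its known-correct answer, and a category tag."""
    prompt: str
    expected: str
    category: str


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Compare each output to ground truth; report how often and where it fails."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    failures = [c for c in cases if system(c.prompt).strip() != c.expected]
    return {
        "total": len(cases),
        "failure_rate": len(failures) / len(cases),
        "failures_by_category": dict(Counter(c.category for c in failures)),
    }


def toy_system(prompt: str) -> str:
    """Hypothetical system under test: a lookup table standing in for a model."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "7 * 8": "54"}
    return answers.get(prompt, "")


cases = [
    EvalCase("2 + 2", "4", "arithmetic"),
    EvalCase("7 * 8", "56", "arithmetic"),
    EvalCase("capital of France", "Paris", "geography"),
    EvalCase("capital of Japan", "Tokyo", "geography"),
]

today = run_eval(toy_system, cases)
yesterday_failure_rate = 0.25  # assumed figure from a previous run, for illustration
regressed = today["failure_rate"] > yesterday_failure_rate
print(today, "regressed:", regressed)
```

With these made-up cases the toy system fails two of four, one in each category, so the failure rate is 0.5 and the run counts as a regression against the assumed earlier figure. The point is the shape: ground truth in, a number and a breakdown out.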

The reason this matters is not academic. As AI moves from novelty to infrastructure—into decisions about health, money, law, and how people understand themselves—“it seemed right” stops being an acceptable standard. You would not accept it from a bridge. We should not accept it from a system people consult about their lives.

Grounding Is What Makes Evaluation Possible

Here is the connection that took me a while to see clearly, and it changed how I build. You can only evaluate what you can trace.

If an AI produces a free-floating claim—an assertion connected to nothing—there is nothing to check it against. But in a computation-first system, every factual claim traces back to a real computation. The AI says a planet is weak; that resolves to an actual strength score. It says a period begins in a certain year; that resolves to a dated calculation. Because each claim is anchored, each claim is checkable—which means the whole system is evaluable.

Grounding and evaluation are two sides of one coin. Ground your facts in computation, and you get honesty and the ability to measure it. Leave them floating, and you get fluency you can neither trust nor test.
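A small sketch may help show why an anchored claim is checkable and a floating one is not. Everything here is hypothetical: the strength scores, the 0.4 cutoff for "weak", and the Claim and check names are assumptions made for illustration, not values or code from the actual engine.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical engine output: planet name -> computed strength score (0 to 1).
COMPUTED_STRENGTH = {"Mars": 0.22, "Jupiter": 0.81}
WEAK_THRESHOLD = 0.4  # assumed cutoff, for illustration only


@dataclass(frozen=True)
class Claim:
    text: str               # what the model said
    planet: Optional[str]   # the computation the claim traces to, if any
    asserts_weak: bool      # whether the claim says that planet is weak


def check(claim: Claim) -> str:
    """Return 'supported', 'contradicted', or 'uncheckable'."""
    if claim.planet is None or claim.planet not in COMPUTED_STRENGTH:
        return "uncheckable"  # free-floating: nothing to compare against
    is_weak = COMPUTED_STRENGTH[claim.planet] < WEAK_THRESHOLD
    return "supported" if is_weak == claim.asserts_weak else "contradicted"


claims = [
    Claim("Mars is weak in this chart.", "Mars", True),
    Claim("Jupiter is weak in this chart.", "Jupiter", True),
    Claim("Great things are coming.", None, False),
]

results = [check(c) for c in claims]
print(results)
```

The first claim resolves to a score below the cutoff and is supported; the second resolves to a score above it and is contradicted; the third resolves to nothing, so no verdict is possible. Only the first two can ever appear in an eval.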

The Trap of Judging AI With AI

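The industry’s fashionable shortcut is “LLM-as-judge”: using one language model to grade another. It is useful for scale, and I use it carefully, but it carries an obvious hazard that the vibe economy loves to ignore: you are using a guesser to grade a guesser. A judge model can be taken in by the same fluency that fools a human reader, so its mistakes are not independent of the mistakes it is grading.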

This is why, for anything checkable, the judge has to be the deterministic engine, not a second model’s impression. The eval’s authority comes from the ground truth behind it, and ground truth is computed, retrieved, or observed—never merely asserted, however eloquently.
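As a rough sketch of what a deterministic judge could look like for the "period begins in a certain year" case: the judge pulls the stated year out of the model's text and compares it with a computed one, ignoring how the sentence sounds. The function names and the placeholder arithmetic in computed_start_year are assumptions; the real calculation is not described in this article.

```python
import re
from typing import Optional


def computed_start_year(birth_year: int, offset_years: int) -> int:
    """Placeholder for the deterministic engine (hypothetical arithmetic)."""
    return birth_year + offset_years


def extract_year(text: str) -> Optional[int]:
    """Find the first four-digit year between 1800 and 2099 in the text."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", text)
    return int(match.group(1)) if match else None


def deterministic_judge(model_text: str, truth: int) -> bool:
    """Pass only if the stated year equals the computed one; style is ignored."""
    return extract_year(model_text) == truth


truth = computed_start_year(1990, 36)  # 2026 under the placeholder arithmetic

verdicts = [
    deterministic_judge("The period begins in 2026.", truth),
    deterministic_judge("With great certainty, this luminous period dawns in 2027.", truth),
    deterministic_judge("A wonderful period is coming soon.", truth),
]
print(verdicts)
```

The plain sentence passes, the eloquent one with the wrong year fails, and the one that states no year at all fails too. A second model might be charmed by the second answer; the comparison cannot be.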

Why I Named It Eternal Evals

All of this is packed into the name. I tell the founder story in The Machine That Refused to Guess, but the name itself was the thesis.

Evals, because I wanted a system defined by measurement rather than mood—every claim traceable, every number auditable, the whole thing built to be checked rather than admired. Eternal, because the questions it engages are the old ones: timing, character, meaning, the shape of a life. Holding the eternal questions to an honest, evaluable standard, instead of surrendering them to whoever writes the most confident sentence, felt like the entire point.

Evaluating Yourself

There is a reflexive turn here that I cannot resist, because it is really the deepest layer. The same discipline that separates good AI from confident nonsense also separates self-knowledge from self-flattery.

Much of what passes for insight—about ourselves, our futures, our charts—is a comfortable vibe we do not check. A good horoscope, like a good chatbot, is optimised to feel true. The harder, better path is the evaluated one: grounding what you believe about yourself in something real, and being willing to find out you were wrong. That is what I explored in asking whether AI can be conscious—the value was never in the machine’s certainty, but in the honesty of the inquiry.

An eval, in the end, is just intellectual honesty with a number attached. We could use a great deal more of it—in our machines, and in ourselves. You can judge the engine at eternalevals.com the way it asks to be judged: not by whether it sounds wise, but by whether it is true.

