Why evaluating LLMs is broken, and what to do about it.
The state of LLM evaluation today is "ask another LLM to grade the output and report a number." That works as a quick signal. It also hides three real problems that compound at scale: judges disagree with each other, judges disagree with themselves, and a single number tells you nothing about how much to trust the result. PANOPTES is built to make all three first-class.
A single number, taken at face value
Promptfoo, OpenAI Evals, LangSmith, Inspect: the major eval frameworks all share the same default shape. You write a rubric, you point one LLM at the responses to score them, you get back a scalar. Usually no second judge for comparison. No resampling budget. No confidence interval. No theoretical guarantee.
A single-judge score:
- no confidence interval
- no second judge for comparison
- no record of resampling variance
- "0.85" from one judge ≠ "0.85" from another
The same item in PANOPTES:
- 3 judges, 5 temperature samples each
- 65% epistemic variance. Call more judges.
- 35% aleatoric. Task itself is ambiguous.
- conformal interval at α = 0.10, finite-sample valid
Same task, same response, three judges, three different numbers.
Real PANOPTES data below. One HumanEval problem judged by three frontier LLMs at temperature 0 (solid dots), plus several samples each at temperature 1 (hollow dots). The vertical bars are per-judge means. If "the judges all measure the same thing on the same scale" were true, those dots would stack on top of each other. They don't.
Notice the two ways the "single number" story breaks. First, the judges' point estimates are structurally different even at temperature 0. They literally see this candidate differently. Second, each judge's own sampling-pass dots spread out at temperature 1. Even a single judge isn't sure what its own number should be.
Some noise is fixable. Some isn't.
Lumping all uncertainty into one ± something hides the fact that two very different things drive it. Aleatoric uncertainty comes from genuine task ambiguity. No amount of extra sampling resolves it. Epistemic uncertainty comes from disagreement between judges, which does shrink as you call more or stronger judges. The right action depends on which one dominates.
All judges agree, each is self-consistent. Score is reliable. Move on.
Judges disagree, but each is self-consistent. Add a third judge or escalate to a stronger one.
Judges agree on average but each is internally noisy. More temperature samples will tighten the CI on the mean score, even though the per-sample spread itself does not shrink.
Judges disagree AND each is noisy with itself. The task itself may be ambiguous. Surface, don't average.
PANOPTES estimates this split through nested resampling. The outer bootstrap is over judges (that captures epistemic), the inner bootstrap is over temperature samples within judge (that captures aleatoric). Those numbers feed straight into the routing layer. If epistemic dominates, the bandit calls another judge. If aleatoric dominates, calling more judges won't help, so it stops.
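The split can be illustrated with a much simpler stand-in than the nested bootstrap: a direct law-of-total-variance decomposition. This is a minimal sketch, not PANOPTES code. The function names, the 0.5 threshold, and the example scores are all assumptions made for illustration, and it assumes every judge contributed the same number of samples.

```python
import statistics

def decompose_variance(scores_by_judge):
    """Split score variance into within-judge and between-judge parts.

    scores_by_judge maps a judge name to its list of sampled scores.
    With equal sample counts per judge, the two parts add up exactly
    to the population variance of the pooled scores.
    """
    within = [statistics.pvariance(s) for s in scores_by_judge.values()]
    means = [statistics.fmean(s) for s in scores_by_judge.values()]
    aleatoric = statistics.fmean(within)     # average spread inside each judge
    epistemic = statistics.pvariance(means)  # spread between judge means
    total = aleatoric + epistemic
    fraction = epistemic / total if total > 0 else 0.0
    return {
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "epistemic_fraction": fraction,
    }

def next_action(decomposition, threshold=0.5):
    """Hypothetical routing rule: call another judge only if epistemic dominates."""
    if decomposition["epistemic_fraction"] > threshold:
        return "call_another_judge"
    return "stop"

# Made-up scores: three judges, five temperature samples each.
example_scores = {
    "judge_a": [0.80, 0.85, 0.78, 0.82, 0.81],
    "judge_b": [0.55, 0.60, 0.52, 0.58, 0.57],
    "judge_c": [0.90, 0.70, 0.95, 0.65, 0.85],
}

if __name__ == "__main__":
    parts = decompose_variance(example_scores)
    print(parts, next_action(parts))
```

A bootstrap version would resample judges in the outer loop and samples within each judge in the inner loop, then read the same two components off the resampled means.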
What does a "90% confidence interval" on an LLM score even mean?
Most CI machinery assumes Gaussian residuals or large-sample asymptotics. Neither is a safe assumption for an LLM judge score, which is bounded in [0, 1], often multimodal, and shaped by training data we don't get to see. PANOPTES sidesteps the whole problem with conformal prediction: a calibration recipe that guarantees the prediction interval contains the true value with probability at least 1 − α (marginally, over calibration and test draws), under nothing more than exchangeability of the calibration and test data. No parametric model. No Gaussian assumption. Finite-sample valid.
A classical parametric CI assumes the underlying distribution is Gaussian. That is mostly meaningless for an LLM-judge score that's bounded, multimodal, and shaped by alignment training. Coverage is a wish, not a guarantee.
A conformal interval is calibrated on a held-out set of n items: the ⌈(n+1)(1−α)⌉/n empirical quantile of the conformity scores gives a finite-sample-valid interval. No distributional assumption. We verify the guarantee empirically on /calibration.
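The quantile rule above can be sketched in a few lines of split conformal prediction. This is an illustration under stated assumptions, not the PANOPTES implementation: it takes the conformity score to be the absolute residual between a judge score and a reference label on the calibration set (the page does not say which score is used), and the function names are hypothetical.

```python
import math

def conformal_quantile(conformity_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest conformity score.

    This equals the ceil((n+1)(1-alpha))/n empirical quantile. If the
    calibration set is too small for the requested alpha, the only valid
    answer is an infinite radius.
    """
    n = len(conformity_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(conformity_scores)[k - 1]

def conformal_interval(prediction, calibration_residuals, alpha=0.10,
                       low=0.0, high=1.0):
    """Symmetric interval around a new score, clipped to the score range.

    Clipping to [low, high] cannot hurt coverage, because the true value
    is assumed to lie in that range anyway.
    """
    radius = conformal_quantile(calibration_residuals, alpha)
    return max(low, prediction - radius), min(high, prediction + radius)

# Made-up absolute residuals |reference - judge score| from 9 calibration items.
residuals = [0.05, 0.10, 0.02, 0.20, 0.15, 0.08, 0.12, 0.03, 0.30]

if __name__ == "__main__":
    print(conformal_interval(0.8, residuals, alpha=0.10))
```

With only 9 calibration items and α = 0.10 the rule selects the largest residual, which is why small calibration sets produce wide intervals.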
A stack of statistical methods, each paper-grounded
The framework composes six layers. The bottom layer talks to LLM providers. Everything above it is statistics. Every layer cites the paper its math comes from.
Thompson-sampling bandit decides which judges are worth calling per item.
Total variance split into aleatoric (irreducible) and epistemic (reducible).
Split, adaptive (CQR), and Mondrian conformal prediction. Finite-sample 1−α coverage guarantees.
Hierarchical-Gaussian EM combines noisy judges into a posterior over latent quality.
Semantic entropy + Bayesian-bootstrap self-consistency at temperature 1.
Anthropic, OpenAI, Google judges, all behind one provider-agnostic Protocol.
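To make the top layer concrete, here is a minimal Beta-Bernoulli Thompson-sampling sketch for choosing which judge to call. It is not the PANOPTES router. The class name, the binary "useful" reward, and the uniform Beta(1, 1) prior are assumptions made for illustration, since the page does not specify the reward signal.

```python
import random

class JudgeBandit:
    """Thompson sampling over judges with a Beta posterior per judge."""

    def __init__(self, judges, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: no initial preference between judges.
        self.successes = {judge: 1.0 for judge in judges}
        self.failures = {judge: 1.0 for judge in judges}

    def choose(self):
        """Draw one sample from each posterior and pick the largest draw."""
        draws = {
            judge: self.rng.betavariate(self.successes[judge], self.failures[judge])
            for judge in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, judge, useful):
        """Record whether calling this judge was worth it (hypothetical signal)."""
        if useful:
            self.successes[judge] += 1.0
        else:
            self.failures[judge] += 1.0

if __name__ == "__main__":
    bandit = JudgeBandit(["judge_a", "judge_b", "judge_c"], seed=0)
    judge = bandit.choose()
    # In a real loop, "useful" might mean the call narrowed the interval.
    bandit.update(judge, useful=True)
    print(judge, bandit.successes, bandit.failures)
```

Sampling from the posterior, rather than always picking the judge with the best average, keeps some exploration on judges that have only been called a few times.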
Five stages, every evaluation
When you run panoptes eval humaneval, every item flows through these five stages. Stages 1 through 3 produce the raw signal. Stage 4 turns that signal into a calibrated posterior. Stage 5 closes the loop by deciding what to do on the next item.
Use cases and potential impact
Labs such as Anthropic, OpenAI, and Google run large numbers of LLM-judge evals per release. PANOPTES can tell them which judgments to trust and which to escalate, instead of averaging through the noise.
If your paper says 'model X beats model Y by 4.2 points,' PANOPTES gives you an honest CI on that gap plus a calibrated p-value, instead of a brittle point estimate.
Quality control, content moderation, claim-verification. Anywhere an LLM judges another LLM's output and the result matters. Knowing when the judge isn't sure is the whole game.
The hierarchical-Gaussian aggregator exposes per-judge bias and precision as first-class outputs. PANOPTES lets you audit judge behavior, not just consume it.
The rest of the site holds the empirical receipts.
Every claim on this page is backed by data on the deeper pages. The calibration page measures whether the conformal guarantee actually holds. The judges page shows real inter-judge agreement. The runs and items pages show the framework in action on real benchmark data. The methods page lists every paper.