background

Why evaluating LLMs is broken, and what to do about it.

The state of LLM evaluation today is "ask another LLM to grade the output and report a number." That works as a quick signal. It also hides three real problems that compound at scale: judges disagree with each other, judges disagree with themselves, and a single number tells you nothing about how much to trust the result. PANOPTES is built to measure all three and report them as first-class outputs.

what the field looks like today

A single number, taken at face value

Promptfoo, OpenAI Evals, LangSmith, Inspect. The major eval frameworks all share the same shape. You write a rubric, you point one LLM at the responses to score them, you get back a scalar. Usually no second judge for comparison. No resampling budget. No confidence interval. No theoretical guarantee.

most eval frameworks

"GPT-4 graded it 0.85."

0.85

· no confidence interval
· no second judge for comparison
· no record of resampling variance
· "0.85" from one judge ≠ "0.85" from another

PANOPTES

"True quality is in [0.71, 0.93] with 90% coverage."

0.82 ± 0.11

· 3 judges, 5 temperature samples each
· 65% epistemic variance. Call more judges.
· 35% aleatoric. Task itself is ambiguous.
· conformal interval at α = 0.10, finite-sample valid

problem 1 · judges are noisy

Same task, same response, three judges, three different numbers.

Real PANOPTES data below. One HumanEval problem judged by three frontier LLMs at temperature 0 (solid dots), plus several samples each at temperature 1 (hollow dots). The vertical bars are per-judge means. If "the judges all measure the same thing on the same scale" were true, those dots would stack on top of each other. They don't.

[Figure: item HumanEval/11 · range: 0.500 · legend: point pass (temp 0), sampling draws (temp 1), mean]

Notice the two ways the "single number" story breaks. First, the judges' point estimates are structurally different even at temperature 0. They literally see this candidate differently. Second, each judge's own sampling-pass dots spread out at temperature 1. Even a single judge isn't sure what its own number should be.
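Both kinds of spread can be put into numbers with a few lines of Python. This is a minimal sketch, not PANOPTES code: the judge names and scores are made up for illustration, and the function names are hypothetical.

```python
from statistics import pstdev

# Hypothetical scores for one item. Illustrative values, not PANOPTES data.
point_scores = {"judge_a": 0.25, "judge_b": 0.50, "judge_c": 0.75}
sampled_scores = {
    "judge_a": [0.25, 0.50, 0.25, 0.00],
    "judge_b": [0.50, 0.50, 0.75, 0.50],
    "judge_c": [0.75, 1.00, 0.75, 0.50],
}


def inter_judge_range(points):
    """Gap between the highest and lowest temperature-0 score."""
    values = list(points.values())
    return max(values) - min(values)


def within_judge_spread(samples):
    """Standard deviation of each judge's temperature-1 draws."""
    return {judge: pstdev(draws) for judge, draws in samples.items()}


print("range across judges:", inter_judge_range(point_scores))
print("spread within judges:", within_judge_spread(sampled_scores))
```

A nonzero range means the judges disagree with each other. A nonzero within-judge spread means a judge disagrees with itself.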

problem 2 · two kinds of uncertainty

Some noise is fixable. Some isn't.

Lumping all uncertainty into one ± something hides the fact that two very different things drive it. Aleatoric uncertainty comes from genuine task ambiguity. No amount of extra sampling resolves it. Epistemic uncertainty comes from disagreement between judges, which does shrink as you call more or stronger judges. The right action depends on which one dominates.

trust the score

aleatoric: low · epistemic: low

All judges agree, each is self-consistent. Score is reliable. Move on.

call more judges

route

aleatoric: low · epistemic: high

Judges disagree, but each is self-consistent. Add a third judge or escalate to a stronger one.

sample again

resample

aleatoric: high · epistemic: low

Judges agree on average but each is internally noisy. More temperature samples will tighten the CI on each judge's mean, even though the underlying ambiguity stays.

flag the item

skip / surface

aleatoric: high · epistemic: high

Judges disagree AND each is noisy with itself. The task itself may be ambiguous. Surface, don't average.
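The four cases above can be read as a lookup on two flags. The sketch below assumes a single cutoff separates "low" from "high" for both components. The cutoff value and the function name are hypothetical, and the page does not say how PANOPTES draws that line.

```python
def recommend_action(aleatoric, epistemic, threshold=0.01):
    """Map the two variance components to one of the four actions.

    `threshold` is an assumed cutoff between "low" and "high" variance.
    """
    high_aleatoric = aleatoric > threshold
    high_epistemic = epistemic > threshold
    if high_aleatoric and high_epistemic:
        return "flag the item"
    if high_epistemic:
        return "call more judges"
    if high_aleatoric:
        return "sample again"
    return "trust the score"


print(recommend_action(aleatoric=0.002, epistemic=0.040))
```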

PANOPTES estimates this split through nested resampling. The outer bootstrap is over judges (that captures epistemic), the inner bootstrap is over temperature samples within judge (that captures aleatoric). Those numbers feed straight into the routing layer. If epistemic dominates, the bandit calls another judge. If aleatoric dominates, calling more judges won't help, so it stops.
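To make the split concrete, here is a simplified plug-in version based on the law of total variance. It is not the nested bootstrap described above: it takes the average within-judge variance as the aleatoric part and the variance of the judge means as the epistemic part. The function name and the sample scores are assumptions made for illustration.

```python
from statistics import mean, pvariance


def decompose_variance(samples_by_judge):
    """Plug-in split of score variance into aleatoric and epistemic parts.

    aleatoric: average variance of each judge's own temperature samples
    epistemic: variance of the per-judge means (disagreement between judges)
    """
    judge_means = [mean(draws) for draws in samples_by_judge.values()]
    within = [pvariance(draws) for draws in samples_by_judge.values()]
    aleatoric = mean(within)
    epistemic = pvariance(judge_means)
    total = aleatoric + epistemic
    share = epistemic / total if total > 0 else 0.0
    return {
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "total": total,
        "epistemic_share": share,
    }


# Hypothetical scores: two judges, two temperature samples each.
example = {"judge_a": [0.2, 0.4], "judge_b": [0.6, 0.8]}
print(decompose_variance(example))
```

When every judge contributes the same number of samples, the two parts add up exactly to the variance of the pooled scores, which makes the epistemic share easy to interpret.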

problem 3 · no guarantees

What does a "90% confidence interval" on an LLM score even mean?

Most CI machinery assumes Gaussian residuals or large-sample asymptotics. Neither holds for an LLM judge score, which is bounded in [0, 1], heavily multimodal, and shaped by training data we don't get to see. PANOPTES sidesteps the whole problem with conformal prediction: a calibration recipe that guarantees the prediction interval contains the true value with probability at least 1 − α. The guarantee is marginal (averaged over calibration sets and test items) and requires nothing more than exchangeability between the calibration set and the test item. No parametric model. No Gaussian assumption. Finite-sample valid.

a typical "± 2σ" CI

Assumes the underlying distribution is Gaussian. Mostly meaningless for an LLM-judge score that's bounded, multimodal, and shaped by alignment training. Coverage is a wish, not a guarantee.

conformal interval

Given n conformity scores from a held-out calibration set, their ⌈(n+1)(1−α)⌉/n-th empirical quantile gives a finite-sample-valid interval. No distributional assumption. We verify the guarantee empirically on /calibration.
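The quantile rule is short enough to write out. This is a minimal sketch of split conformal prediction, assuming the conformity score is the absolute residual |true score − predicted score| on the calibration set. The function names and the clipping to [0, 1] are illustrative choices, not the PANOPTES API.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest conformity score.

    If the calibration set is too small for the requested alpha, the only
    valid answer is an infinitely wide interval.
    """
    n = len(scores)
    # round() guards against float noise such as 9.000000000000002.
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def conformal_interval(prediction, calibration_residuals, alpha=0.10):
    """Symmetric interval around `prediction`, clipped to the [0, 1] scale."""
    q = conformal_quantile(calibration_residuals, alpha)
    return max(0.0, prediction - q), min(1.0, prediction + q)


# Hypothetical calibration residuals.
residuals = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
print(conformal_interval(0.8, residuals, alpha=0.25))
```

With too few calibration points for the requested α, the quantile is infinite and the interval falls back to the whole [0, 1] range, which is the honest answer rather than a bug.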

how PANOPTES addresses all three

A stack of statistical methods, each paper-grounded

The framework composes six layers. The bottom layer talks to LLM providers. Everything above it is statistics. Every layer cites the paper its math comes from.

Routing

Thompson-sampling bandit decides which judges are worth calling per item.

Russo & Van Roy 2018

Decomposition

Total variance split into aleatoric (irreducible) and epistemic (reducible).

Kendall & Gal 2017

Conformal prediction

Split, adaptive (CQR), and Mondrian. Finite-sample 1−α coverage guarantees.

Vovk/Gammerman/Shafer 2005 · Romano/Patterson/Candès 2019

Aggregation

Hierarchical-Gaussian EM combines noisy judges into a posterior over latent quality.

Dawid & Skene 1979

Sampling-UQ

Semantic entropy + Bayesian-bootstrap self-consistency at temperature 1.

Farquhar et al. Nature 2024 · Rubin 1981

Heterogeneous jury

Anthropic, OpenAI, Google judges, all behind one provider-agnostic Protocol.

PANOPTES infra
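As an illustration of the aggregation layer, the sketch below shows the conjugate Gaussian update that sits inside a hierarchical-Gaussian model once each judge's bias and precision are known. It assumes judge j reports latent quality plus a bias b_j plus Gaussian noise with precision τ_j. The EM loop that would learn those parameters is omitted, and all names and numbers are hypothetical.

```python
def posterior_quality(scores, bias, precision, prior_mean=0.5, prior_precision=1.0):
    """Posterior mean and variance of latent quality given judge scores.

    Assumed model: score_j = quality + bias_j + noise_j,
    with noise_j ~ Normal(0, 1 / precision_j)
    and quality ~ Normal(prior_mean, 1 / prior_precision).
    """
    post_precision = prior_precision + sum(precision[j] for j in scores)
    weighted_sum = prior_precision * prior_mean + sum(
        precision[j] * (scores[j] - bias[j]) for j in scores
    )
    return weighted_sum / post_precision, 1.0 / post_precision


# Hypothetical inputs: judge_a runs 0.1 high, both judges are equally precise.
mean_q, var_q = posterior_quality(
    scores={"judge_a": 0.8, "judge_b": 0.6},
    bias={"judge_a": 0.1, "judge_b": 0.0},
    precision={"judge_a": 4.0, "judge_b": 4.0},
    prior_mean=0.5,
    prior_precision=2.0,
)
print(mean_q, var_q)
```

Each judge pulls the estimate in proportion to its precision, after its bias is subtracted, so a noisy or systematically generous judge counts for less.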

at runtime, one item at a time

Five stages, every evaluation

When you run panoptes eval humaneval, every item flows through these five stages. Stages 1 through 3 produce the raw signal. Stage 4 turns that signal into a calibrated posterior. Stage 5 closes the loop by deciding what to do on the next item.

Task + candidate

An LLM-generated answer to a task. The thing we want to evaluate.

Heterogeneous jury

Multiple judges (Anthropic, OpenAI, Google) score it on the [0, 1] scale via tool-use structured output.

Sampling pass

Each judge is sampled k times at temperature 1, giving us a distribution rather than a point.

Decompose + calibrate

Aleatoric vs epistemic split. Conformal-prediction interval with finite-sample coverage at 1 − α.

Smart routing

Thompson-sampling bandit learns which judges give the most information per dollar, and stops calling judges once further calls won't help.
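A Thompson-sampling router for the last stage can be sketched as a Beta-Bernoulli bandit. The assumptions here are loud ones: the reward is a hypothetical yes/no signal for whether a judge call turned out to be informative, cost per call is ignored, and no stopping rule is included. The class and function names are illustrative, not the PANOPTES API.

```python
import random


class JudgeBandit:
    """Beta-Bernoulli Thompson sampling over a fixed set of judges."""

    def __init__(self, judges, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: every judge starts out equally plausible.
        self.successes = {judge: 1 for judge in judges}
        self.failures = {judge: 1 for judge in judges}

    def choose(self):
        """Draw one sample per judge from its posterior and pick the best."""
        draws = {
            judge: self.rng.betavariate(self.successes[judge], self.failures[judge])
            for judge in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, judge, informative):
        """Record whether the call was informative (assumed binary reward)."""
        if informative:
            self.successes[judge] += 1
        else:
            self.failures[judge] += 1


def simulate(true_rates, rounds=500, seed=0):
    """Run the bandit against hypothetical per-judge success rates."""
    bandit = JudgeBandit(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    picks = {judge: 0 for judge in true_rates}
    for _ in range(rounds):
        judge = bandit.choose()
        picks[judge] += 1
        bandit.update(judge, env.random() < true_rates[judge])
    return picks


print(simulate({"judge_a": 0.9, "judge_b": 0.2}))
```

Over the rounds, calls concentrate on the judge that pays off more often, while the posterior draws keep a small amount of exploration alive for the other one.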

who actually needs this

Use cases and potential impact

Model providers running safety and capability evals at scale

Anthropic, OpenAI, Google all run thousands of LLM-judge evals per release. PANOPTES tells them which judgments to trust and which to escalate, instead of averaging through the noise.

Benchmark authors who need to defend their numbers

If your paper says 'model X beats model Y by 4.2 points,' PANOPTES gives you an honest CI on that gap plus a calibrated p-value, instead of a brittle point estimate.

Eng teams shipping LLM-graded user-facing pipelines

Quality control, content moderation, claim-verification. Anywhere an LLM judges another LLM's output and the result matters. Knowing when the judge isn't sure is the whole game.

Researchers studying judge bias or alignment

The hierarchical-Gaussian aggregator exposes per-judge bias and precision as first-class outputs. PANOPTES lets you audit judge behavior, not just consume it.

where to go from here

The rest of the site is the empirical receipts.

Every claim on this page is backed by data on the deeper pages. The calibration page measures whether the conformal guarantee actually holds. The judges page shows real inter-judge agreement. The runs and items pages show the framework in action on real benchmark data. The methods page lists every paper.

see the calibration result · browse runs · paper citations