PANOPTES

Uncertainty-aware LLM evaluation

When you grade an LLM with another LLM, you get a single number. That number is noisier than it looks. PANOPTES treats the score as a posterior distribution instead, splits the noise into the part that's fixable and the part that isn't, and wraps the whole thing in a finite-sample coverage guarantee.
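One way to picture the split is the law of total variance. The sketch below is illustrative only and is not taken from the PANOPTES code: it assumes each item is scored several times by each of several judges, treats disagreement between judges as the part more or better judges could reduce, and treats each judge's run-to-run scatter as the part that stays. The function name `decompose_variance` and the mapping of "fixable" to between-judge variance are assumptions.

```python
import numpy as np


def decompose_variance(scores):
    """Split score variance for one item into between-judge and within-judge parts.

    scores: array-like of shape (n_judges, n_repeats), hypothetical layout.
    Uses population variances so that total == between + within exactly
    when every judge has the same number of repeats.
    """
    scores = np.asarray(scores, dtype=float)
    judge_means = scores.mean(axis=1)
    between = judge_means.var()         # disagreement between judges
    within = scores.var(axis=1).mean()  # average run-to-run noise of a judge
    total = scores.var()                # variance over all scores pooled
    return {"between": float(between), "within": float(within), "total": float(total)}


if __name__ == "__main__":
    # Two hypothetical judges, three repeated scores each.
    parts = decompose_variance([[0.8, 0.9, 0.7], [0.4, 0.5, 0.3]])
    print(parts)
```

In this toy input most of the variance sits in the between-judge term, which is the situation where consulting additional judges would be expected to help.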

headline result
split conformal · held-out calibration probe · obfuscated HumanEval
92% empirical coverage at 90% nominal
2.0pp above nominal. Coverage of at least 90% is guaranteed in finite samples under exchangeability.

The framework claims that conformal-prediction intervals on judge scores carry real frequentist coverage. We verified that by obfuscating HumanEval so judges can't pattern-match memorized solutions, generating candidates with gpt-4o-mini, grading them in a sandboxed Python executor for ground truth, and measuring how often a 90% interval contains the truth on a held-out set.
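The check described above can be sketched with textbook split conformal prediction. This is a minimal illustration, not the PANOPTES implementation: it assumes the nonconformity score is the absolute gap between the judge score and the executor's ground truth, and the names `conformal_halfwidth` and `empirical_coverage` are hypothetical.

```python
import math

import numpy as np


def conformal_halfwidth(cal_pred, cal_truth, alpha=0.1):
    """Interval half-width from a calibration set (split conformal).

    Takes the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.
    If the calibration set is too small for the requested alpha, the only
    interval with the guarantee is the trivial one, so return infinity.
    """
    residuals = np.sort(np.abs(np.asarray(cal_truth, float) - np.asarray(cal_pred, float)))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return float(residuals[k - 1])


def empirical_coverage(test_pred, test_truth, q):
    """Fraction of held-out items whose truth lies in [pred - q, pred + q]."""
    test_pred = np.asarray(test_pred, float)
    test_truth = np.asarray(test_truth, float)
    return float((np.abs(test_truth - test_pred) <= q).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in data: judge scores are truth plus noise.
    truth = rng.uniform(0, 1, size=200)
    pred = truth + rng.normal(0, 0.1, size=200)
    q = conformal_halfwidth(pred[:100], truth[:100], alpha=0.1)
    print("half-width:", q)
    print("coverage on held-out half:", empirical_coverage(pred[100:], truth[100:], q))
```

Under exchangeability of calibration and test items, intervals built this way cover the truth with probability at least 1 - alpha on average over draws of the data; the empirical coverage on any one small test set will fluctuate around that level.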

claude-sonnet · α = 0.1 · n_test = 25 · full coverage table and reliability diagram
at a glance
evaluation runs: 6 (115 items judged total)
LLM judge calls: 1,640
total spend: $8.00
distinct judges: 4 (claude-haiku, claude-sonnet, gpt-4o-mini, gpt-4o)
routing tradeoff

Bandit vs. all-judges

Each dot is one run. The y-axis is cost per item. The x-axis is the size of the judge pool the run had access to. Calling every judge on every item lands you in the upper-right corner. The Thompson-sampling bandit aims for the lower-right: more judges available, but smarter decisions about which to actually call, so cost per item stays flat or drops.

Dot size encodes the number of items in the run.
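The routing idea can be sketched with a plain Beta-Bernoulli Thompson sampler. This is an assumption-laden illustration rather than the PANOPTES router: it calls exactly one judge per item, uses a binary reward (for example, whether the judge's verdict matched ground truth), and ignores per-judge price. The class `ThompsonJudgeRouter`, the judge names, and the accuracies in `simulate` are all hypothetical.

```python
import random


class ThompsonJudgeRouter:
    """Pick one judge per item by sampling from each judge's Beta posterior."""

    def __init__(self, judges, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per judge, stored as [alpha, beta].
        self.params = {judge: [1.0, 1.0] for judge in judges}

    def choose(self):
        # Draw a plausible success rate for every judge, call the best draw.
        draws = {j: self.rng.betavariate(a, b) for j, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, judge, reward):
        # reward is truthy when the judge call turned out to be useful.
        if reward:
            self.params[judge][0] += 1.0
        else:
            self.params[judge][1] += 1.0


def simulate(true_accuracy, n_items=500, seed=0):
    """Route n_items through the bandit; return how often each judge was called."""
    router = ThompsonJudgeRouter(true_accuracy, seed=seed)
    env = random.Random(seed + 1)
    calls = {judge: 0 for judge in true_accuracy}
    for _ in range(n_items):
        judge = router.choose()
        calls[judge] += 1
        router.update(judge, env.random() < true_accuracy[judge])
    return calls


if __name__ == "__main__":
    # Made-up accuracies, used only to drive the simulation.
    print(simulate({"judge_a": 0.9, "judge_b": 0.3, "judge_c": 0.3}))
```

Because only one judge is called per item, the number of calls stays equal to the number of items however large the pool is, which is the flat cost-per-item behaviour the chart is looking for. A cost-aware variant would weight each draw by the judge's price.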

why this tradeoff matters
runs you can drill into

Each card is one panoptes eval invocation. Click in for the cost breakdown, the score distribution, and the item-by-item dashboard.

what to look at next