PANOPTES

Uncertainty-aware LLM evaluation

When you grade an LLM with another LLM, you get a single number. That number is noisier than it looks. PANOPTES treats the score as a posterior distribution instead, splits the noise into the part that's fixable and the part that isn't, and wraps the whole thing in a finite-sample coverage guarantee.
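To make "the part that's fixable and the part that isn't" concrete, here is a minimal sketch of one common decomposition for a pass/fail score, using the law of total variance. It assumes you have several pass probabilities for the same item (from different judges or repeated samples); the function name `decompose` and the inputs are illustrative, not PANOPTES's actual API, and the framework's own decomposition may differ.

```python
from statistics import fmean, pvariance

def decompose(pass_probs):
    """Split the variance of a pass/fail outcome into two parts.

    pass_probs: one pass probability per judge (or per sample) for one item.
    Returns (aleatoric, epistemic, total), where total == aleatoric + epistemic.
    """
    probs = list(pass_probs)
    mean_p = fmean(probs)
    # Irreducible part: average Bernoulli noise that remains even if every
    # judge's probability is taken at face value.
    aleatoric = fmean([p * (1 - p) for p in probs])
    # Reducible part: disagreement between judges about the probability itself.
    epistemic = pvariance(probs)
    # Variance of a Bernoulli outcome drawn at the averaged probability.
    total = mean_p * (1 - mean_p)
    return aleatoric, epistemic, total

# Judges agree the item is a coin flip: all of the noise is irreducible.
print(decompose([0.5, 0.5, 0.5]))
# Judges are individually certain but contradict each other: all of it is fixable.
print(decompose([0.0, 1.0]))
```

In this framing, more judge calls can shrink the epistemic term, while the aleatoric term stays no matter how many judges you add.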

start with the background

see the calibration result

browse runs

headline result

split conformal · held-out calibration probe · obfuscated HumanEval

92% empirical coverage at 90% nominal

2.0pp above nominal. Coverage of at least 90% is guaranteed in finite samples under exchangeability.

The framework claims that conformal-prediction intervals on judge scores carry real frequentist coverage. We verified that by obfuscating HumanEval so judges can't pattern-match memorized solutions, generating candidates with gpt-4o-mini, grading them in a sandboxed Python executor for ground truth, and measuring how often a 90% interval contains the truth on a held-out set.

claude-sonnet · α = 0.1 · n_test = 25 · full coverage table and reliability diagram
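The coverage check described above can be sketched in a few lines. This is a minimal split-conformal example on synthetic data, assuming absolute-residual nonconformity scores and symmetric intervals around the judge score; the helper names (`conformal_quantile`, `empirical_coverage`) and the simulated judge noise are illustrative stand-ins, not the PANOPTES implementation or its data.

```python
import math
import random

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected quantile of calibration residuals."""
    n = len(residuals)
    # Taking the ceil((n + 1) * (1 - alpha))-th smallest residual is what
    # yields marginal coverage >= 1 - alpha under exchangeability.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]

def empirical_coverage(intervals, truths):
    """Fraction of intervals that contain the corresponding true value."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)

# Synthetic stand-in: ground truth in [0, 1], judge score = truth + noise.
rng = random.Random(0)

def make_pair():
    truth = rng.random()
    judge = truth + rng.gauss(0, 0.1)
    return judge, truth

calib = [make_pair() for _ in range(200)]  # calibration split
test = [make_pair() for _ in range(25)]    # held-out split, n_test = 25

alpha = 0.1
q = conformal_quantile([abs(j - t) for j, t in calib], alpha)
intervals = [(j - q, j + q) for j, _ in test]
coverage = empirical_coverage(intervals, [t for _, t in test])
print(f"half-width q = {q:.3f}, empirical coverage = {coverage:.0%}")
```

With only 25 held-out items, each item moves empirical coverage by 4 percentage points, so small gaps from nominal are expected even when the guarantee holds.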

at a glance

evaluation runs

115 items judged total

LLM judge calls

1,640

total spend

$8.00

distinct judges

claude-haiku, claude-sonnet, gpt-4o-mini, gpt-4o

routing tradeoff

Bandit vs. all-judges

Each dot is one run. The y-axis is cost per item. The x-axis is the size of the judge pool the run had access to. Calling every judge on every item lands you in the upper-right corner. The Thompson-sampling bandit aims for the lower-right: more judges available, but smarter decisions about which to actually call, so cost per item stays flat or drops.

Dot size encodes the number of items in the run.
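As a rough illustration of how a Thompson-sampling router can keep cost per item flat as the pool grows, here is a minimal Beta-Bernoulli sketch. Everything in it is an assumption for illustration: the `JudgeBandit` class, the judge names, the per-call costs, the simulated agreement rates, and the cost penalty `cost_weight` are hypothetical and not taken from PANOPTES.

```python
import random

class JudgeBandit:
    """Thompson sampling over judges with a Beta posterior per judge."""

    def __init__(self, costs, cost_weight=0.0, seed=0):
        self.costs = dict(costs)        # hypothetical cost per call, per judge
        self.cost_weight = cost_weight  # how strongly cost is penalized
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure each.
        self.wins = {name: 1 for name in self.costs}
        self.losses = {name: 1 for name in self.costs}

    def choose(self):
        """Sample a plausible agreement rate per judge and pick the best net value."""
        def sampled_value(name):
            p = self.rng.betavariate(self.wins[name], self.losses[name])
            return p - self.cost_weight * self.costs[name]
        return max(self.costs, key=sampled_value)

    def update(self, name, agreed):
        """Record whether the chosen judge agreed with ground truth."""
        if agreed:
            self.wins[name] += 1
        else:
            self.losses[name] += 1

# Hypothetical pool: per-call cost and true agreement rate (unknown to the bandit).
costs = {"cheap": 0.001, "mid": 0.005, "strong": 0.02}
true_agreement = {"cheap": 0.70, "mid": 0.85, "strong": 0.88}

bandit = JudgeBandit(costs, cost_weight=5.0, seed=1)
env = random.Random(2)
calls = {name: 0 for name in costs}
spend = 0.0
n_items = 500

for _ in range(n_items):
    judge = bandit.choose()  # one judge call per item instead of all three
    calls[judge] += 1
    spend += costs[judge]
    bandit.update(judge, env.random() < true_agreement[judge])

all_judges_cost = sum(costs.values())  # cost per item when every judge is called
print(calls)
print(f"bandit cost/item = {spend / n_items:.4f} vs all-judges = {all_judges_cost:.4f}")
```

Because the bandit calls one judge per item, its cost per item can never exceed the most expensive judge's price, whereas the all-judges baseline pays the sum of every price in the pool.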

why this tradeoff matters

all · bandit

runs you can drill into


Each card is one panoptes eval invocation. Click in for the cost breakdown, the score distribution, and the item-by-item dashboard.