Uncertainty-aware LLM evaluation
When you grade an LLM with another LLM, you get a single number. That number is noisier than it looks. PANOPTES treats the score as a posterior distribution instead, splits the noise into the part that's fixable and the part that isn't, and wraps the whole thing in a finite-sample coverage guarantee.
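One common way to make that kind of split, not necessarily the one PANOPTES uses, is the law of total variance over repeated judge calls. The sketch below assumes that disagreement between judges is the fixable part (you can add or recalibrate judges) and that repeat-to-repeat scatter within a single judge is the part that isn't. The `decompose` helper and the scores are hypothetical.

```python
from statistics import mean, pvariance

def decompose(scores_by_judge):
    """Split score variance for one item into within-judge and between-judge parts.

    scores_by_judge: one list of repeated scores per judge (hypothetical layout).
    """
    judge_means = [mean(scores) for scores in scores_by_judge]
    # Average scatter of each judge around its own mean.
    within = mean(pvariance(scores) for scores in scores_by_judge)
    # Scatter of the judges' mean scores around the overall mean.
    between = pvariance(judge_means)
    return within, between

# Made-up scores: three judges, three repeated calls each, same item.
scores = [
    [0.8, 0.9, 0.7],
    [0.5, 0.6, 0.4],
    [0.9, 0.9, 0.9],
]
within, between = decompose(scores)
pooled = [s for judge in scores for s in judge]
print(f"within={within:.4f} between={between:.4f} total={pvariance(pooled):.4f}")
```

With equal numbers of repeats per judge, the two parts sum to the variance of the pooled scores, which is a quick sanity check on the split.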
The framework claims that conformal-prediction intervals on judge scores carry real frequentist coverage. We verified that by obfuscating HumanEval so judges can't pattern-match memorized solutions, generating candidates with gpt-4o-mini, grading them in a sandboxed Python executor for ground truth, and measuring how often a 90% interval contains the truth on a held-out set.
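The coverage check in the last step can be sketched with plain split conformal prediction. This is a minimal illustration on synthetic data, not the PANOPTES implementation: the helper names, the noise model, and the use of absolute residuals as the nonconformity score are all assumptions.

```python
import math
import random

def conformal_halfwidth(cal_residuals, alpha=0.1):
    """Split-conformal interval half-width from absolute calibration residuals."""
    n = len(cal_residuals)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest residual.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_residuals)[k - 1]

def empirical_coverage(judge_scores, truths, halfwidth):
    """Fraction of items whose truth falls inside score +/- halfwidth."""
    hits = sum(abs(s - t) <= halfwidth for s, t in zip(judge_scores, truths))
    return hits / len(truths)

# Synthetic stand-in for the real pipeline: truth is pass/fail from an executor,
# the judge score is the truth plus Gaussian noise (an assumed noise model).
rng = random.Random(0)

def make_split(n):
    truths = [float(rng.random() < 0.6) for _ in range(n)]
    scores = [t + rng.gauss(0, 0.2) for t in truths]
    return scores, truths

cal_scores, cal_truths = make_split(500)
test_scores, test_truths = make_split(2000)

residuals = [abs(s - t) for s, t in zip(cal_scores, cal_truths)]
q = conformal_halfwidth(residuals, alpha=0.1)
coverage = empirical_coverage(test_scores, test_truths, q)
print(f"half-width={q:.3f} held-out coverage={coverage:.3f}")
```

If calibration and held-out items are exchangeable, the held-out coverage should land near the 90% target, with some run-to-run wobble from the finite calibration set.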
Bandit vs. all-judges
Each dot is one run. The y-axis is cost per item. The x-axis is the size of the judge pool the run had access to. Calling every judge on every item lands you in the upper-right corner. The Thompson-sampling bandit aims for the lower-right: more judges available, but smarter decisions about which to actually call, so cost per item stays flat or drops.
Dot size encodes the number of items in the run.
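To see why a bandit can keep cost per item flat as the pool grows, here is a minimal Beta-Bernoulli Thompson-sampling sketch. It assumes one judge call per item and a binary reward (the judge agreed with ground truth). The accuracies, costs, and function names are hypothetical, and the real PANOPTES policy may differ.

```python
import random

def thompson_pick(successes, failures, rng):
    """Draw from each judge's Beta posterior and return the index of the best draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def run(n_items, accuracies, costs, seed=0):
    """Simulate calling one bandit-chosen judge per item; return calls and mean cost."""
    rng = random.Random(seed)
    k = len(accuracies)
    successes, failures, calls = [0] * k, [0] * k, [0] * k
    spent = 0.0
    for _ in range(n_items):
        j = thompson_pick(successes, failures, rng)
        calls[j] += 1
        spent += costs[j]
        # Simulated reward: did judge j agree with ground truth on this item?
        if rng.random() < accuracies[j]:
            successes[j] += 1
        else:
            failures[j] += 1
    return calls, spent / n_items

# Hypothetical pool of four judges with made-up accuracies and per-call costs.
accuracies = [0.60, 0.70, 0.90, 0.65]
costs = [0.5, 1.0, 2.0, 0.8]

calls, bandit_cost = run(2000, accuracies, costs)
all_judges_cost = sum(costs)  # cost per item when every judge sees every item
print(f"calls per judge: {calls}")
print(f"bandit cost/item={bandit_cost:.2f} all-judges cost/item={all_judges_cost:.2f}")
```

The all-judges cost grows with every judge added to the pool, while the bandit's cost per item is bounded by the price of the single judge it calls. This toy version optimizes accuracy only; a cost-aware reward would be a natural variation.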
Why this tradeoff matters
Each card is one panoptes eval invocation. Click in for the cost breakdown, the score distribution, and the item-by-item dashboard.