What we built, who it's for, what comes next.
PANOPTES is a Python framework for evaluating LLMs that treats every score as a statistical inference problem. The receipts are on the other pages. This page is the closing slide: a recap of the surface area, the four kinds of people who actually need this, and the work that's still on the table.
Eight statistical methods. Four providers. One async Python framework.
The framework composes existing research into a single tool that anyone running LLM-graded evals can drop in. Every method cites the paper it comes from. Every number on this page is measured, not estimated.
Who actually needs this
The framework is built for any workflow where the answer to "how good was that LLM output?" matters enough to need a confidence interval, not just a number. Four kinds of users were on my mind while building it.
Frontier labs run thousands of LLM-judge evals per release. PANOPTES tells them which judgments to trust, which to escalate to a stronger judge, and which to flag as inherently ambiguous. That cuts spend at a fixed quality bar by skipping the judge calls that wouldn't have moved the posterior anyway.
If your paper says model X beats model Y by 4.2 points, a reviewer is going to ask how big the noise floor is. PANOPTES gives you an honest CI on the gap, a paired-bootstrap rank correlation against the held-out set, and a permutation p-value for whether the difference is real.
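To make the "CI on the gap" and "permutation p-value" concrete, here is a minimal stdlib sketch for paired per-item scores. It is an illustration of the two techniques, not PANOPTES's own code: the function name `paired_gap_stats` and its parameters are hypothetical, the interval is a plain percentile bootstrap, and the rank-correlation piece is left out.

```python
import random
from statistics import mean


def paired_gap_stats(scores_x, scores_y, n_resamples=2000, alpha=0.05, seed=0):
    """Return (gap, (ci_lo, ci_hi), p_value) for paired per-item scores.

    Hypothetical helper for illustration; not part of the PANOPTES API.
    """
    if len(scores_x) != len(scores_y) or not scores_x:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    gap = mean(diffs)

    # Paired bootstrap: resample items with replacement, recompute the mean gap,
    # and read the interval off the percentiles of the resampled gaps.
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]

    # Sign-flip permutation test: if the two models were interchangeable,
    # the sign of each paired difference would be arbitrary.
    hits = 0
    for _ in range(n_resamples):
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(gap):
            hits += 1
    p_value = (hits + 1) / (n_resamples + 1)  # add-one keeps p strictly positive
    return gap, (lo, hi), p_value
```

Pairing matters here: both models are scored on the same items, so resampling and sign-flipping operate on per-item differences rather than on two independent score lists.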
If your product has an LLM grading another LLM's output and the result is shown to a user, the cost of 'looks confident, actually wrong' is high. PANOPTES surfaces the cases where the judge isn't sure so you can hand them off to a human or fall back to a stricter rule.
The hierarchical-Gaussian aggregator exposes per-judge bias and precision as first-class outputs. You can audit which judges are running hot vs cold, which are noisier than others, and how disagreement structure shifts across task families. Semantic entropy gives you a hallucination signal grounded in Farquhar et al. (2024).
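For the semantic-entropy signal, the last step is easy to sketch. The snippet below assumes the sampled answers have already been grouped into meaning clusters (in the paper that grouping comes from an entailment check, which is not shown here) and computes the entropy of the cluster frequencies. The function name is hypothetical and this is the simple frequency-based variant, not necessarily what PANOPTES ships.

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids: one cluster label per sampled answer, e.g. [0, 0, 1, 0].
    The clustering itself is assumed to have been done upstream.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

If every sample lands in one cluster the entropy is zero (the model keeps saying the same thing); samples spread evenly over many clusters push it toward the log of the cluster count, which is the pattern the hallucination signal keys on.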
Why this matters beyond one project
If an eval framework reports 'model X scored 0.85,' that 0.85 should be reproducible from primary sources. PANOPTES keeps every judge call, every rationale, and every prompt hash in DuckDB. The same query reproduces the same number.
A finite-sample-valid CI is a much stronger claim than a 'looks roughly right' point estimate. Once teams habituate to expecting intervals, the threshold for over-claiming on a benchmark goes up.
The bandit routing saves cost on items where the cheap judges already agree. Calling 3 frontier LLMs per item is fine at n=100, painful at n=100,000. Smart routing makes large-n evals economically viable.
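The cost argument can be illustrated with a much simpler rule than the framework's bandit. The sketch below is an agreement-threshold gate, swapped in for illustration: it is not the Thompson-sampling router, and `Judge`, `route_item`, and `max_spread` are hypothetical names. It only pays for the strong judge when the cheap judges disagree.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical judge type: takes an item, returns a score in [0, 1].
Judge = Callable[[str], float]


def route_item(
    item: str,
    cheap_judges: Sequence[Judge],
    strong_judge: Judge,
    max_spread: float = 0.1,
) -> Tuple[float, bool]:
    """Return (score, escalated).

    If the cheap judges agree to within max_spread, average them and stop.
    Otherwise escalate to the strong judge and use its score.
    """
    cheap_scores = [judge(item) for judge in cheap_judges]
    if max(cheap_scores) - min(cheap_scores) <= max_spread:
        return mean(cheap_scores), False
    return strong_judge(item), True
```

At n=100,000 the saving is the fraction of items that never escalate times the price of a strong-judge call, which is why the agreement rate on the cheap tier is the number to measure first.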
What's next
The framework has reached its v1 surface area. What's left is largely measurement at scale: bigger calibration sets, more benchmarks, harder candidate distributions. The next round of work also rounds out the aggregator stack for Likert-scale rubrics and hardens the code-execution sandbox.
The current probe has 25 items in the held-out test set. That's enough to demonstrate the framework, but the 2-percentage-point gap at α=0.10 has a ±6pp standard error. Scaling to 200+ items tightens the SE to ~2pp, which would let me make stronger claims about calibration quality.
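The ±6pp and ~2pp figures are consistent with a plain binomial standard error on a coverage rate of 0.90. That reading is my assumption about where the numbers come from, and the helper name below is hypothetical, but the arithmetic is quick to check:

```python
import math


def binomial_se(rate: float, n: int) -> float:
    """Standard error of an observed proportion: sqrt(p * (1 - p) / n)."""
    if n <= 0 or not 0.0 <= rate <= 1.0:
        raise ValueError("need n > 0 and a rate in [0, 1]")
    return math.sqrt(rate * (1.0 - rate) / n)


# Nominal coverage at alpha = 0.10 is 0.90.
se_25 = binomial_se(0.90, 25)    # about 0.06, i.e. 6 percentage points
se_200 = binomial_se(0.90, 200)  # about 0.021, i.e. roughly 2 percentage points
```

The SE shrinks with the square root of n, so going from 25 to 200 items (8x) only buys a factor of about 2.8.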
Continuous [0, 1] scores use the hierarchical-Gaussian aggregator. Likert 1–5 scores currently get normalized to [0, 1] and treated as continuous, which loses the ordinal structure. MACE-style ordinal Dawid-Skene (Hovy et al. NAACL 2013) is the right tool. The math is straightforward; it just hasn't been wired in yet. It belongs in src/panoptes/uq/disagreement.py as a sibling class.
The current sandbox uses subprocess + resource.setrlimit. That's safe enough for grading canonical solutions, but I wouldn't run untrusted user-submitted code through it. A Docker backend behind the existing Sandbox Protocol gives proper isolation.
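For readers who haven't used the subprocess + resource.setrlimit pattern, a minimal Unix-only sketch looks like this. It is not the PANOPTES sandbox: `run_limited` and its default limits are hypothetical, and it shows why the approach is "safe enough" rather than isolated, since it caps CPU and memory but does nothing about filesystem or network access.

```python
import resource
import subprocess
import sys


def _cap(kind: int, value: int) -> None:
    """Lower both soft and hard limits, never exceeding the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(
    code: str,
    cpu_seconds: int = 2,
    memory_bytes: int = 512 * 1024 * 1024,
    wall_seconds: float = 5.0,
) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process with CPU and address-space caps."""

    def apply_limits() -> None:
        # Runs in the child just before exec, so the parent is unaffected.
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site/env
        capture_output=True,
        text=True,
        timeout=wall_seconds,  # wall-clock backstop; raises TimeoutExpired
        preexec_fn=apply_limits,
    )
```

A runaway loop is killed by the CPU limit and shows up as a non-zero return code, while a sleeping process is only caught by the wall-clock timeout. Neither stops the child from reading files or opening sockets, which is the gap a container backend would close.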
Benchmark loaders exist for all five. Only HumanEval and the calibration probe are currently wired through the CLI. The blocker is that each benchmark needs a benchmark-specific rubric prompt and a candidate-generation step; both are mechanical.
The whole framing is 'finite-sample guarantees on LLM eval.' That claim is only credible with published, replicable numbers. A short technical writeup with the calibration table, the bandit-vs-all-judges cost comparison, and the methodology is the right vehicle for that.
A lot of teams already have a Promptfoo or Inspect pipeline. PANOPTES doesn't need to replace those; it can sit on top, taking the (item, response, judge_score) records they produce and emitting the UQ + conformal layer on top. Shipping shims for the major frameworks lowers the adoption cost dramatically.
A few honest takeaways
On the same item at temperature 0, three frontier judges routinely disagree by 0.2 on a [0,1] scale. That's a much bigger signal than I assumed going in. The case for treating LLM-as-judge as a statistical problem isn't theoretical; it's empirically obvious the moment you call more than one judge.
I'd expected the conformal coverage guarantee to fail on real LLM-judge data because exchangeability is a strong assumption. Empirically the coverage tracks nominal almost exactly at α=0.1. The theorem is unreasonably effective here.
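For reference, the guarantee in question is the split-conformal one. Below is a minimal sketch, assuming absolute residuals between judge scores and ground truth as the nonconformity score; the framework's actual score function may differ, and both function names are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration residual.

    Returns +inf when the calibration set is too small for the requested alpha.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def conformal_interval(point_score, calib_judge, calib_truth, alpha=0.1):
    """Interval around a new judge score, clipped to the [0, 1] score range."""
    residuals = [abs(j - t) for j, t in zip(calib_judge, calib_truth)]
    q = conformal_quantile(residuals, alpha)
    return max(0.0, point_score - q), min(1.0, point_score + q)
```

If calibration and test items are exchangeable, the unclipped interval contains the true score with probability at least 1 - alpha, with no distributional assumption on the judge. Exchangeability is the part I expected to break.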
The Thompson-sampling bandit ran in only one production setting, with low n. I believe the cost-reduction claim is real but I'd want a 200-item, multi-strategy A/B before publishing the number.
The whole framework, end to end
Source code, citations, and the calibration script are all on GitHub. The framework is MIT-licensed and intentionally minimal in its public surface. If you want to swap in your own judges, your own benchmark, your own routing strategy: implement the corresponding Protocol class and the rest is free.
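As a rough picture of what "implement the corresponding Protocol class" means, here is a sketch using typing.Protocol. The `Judge` protocol, its `score` method, and `KeywordJudge` are all hypothetical stand-ins; check the repository for the real interface names and signatures.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class Judge(Protocol):
    """Hypothetical judge interface: anything with a matching async score()."""

    async def score(self, prompt: str, response: str) -> float:
        ...


class KeywordJudge:
    """Toy judge: 1.0 if the response mentions a required keyword, else 0.0."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword.lower()

    async def score(self, prompt: str, response: str) -> float:
        return 1.0 if self.keyword in response.lower() else 0.0


async def grade(judge: Judge, prompt: str, response: str) -> float:
    # Structural typing: KeywordJudge never inherits from Judge.
    return await judge.score(prompt, response)


result = asyncio.run(grade(KeywordJudge("paris"), "Capital of France?", "Paris."))
```

The point of a Protocol over a base class is that your judge doesn't need to import anything from the framework to be compatible; it only has to have the right method shape.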