summary

What we built, who it's for, what comes next.

PANOPTES is a Python framework for evaluating LLMs that treats every score as a statistical inference problem. The receipts are on the other pages. This page is the closing slide: a recap of the surface area, the four kinds of people who actually need this, and the work that's still on the table.

what we built

Eight statistical methods. Four providers. One async Python framework.

The framework composes existing research into a single tool that anyone running LLM-graded evals can drop in. Every method cites the paper it comes from. Every number on this page is measured, not estimated.

statistical methods

conformal · semantic entropy · self-consistency · DS · decomp · bandit · routing · calibration

LLM providers

Anthropic · OpenAI · Google · OpenAI-compatible

benchmark loaders

HumanEval · MBPP · GSM8K · MT-Bench · TruthfulQA

empirical coverage

92%

at 90% nominal, 2.0pp gap

evaluation runs

115 items judged

LLM judge calls

1,640

total spend

$8.00

judges in the pool

applications

Who actually needs this

The framework is built for any workflow where the answer to "how good was that LLM output?" matters enough to need a confidence interval, not just a number. Four kinds of users were on my mind while building it.

Model providers running safety and capability evals at scale

Anthropic, OpenAI, Google internal eval pipelines

Frontier labs run thousands of LLM-judge evals per release. PANOPTES tells them which judgments to trust, which to escalate to a stronger judge, and which to flag as inherently ambiguous. It cuts spend at a fixed quality bar by skipping the judge calls that wouldn't have moved the posterior anyway.

concretely:

Replace the 'mean over 3 judges' baseline with hierarchical-Gaussian aggregation, save 30–40% of judge calls via the bandit (a figure from one low-n run so far), get calibrated CIs on every claim.
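To make the contrast with a plain mean concrete, here is a minimal sketch of one way a hierarchical-Gaussian aggregator can work. It assumes the model score = item quality + judge bias + judge noise, fits it with a few alternating updates, and centres the biases at zero so they are identifiable. The function name, the variance floor, and the fitting loop are illustrative assumptions, not the PANOPTES implementation.

```python
from statistics import fmean


def aggregate(scores, n_iter=10, var_floor=1e-3):
    """Hypothetical sketch. scores[i][j] is judge j's score for item i in [0, 1].

    Assumed model: scores[i][j] = theta[i] + bias[j] + noise with variance 1/prec[j].
    """
    n_items, n_judges = len(scores), len(scores[0])
    theta = [fmean(row) for row in scores]  # start from the plain mean
    bias = [0.0] * n_judges
    prec = [1.0] * n_judges
    for _ in range(n_iter):
        # Per-judge bias: average deviation from the current item estimates.
        bias = [fmean(scores[i][j] - theta[i] for i in range(n_items))
                for j in range(n_judges)]
        centre = fmean(bias)
        bias = [b - centre for b in bias]  # assumption: biases sum to zero
        # Per-judge precision: inverse residual variance, floored so that a
        # single judge cannot collapse to infinite precision.
        prec = [1.0 / max(var_floor,
                          fmean((scores[i][j] - theta[i] - bias[j]) ** 2
                                for i in range(n_items)))
                for j in range(n_judges)]
        total = sum(prec)
        # Item estimate: precision-weighted mean of bias-corrected scores.
        theta = [sum(prec[j] * (scores[i][j] - bias[j]) for j in range(n_judges)) / total
                 for i in range(n_items)]
    variance = 1.0 / sum(prec)  # variance of each theta[i] under a flat prior
    return theta, bias, prec, variance


# One judge runs 0.1 hot, one runs 0.1 cold, one is unbiased.
truth = [0.2, 0.5, 0.7, 0.9]
scores = [[t + 0.1, t, t - 0.1] for t in truth]
theta, bias, prec, variance = aggregate(scores)
```

The per-judge bias and precision come out as ordinary return values, which is the property that makes them auditable. A full Bayesian treatment would put priors on all three quantities instead of using a variance floor.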

Benchmark authors who need to defend their numbers

anyone publishing 'model X beats Y on benchmark Z'

If your paper says model X beats model Y by 4.2 points, a reviewer is going to ask how big the noise floor is. PANOPTES gives you an honest CI on the gap, a paired-bootstrap rank correlation against the held-out set, and a permutation p-value for whether the difference is real.

concretely:

One CLI invocation produces a coverage table, a reliability diagram, and a methods.md you can drop into the appendix.
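For the 'is the gap real' question, a minimal standard-library sketch of two of the ingredients named above: a paired-bootstrap CI on the mean score gap and a sign-flip permutation p-value. The function names and defaults are hypothetical, and this is a generic version of the technique rather than the code behind the CLI. It does not cover the rank-correlation piece.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(x, y, n_boot=2000, alpha=0.10, seed=0):
    """Percentile CI on mean(x - y), resampling items so pairs stay together."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    means = sorted(fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    return means[lo_idx], means[n_boot - 1 - lo_idx]


def sign_flip_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided paired permutation test.

    Under the null hypothesis the two models are interchangeable on each item,
    so the sign of every per-item difference is a coin flip.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(fmean(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(fmean(flipped)) >= observed:
            hits += 1
    # The +1 terms count the observed labelling, so p is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: model X beats model Y by 0.1 on each of 20 items.
model_y = [0.30 + 0.02 * k for k in range(20)]
model_x = [s + 0.1 for s in model_y]
ci_low, ci_high = paired_bootstrap_ci(model_x, model_y)
p_value = sign_flip_p_value(model_x, model_y)
```

Resampling whole items, not individual scores, is what keeps the comparison paired: both models are always evaluated on the same resampled set.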

Eng teams shipping LLM-graded user-facing pipelines

content moderation, code review, claim verification, document QA

If your product has an LLM grading another LLM's output and the result is shown to a user, the cost of 'looks confident, actually wrong' is high. PANOPTES surfaces the cases where the judge isn't sure so you can hand them off to a human or fall back to a stricter rule.

concretely:

Wrap your existing judge in a Judge Protocol class, get the full UQ stack for free. Items above an epistemic-variance threshold get routed to escalation.
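As a rough illustration of the wrapping step, here is a sketch using typing.Protocol. The Judge interface, the KeywordJudge toy class, the route function, and the 0.05 threshold are all hypothetical names chosen for this example. The real Protocol in the repository may look different, and between-judge variance is used here as a simple stand-in for the epistemic-variance estimate.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance
from typing import Protocol


class Judge(Protocol):
    """Hypothetical interface: anything with a name and a score() method."""

    name: str

    def score(self, item: str, response: str) -> float:
        ...


@dataclass
class KeywordJudge:
    """Toy stand-in for an existing LLM judge; it satisfies Judge structurally."""

    name: str
    keyword: str

    def score(self, item: str, response: str) -> float:
        return 1.0 if self.keyword in response else 0.0


def route(judges: list[Judge], item: str, response: str, threshold: float = 0.05) -> dict:
    """Score with every judge and flag the item when the judges disagree."""
    scores = [judge.score(item, response) for judge in judges]
    spread = pvariance(scores)  # between-judge variance, a proxy for uncertainty
    return {"mean": fmean(scores), "variance": spread, "escalate": spread > threshold}


judges = [KeywordJudge("a", "yes"), KeywordJudge("b", "yes"), KeywordJudge("c", "no")]
agreed = route(judges, "q1", "yes and no")  # every judge returns 1.0
split = route(judges, "q2", "yes")          # scores are [1.0, 1.0, 0.0]
```

Because Protocol classes are structural, an existing judge needs no base class or registration, only a matching method signature.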

Researchers studying judge bias, alignment, or evaluation methodology

anyone writing a paper about LLM judges

The hierarchical-Gaussian aggregator exposes per-judge bias and precision as first-class outputs. You can audit which judges are running hot vs cold, which are noisier than others, and how disagreement structure shifts across task families. Semantic entropy gives you a hallucination signal grounded in Farquhar et al. (2024).

concretely:

duckdb result store + jupyter-friendly queries means every claim in your paper is one SQL query away from the raw judge call that produced it.
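The semantic-entropy signal mentioned above can be sketched in a few lines: sample several answers, group the ones that mean the same thing, and take the entropy of the group frequencies. This is the discrete variant. The same_meaning callable is a hypothetical hook: Farquhar et al. decide equivalence with bidirectional entailment, while the toy version below just compares normalized strings.

```python
import math


def semantic_entropy(answers, same_meaning):
    """Entropy (in nats) over clusters of answers that share a meaning."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            # Compare against the first member as the cluster representative.
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_same_meaning(a, b):
    """Assumption: a string comparison standing in for an entailment model."""
    return a.strip().lower() == b.strip().lower()


consistent = semantic_entropy(["Paris", "paris ", "PARIS"], toy_same_meaning)
scattered = semantic_entropy(["Paris", "Lyon", "paris", "lyon"], toy_same_meaning)
```

Low entropy means the samples agree in meaning even if the wording differs. High entropy means the model is spreading probability over genuinely different answers.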

impact

Why this matters beyond one project

Auditability

If an eval framework reports 'model X scored 0.85,' that 0.85 should be reproducible from primary sources. PANOPTES keeps every judge call, every rationale, and every prompt hash in duckdb. The same query reproduces the same number.

Honesty

A finite-sample-valid CI is a much stronger claim than a 'looks roughly right' point estimate. Once teams get used to expecting intervals, the bar for publishing a benchmark claim goes up and over-claiming gets harder.

Efficiency

The bandit routing saves cost on items where the cheap judges already agree. Calling 3 frontier LLMs per item is fine at n=100, painful at n=100,000. Smart routing makes large-n evals economically viable.

future work

What's next

The framework has reached its v1 surface area. What's left is largely measurement at scale: bigger calibration sets, more benchmarks, harder candidate distributions. The next round of work also rounds out the aggregator stack for Likert-scale rubrics and hardens the code-execution sandbox.

Bigger calibration probe (n=25 → 200+)

next · ~$50 in API spend

The current probe has 25 items in the held-out test set. That's enough to demonstrate the framework, but the 2-percentage-point gap at α=0.10 has a ±6pp standard error. Scaling to 200+ items tightens the SE to ~2pp, which would let me make stronger claims about calibration quality.

Blocked on: nothing. Just compute. Would take ~30 minutes to run.
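The standard-error figures above follow from the binomial formula, treating each test item as an independent covered/not-covered trial at the nominal 90% rate. A quick check, with a hypothetical helper name:

```python
import math


def coverage_se(nominal, n):
    """Binomial standard error of an empirical coverage rate over n items."""
    return math.sqrt(nominal * (1 - nominal) / n)


for n in (25, 200):
    print(n, round(100 * coverage_se(0.90, n), 1))
# prints:
# 25 6.0
# 200 2.1
```

Since the error shrinks with the square root of n, going from 25 to 200 items (8x the data) cuts it by a factor of about 2.8.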

Ordinal Dawid-Skene aggregator for Likert-scale rubrics

next · ~4 hours of code

Continuous [0, 1] scores use the hierarchical-Gaussian aggregator. Likert 1–5 scores currently get normalized to [0, 1] and treated as continuous, which loses the ordinal structure. MACE-style ordinal Dawid-Skene (Hovy et al. NAACL 2013) is the right tool. The math is straightforward; it just hasn't been wired in yet.

Will live in src/panoptes/uq/disagreement.py as a sibling class.

Docker-isolated sandbox for code execution

soon · ~1 day of code + ops

The current sandbox uses subprocess + resource.setrlimit. That's safe enough for grading canonical solutions but I wouldn't run untrusted user-submitted code through it. A Docker backend behind the existing Sandbox Protocol gives proper isolation.

Sandbox Protocol is already in place; this is a new backend impl, no API changes.
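For readers who haven't seen the subprocess + resource.setrlimit pattern, a minimal Unix-only sketch follows. The function name, the chosen limits (CPU time and address space), and their default values are assumptions for illustration, not the framework's actual sandbox code.

```python
import resource
import subprocess
import sys


def _cap(kind, value):
    """Lower both the soft and hard limit so the child cannot raise it again."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)  # an unprivileged process cannot exceed the hard limit
    resource.setrlimit(kind, (value, value))


def run_limited(source, cpu_seconds=2, memory_bytes=1 << 30, wall_seconds=5):
    """Run Python source in a child process under CPU, memory, and wall-clock caps."""

    def apply_limits():
        # Runs in the child after fork() and before exec().
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        [sys.executable, "-I", "-c", source],  # -I: isolated mode, ignores env and user site
        capture_output=True,
        text=True,
        timeout=wall_seconds,  # raises subprocess.TimeoutExpired on expiry
        preexec_fn=apply_limits,
    )


result = run_limited("print(1 + 1)")
```

This caps resource usage but leaves the filesystem and network fully reachable from the child process, which is the gap a container backend would close.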

Wire MBPP / GSM8K / MT-Bench / TruthfulQA into the CLI

soon · ~1 day

Benchmark loaders exist for all five. Only HumanEval and the calibration probe are currently wired through the CLI. The blocker is that each benchmark needs a benchmark-specific rubric prompt and a candidate-generation step; both are mechanical.

Once MBPP and GSM8K are wired, the Mondrian conformal aggregator across task families becomes much more interesting.

Short paper with measured calibration numbers

later · ~1 week

The whole framing is 'finite-sample guarantees on LLM eval.' That claim is only credible with published, replicable numbers. A short technical writeup with the calibration table, the bandit-vs-all-judges cost comparison, and the methodology is the right vehicle for that.

Will write after the calibration probe scales to 200+.

Integration shims for Promptfoo / Inspect / LangSmith

later · ~2 days each

A lot of teams already have a Promptfoo or Inspect pipeline. PANOPTES doesn't need to replace those; it can sit on top, taking the (item, response, judge_score) records they produce and emitting the UQ + conformal layer on top. Shipping shims for the major frameworks lowers the adoption cost dramatically.

what surprised me

A few honest takeaways

LLM judges are noisier than I expected

On the same item at temperature 0, three frontier judges routinely disagree by 0.2 on a [0,1] scale. That's a much bigger spread than I assumed going in. The case for treating LLM-as-judge as a statistical problem isn't theoretical; it's empirically obvious the moment you call more than one judge.

Conformal works basically out of the box

I'd expected the conformal coverage guarantee to fail on real LLM-judge data because exchangeability is a strong assumption. Empirically, coverage lands close to nominal at α=0.1: 92% against a 90% target, though on a 25-item test set that gap sits well inside the ±6pp standard error. The theorem is unreasonably effective here.
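For context on why so little machinery is needed, here is a minimal sketch of split conformal prediction with absolute-residual scores. It assumes a calibration set where both the judge-based score and a reference score are known. The function names are hypothetical, and the framework's own nonconformity score may differ.

```python
import math


def conformal_radius(calibration_residuals, alpha=0.10):
    """Half-width giving >= 1 - alpha coverage under exchangeability.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration residual, which
    is the finite-sample correction that makes the guarantee hold for any n.
    """
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_residuals)[rank - 1]


def interval(prediction, radius):
    """Symmetric interval around a new prediction, clipped to the [0, 1] score range."""
    return max(0.0, prediction - radius), min(1.0, prediction + radius)


# 19 calibration residuals |judge score - reference score|: 0.01, 0.02, ..., 0.19
residuals = [0.01 * k for k in range(1, 20)]
radius = conformal_radius(residuals)  # the 18th smallest residual
low, high = interval(0.95, radius)
```

The only assumption behind the guarantee is that calibration and test items are exchangeable. Nothing about the judge's error distribution has to be modelled.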

The bandit story needs more data

The Thompson-sampling bandit ran in only one production setting, with low n. I believe the cost-reduction claim is real but I'd want a 200-item, multi-strategy A/B before publishing the number.
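For reference, this is what a Beta-Bernoulli Thompson-sampling loop looks like in its simplest form. What counts as an arm and a reward in PANOPTES is not spelled out here, so the sketch assumes arms are routing options with a fixed, unknown success probability. Every name and number in it is illustrative.

```python
import random


def thompson_sampling(reward_probs, rounds=2000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling and return pulls per arm.

    reward_probs holds the true success rate of each arm. It is hidden from the
    policy and used only to simulate rewards.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    wins = [1] * n_arms    # Beta(1, 1) prior: one pseudo-success per arm
    losses = [1] * n_arms  # and one pseudo-failure per arm
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its posterior,
        # then play the arm whose draw is highest.
        draws = [rng.betavariate(w, l) for w, l in zip(wins, losses)]
        arm = max(range(n_arms), key=draws.__getitem__)
        if rng.random() < reward_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.3, 0.5, 0.8])
```

In a simulation like this the policy concentrates its pulls on the best arm as the posteriors sharpen. Whether that translates into the claimed savings on real judge calls is exactly what the larger A/B would measure.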

thanks

The whole framework, end to end

Source code, citations, and the calibration script are all on GitHub. The framework is MIT-licensed and intentionally minimal in its public surface. If you want to swap in your own judges, your own benchmark, your own routing strategy: implement the corresponding Protocol class and the rest is free.

github.com/tonywangs/panoptes

paper citations

calibration receipts