calibration probe

Does the 90% interval actually contain the truth 90% of the time?

Conformal prediction guarantees, on paper, that the prediction interval contains the true value at least 1 − α of the time. This page measures whether that guarantee actually holds on a real held-out test set, with real ground-truth labels.

headline · split conformal · Claude Sonnet

92% empirical coverage at the nominal 90% target · 2.0pp gap

Split conformal achieved 92% empirical coverage at the nominal 90% target on the held-out test split of the obfuscated-HumanEval calibration probe. The 2pp gap meets the v1.0 spec target.

conformal prediction in 90 seconds

1. Take a held-out calibration set with known labels. For each (item, judge), compute a conformity score. Here it's the absolute difference between the judge's score and the ground-truth label.
2. Sort those conformity scores. Take the empirical quantile at level ⌈(n + 1)(1 − α)⌉ / n, i.e. the ⌈(n + 1)(1 − α)⌉-th smallest score, where n is the calibration set size. Call it q.
3. On a new (test) item, the prediction interval is [ŷ − q, ŷ + q], where ŷ is the judge's score. That's it.
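The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the probe's actual code: the function names and the toy calibration data are assumptions made up for the example.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal quantile q from calibration conformity scores."""
    n = len(scores)
    # Rank of the order statistic: k = ceil((n + 1) * (1 - alpha)).
    # round(..., 9) guards against float noise such as 23.400000000000002.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        # Too few calibration points for this alpha: only an unbounded
        # interval can carry the guarantee.
        return math.inf
    return sorted(scores)[k - 1]  # k-th smallest, 1-indexed

def prediction_interval(y_hat, q):
    """Symmetric interval around the judge's score."""
    return (y_hat - q, y_hat + q)

# Toy calibration data (hypothetical): judge scores and pass/fail labels.
cal_judge = [0.9, 0.8, 1.0, 0.3, 0.7, 0.95, 0.1, 0.6, 1.0]
cal_truth = [1, 1, 1, 0, 1, 1, 0, 1, 1]

# Step 1: conformity score = |judge score - ground truth|.
scores = [abs(s - y) for s, y in zip(cal_judge, cal_truth)]
# Step 2: quantile at the conformal rank.
q = conformal_quantile(scores, alpha=0.2)
# Step 3: interval for a new judge score.
low, high = prediction_interval(0.8, q)
print(f"q={q:.2f} interval=[{low:.2f}, {high:.2f}]")
```

With n = 25 calibration items, the rank works out to 24 at α = 0.10 and 25 (the largest score) at α = 0.05, which is why small calibration sets produce wide intervals at low α.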

The guarantee: under exchangeability of calibration and test data, the true label falls inside that interval with probability ≥ 1 − α. No Gaussian assumption, no parametric model. Finite-sample valid.
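Written out, the guarantee is a marginal statement: the probability is taken over the draw of both the calibration set and the test item, not conditional on one particular calibration set. The upper bound shown here is the standard companion result and additionally assumes the conformity scores are almost surely distinct, which may not hold for coarsely discretized judge scores; the symbols y_{n+1} and ŷ_{n+1} for the test label and judge score are notation introduced for this sketch.

```latex
1 - \alpha
\;\le\;
\Pr\bigl( y_{n+1} \in [\,\hat{y}_{n+1} - q,\; \hat{y}_{n+1} + q\,] \bigr)
\;\le\;
1 - \alpha + \frac{1}{n + 1}
```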

why this benchmark is the right test

• Obfuscated HumanEval. Every problem's entry-point function is renamed to an opaque hash, so judges can't pattern-match memorized solutions. (Memorized solutions would inflate scores artificially.)
• Real ground truth. Each candidate solution is executed in a sandboxed Python subprocess against the rewritten test block. Pass / fail is mechanical, not another LLM's opinion.
• 50 / 50 split. Half the items fit the conformal quantile; the other half measures whether the quantile actually holds. The deterministic seed makes the split reproducible.
• Honest noise. With n_test ≈ 25, the standard error on an empirical coverage estimate is ≈ ±6pp. The 2pp gap is consistent with valid coverage, not a claim of perfect calibration.
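The ±6pp figure can be reproduced with the usual binomial standard error, sqrt(p(1 − p) / n). The sketch below assumes each test item is an independent covered / not-covered trial at the nominal rate; the function name is made up for the example.

```python
import math

def coverage_standard_error(coverage, n_test):
    """Binomial standard error of an empirical coverage estimate."""
    return math.sqrt(coverage * (1 - coverage) / n_test)

# Nominal 90% coverage measured on 25 held-out items.
se = coverage_standard_error(0.90, 25)
print(f"standard error: {se * 100:.1f}pp")  # standard error: 6.0pp
```

A 2pp gap is therefore about a third of one standard error, well inside what sampling noise alone would produce.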

reliability diagram

Empirical vs nominal, every α

Each dot is one (judge, α) measurement. The dashed diagonal is "perfect" calibration, where empirical equals nominal. The shaded green region above the diagonal is the safe-side direction (over-covers, conservative). The conformal guarantee says points should fall in the green region or on the line, up to sampling noise on the small test split. The failure mode is points falling clearly below.

Legend: claude-sonnet · gpt-4o · overcoverage (safe)

Read this as: at the bottom-left (α = 0.4), the target is 60% coverage. At the top-right (α = 0.05), the target is 95% coverage. The dots cluster on or above the line, exactly where the theorem says they should be.

full coverage table

Per-judge × per-α

Green = empirical is within 5pp of nominal. Amber = within 10pp. Red = more than 10pp off. Over-covering counts as fine. The theorem is a lower bound, not equality. The one row that matters most for the v1.0 spec target is Claude at α = 0.10.

judge	α	nominal (1−α)	empirical	|emp − nom|	q	n_cal	n_test
claude-sonnet	0.05	0.95	1.00	5.0pp	0.500	25	25
claude-sonnet	0.10	0.90	0.92	2.0pp	0.200	25	25
claude-sonnet	0.20	0.80	0.92	12.0pp	0.200	25	25
claude-sonnet	0.30	0.70	0.84	14.0pp	0.100	25	25
gpt-4o	0.05	0.95	1.00	5.0pp	0.500	24	19
gpt-4o	0.10	0.90	1.00	10.0pp	0.500	24	19
gpt-4o	0.20	0.80	0.84	4.2pp	0.000	24	19
gpt-4o	0.30	0.70	0.84	14.2pp	0.000	24	19
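The gpt-4o gaps of 4.2pp and 14.2pp look slightly off against a rounded empirical value of 0.84, but they follow from the unrounded fraction. The sketch below recovers the covered-item count behind a reported coverage value, assuming the empirical column is rounded to two decimal places; the helper name is hypothetical.

```python
def covered_counts(empirical, n_test):
    """All counts k for which k / n_test rounds to the reported coverage."""
    return [k for k in range(n_test + 1)
            if abs(round(k / n_test, 2) - empirical) < 1e-9]

# gpt-4o rows at alpha = 0.20 and 0.30: reported 0.84 on 19 test items.
counts = covered_counts(0.84, 19)
k = counts[0]
gap_at_080 = abs(k / 19 - 0.80) * 100
gap_at_070 = abs(k / 19 - 0.70) * 100
print(counts, f"{gap_at_080:.1f}pp", f"{gap_at_070:.1f}pp")  # [16] 4.2pp 14.2pp

# claude-sonnet rows: reported 0.92 and 0.84 on 25 test items.
print(covered_counts(0.92, 25), covered_counts(0.84, 25))  # [23] [21]
```

With only 19 or 25 test items, one item moves coverage by roughly 4 to 5 percentage points, so adjacent α rows often share the same empirical value.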

coverage–width Pareto (inter-judge stand-in)

Sweeping α

A second view: instead of fixing α at 0.10, what happens as we sweep α from 0.5 down to 0.01? The dashed gray line is the nominal coverage target (1 − α); the green line is the empirical coverage on the same data. These curves come from the inter-judge spread inside one production run, not from the held-out calibration probe. The story is the same though. Empirical tracks or exceeds nominal across the range.

from run panoptes-e86ef9e3 · strategy all

candidate generation

The judges score real, noisy code rather than curated reference solutions.

model: gpt-4o-mini
temperature: 0.7
pass rate: 94%
passes / fails: 47 / 3

A mid-range temperature keeps the candidates from all being correct: measuring calibration needs a meaningful mix of passes and fails. (A 94% pass rate is on the high side. With a weaker generator, the high-α coverage rows would land closer to nominal instead of over-covering.)

judges

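claude-sonnet: 50/50 valid scores
gpt-4o: 43/50 valid scores

7 gpt-4o calls failed with OpenAI 429 errors (token-per-minute rate limit) during concurrent runs; the framework's error tolerance dropped them and continued. The 43 remaining scores account for the gpt-4o rows in the coverage table (n_cal = 24, n_test = 19).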

methodology fine print

Candidates generated by gpt-4o-mini at temperature 0.7 to elicit a mix of correct and incorrect solutions.
Ground-truth labels assigned by sandboxed execution of each candidate against the obfuscated HumanEval test block.
Items split 50/50 into calibration and held-out test sets (deterministic seed=0).
Conformal residual = |judge_score - is_correct|. Split-conformal quantile q computed on the cal split; empirical coverage on the test split.
Conformal theorem guarantees coverage >= 1 - alpha, not = 1 - alpha. Over-covering rows are the safe-side direction.
Standard error on n_test ~ 25 is roughly +/- 6pp, so the 2pp gap is consistent with valid coverage rather than 'perfect calibration proven'.
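Put together, the fine print describes a short pipeline: a seeded split, residuals on the calibration half, a quantile, and a coverage count on the test half. The following is a minimal sketch under stated assumptions, not the probe's code: the record fields `judge_score` and `is_correct` follow the residual formula above, while the function names, the shuffle-then-halve split, and the synthetic records are invented for illustration.

```python
import math
import random

def split_items(records, seed=0):
    """Shuffle with a fixed seed, then cut in half: (calibration, test)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def residual(record):
    """Conformal residual = |judge_score - is_correct|."""
    return abs(record["judge_score"] - record["is_correct"])

def conformal_quantile(residuals, alpha):
    """k-th smallest residual with k = ceil((n + 1) * (1 - alpha))."""
    n = len(residuals)
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    return math.inf if k > n else sorted(residuals)[k - 1]

def empirical_coverage(test_records, q):
    """Fraction of test items whose label lies within q of the judge score."""
    hits = sum(residual(r) <= q for r in test_records)
    return hits / len(test_records)

# Synthetic stand-in data (NOT the probe's data): 50 items, mostly passing.
data_rng = random.Random(1)
records = []
for item_id in range(50):
    is_correct = 1 if data_rng.random() < 0.9 else 0
    noise = data_rng.uniform(-0.4, 0.4)
    judge_score = min(1.0, max(0.0, is_correct + noise))
    records.append({"item_id": item_id,
                    "judge_score": judge_score,
                    "is_correct": is_correct})

cal, test = split_items(records, seed=0)
q = conformal_quantile([residual(r) for r in cal], alpha=0.10)
coverage = empirical_coverage(test, q)
print(f"q={q:.3f} coverage={coverage:.2f} n_cal={len(cal)} n_test={len(test)}")
```

Because q is fit only on the calibration half, the coverage measured on the test half is an honest out-of-sample check rather than a restatement of the fit.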