Does the 90% interval actually contain the truth 90% of the time?
Conformal prediction guarantees, on paper, that the prediction interval contains the true value at least 1 − α of the time. This page measures whether that guarantee actually holds on a real held-out test set, with real ground-truth labels.
Split conformal achieved 92% empirical coverage against the nominal 90% target on the held-out test split of the obfuscated-HumanEval calibration probe (claude-sonnet judge, α = 0.10). The 2pp gap meets the v1.0 spec target.
1. Take a held-out calibration set with known labels. For each (item, judge), compute a conformity score. Here it's the absolute difference between the judge's score and the ground-truth label.
2. Sort those conformity scores. Take the ⌈(n + 1)(1 − α)⌉ / n empirical quantile, where n is the number of calibration scores. Call it q.
3. On a new (test) item, the prediction interval is [ŷ − q, ŷ + q]. That's it.
The guarantee: under exchangeability of calibration and test data, the true label falls inside that interval with probability ≥ 1 − α. No Gaussian assumption, no parametric model. Finite-sample valid.
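The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the probe's actual code: the function names `conformal_quantile` and `prediction_interval` and the example scores are hypothetical. The one design choice worth noting is the fallback to an infinite q when ⌈(n + 1)(1 − α)⌉ exceeds n, which is the usual way to keep the guarantee when the calibration set is too small for the requested α.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest conformity score.

    Picking that order statistic is the same as taking the
    ceil((n + 1)(1 - alpha)) / n empirical quantile.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        # Too few calibration points for this alpha: only an
        # unbounded interval keeps the coverage guarantee.
        return math.inf
    return sorted(scores)[rank - 1]

def prediction_interval(y_hat, q):
    """Symmetric interval [y_hat - q, y_hat + q] around a point prediction."""
    return (y_hat - q, y_hat + q)

# Hypothetical conformity scores |judge_score - label| for 10 calibration items.
cal_scores = [0.0, 0.1, 0.1, 0.2, 0.3, 0.0, 0.5, 0.2, 0.1, 0.4]
q = conformal_quantile(cal_scores, alpha=0.2)  # rank 9 of 10 -> 0.4
print(q, prediction_interval(0.5, q))
```

With n = 25 calibration items and α = 0.10, the rank is ⌈26 × 0.9⌉ = 24, so q is the second-largest calibration score.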
- Obfuscated HumanEval. Every problem's entry-point function is renamed to an opaque hash, so judges can't pattern-match memorized solutions. (Memorized solutions would inflate scores artificially.)
- Real ground truth. Each candidate solution is executed in a sandboxed Python subprocess against the rewritten test block. Pass / fail is mechanical, not another LLM's opinion.
- 50 / 50 split. Half the items are used to fit the conformal quantile; the other half measures whether the coverage actually holds. The deterministic seed makes the split reproducible.
- Honest noise. With n_test ≈ 25, the standard error on an empirical coverage estimate is ≈ ±6pp. The 2pp gap is consistent with valid coverage, not a claim of perfect calibration.
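The ±6pp figure can be checked with the binomial standard error of a proportion, sqrt(p(1 − p) / n). The sketch below assumes test items are independent and plugs in the nominal coverage p = 0.90; the function name `coverage_standard_error` is hypothetical.

```python
import math

def coverage_standard_error(coverage, n_test):
    """Binomial standard error of an empirical coverage estimate."""
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(coverage * (1 - coverage) / n_test)

# Nominal 90% coverage measured on 25 held-out items.
se = coverage_standard_error(0.90, 25)
print(f"{100 * se:.1f}pp")  # 6.0pp
```

A 2pp gap is about a third of one standard error, which is why it cannot be read as evidence of either miscalibration or perfect calibration.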
Empirical vs nominal, every α
Each dot is one (judge, α) measurement. The dashed diagonal is "perfect" calibration, where empirical equals nominal. The shaded green region above the diagonal is the safe-side direction (over-covers, conservative). Conformal's theorem says points should fall in the green region or on the line. The failure mode is points falling below.
Read this as: at the bottom-left (α = 0.4), the target is 60% coverage. At the top-right (α = 0.05), the target is 95% coverage. The dots cluster on or above the line, exactly where the theorem says they should be.
Per-judge × per-α
Green = empirical is within 5pp of nominal. Amber = within 10pp. Red = more than 10pp off. Over-covering counts as fine. The theorem is a lower bound, not equality. The one row that matters most for the v1.0 spec target is Claude at α = 0.10.
| judge | α | nominal (1−α) | empirical | \|emp − nom\| | q | n_cal | n_test |
|---|---|---|---|---|---|---|---|
| claude-sonnet | 0.05 | 0.95 | 1.00 | 5.0pp | 0.500 | 25 | 25 |
| claude-sonnet | 0.10 | 0.90 | 0.92 | 2.0pp | 0.200 | 25 | 25 |
| claude-sonnet | 0.20 | 0.80 | 0.92 | 12.0pp | 0.200 | 25 | 25 |
| claude-sonnet | 0.30 | 0.70 | 0.84 | 14.0pp | 0.100 | 25 | 25 |
| gpt-4o | 0.05 | 0.95 | 1.00 | 5.0pp | 0.500 | 24 | 19 |
| gpt-4o | 0.10 | 0.90 | 1.00 | 10.0pp | 0.500 | 24 | 19 |
| gpt-4o | 0.20 | 0.80 | 0.84 | 4.2pp | 0.000 | 24 | 19 |
| gpt-4o | 0.30 | 0.70 | 0.84 | 14.2pp | 0.000 | 24 | 19 |
Sweeping α
A second view: instead of fixing α at 0.10, what happens as we sweep α from 0.5 down to 0.01? The dashed gray line is the nominal coverage target (1 − α); the green line is the empirical coverage on the same data. These curves come from the inter-judge spread inside one production run, not from the held-out calibration probe. The story is the same though. Empirical tracks or exceeds nominal across the range.
We make the judges judge real noisy code.
Candidates are generated at mid temperature so that they aren't all correct: we need a meaningful mix of pass and fail to actually measure calibration. (The 94% pass rate is on the high side. With a weaker generator, the high-α coverage rows would land closer to nominal instead of over-covering.)
- Candidates generated by gpt-4o-mini at temperature 0.7 to elicit a mix of correct and incorrect solutions.
- Ground-truth labels assigned by sandboxed execution of each candidate against the obfuscated HumanEval test block.
- Items split 50/50 into calibration and held-out test sets (deterministic seed=0).
- Conformal residual = |judge_score - is_correct|. Split-conformal quantile q computed on the cal split; empirical coverage on the test split.
- The conformal theorem guarantees coverage ≥ 1 − α, not = 1 − α. Over-covering rows are the safe-side direction.
- Standard error on n_test ≈ 25 is roughly ±6pp, so the 2pp gap is consistent with valid coverage rather than 'perfect calibration proven'.
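The measurement loop described in this list can be sketched end to end. Everything here is an assumption for illustration: the helper names, the use of `random.Random(seed).shuffle` for the deterministic split, and the synthetic judge scores and labels, which stand in for the real probe data and will not reproduce the table above.

```python
import math
import random

def split_indices(n_items, seed=0):
    """Deterministic 50/50 split into calibration and test indices."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    half = n_items // 2
    return idx[:half], idx[half:]

def conformal_quantile(residuals, alpha):
    """ceil((n + 1)(1 - alpha))-th smallest residual, or inf if n is too small."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(residuals)[rank - 1]

def empirical_coverage(judge_scores, labels, cal_idx, test_idx, alpha):
    """Fit q on the calibration split, then measure coverage on the test split."""
    def residual(i):
        return abs(judge_scores[i] - labels[i])  # |judge_score - is_correct|
    q = conformal_quantile([residual(i) for i in cal_idx], alpha)
    covered = sum(residual(i) <= q for i in test_idx)
    return q, covered / len(test_idx)

# Synthetic stand-in data: 50 items, pass/fail labels, noisy judge scores in [0, 1].
rng = random.Random(1)
labels = [int(rng.random() < 0.8) for _ in range(50)]
judge_scores = [min(1.0, max(0.0, y + rng.gauss(0, 0.2))) for y in labels]

cal_idx, test_idx = split_indices(len(labels), seed=0)
for alpha in (0.05, 0.10, 0.20, 0.30):
    q, cov = empirical_coverage(judge_scores, labels, cal_idx, test_idx, alpha)
    print(f"alpha={alpha:.2f} nominal={1 - alpha:.2f} empirical={cov:.2f} q={q:.3f}")
```

Because q is fit only on the calibration indices and coverage is counted only on the test indices, the two halves never leak into each other, which is what the exchangeability argument relies on.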