HE/24

humaneval · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.800

inter-judge spread

0.000

judges polled

sampling draws

1. the task

The function signature + docstring presented to both the model under test and to every judge.



def largest_divisor(n: int) -> int:
    """ For a given number n, find the largest number that divides n evenly, smaller than n
    >>> largest_divisor(15)
    5
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.



def largest_divisor(n: int) -> int:
    """ For a given number n, find the largest number that divides n evenly, smaller than n
    >>> largest_divisor(15)
    5
    """
    for i in reversed(range(n)):
        if n % i == 0:
            return i

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is each judge's best single guess. Disagreement here is structural: the judges read this candidate differently.

gpt-4o-mini

0.800

$0.0002

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

4. uncertainty quantification

Three statistical methods computed on top of the raw scores above. Each one tells you something different about how much to trust the headline number.

self-consistency

gpt-4o-mini

0.740 [0.652, 0.796]

posterior mean · 90% Bayesian-bootstrap CI

variance: 1.80e-2

n samples: 5
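
A Bayesian-bootstrap interval of the kind reported above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `bayesian_bootstrap_mean`, the number of resamples, the seed, and the five scores are all assumptions made up for the example (the real draws are not listed on this page).

```python
import random

def bayesian_bootstrap_mean(scores, draws=4000, level=0.90, seed=0):
    """Posterior mean and credible interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        # Dirichlet(1, ..., 1) weights, built by normalizing Exp(1) variates.
        raw = [rng.expovariate(1.0) for _ in scores]
        total = sum(raw)
        means.append(sum(w * s for w, s in zip(raw, scores)) / total)
    means.sort()
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return sum(means) / draws, (means[lo_idx], means[hi_idx])

# Hypothetical judge scores, NOT the draws behind the numbers above.
example_scores = [0.7, 0.9, 0.8, 0.6, 0.9]
posterior_mean, (ci_low, ci_high) = bayesian_bootstrap_mean(example_scores)
print(f"{posterior_mean:.3f} [{ci_low:.3f}, {ci_high:.3f}]")
```

Unlike the classical bootstrap, each resample reweights the observed scores instead of redrawing them, which keeps the interval smooth even with only five samples.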

semantic-entropy

gpt-4o-mini

H = 0.673 of a maximum log 5 ≈ 1.609

2 semantic clusters via bidirectional NLI

entropy as fraction of max

cluster sizes

c1=2 · c2=3
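
The entropy figure can be reproduced from the cluster sizes alone. The sketch below assumes the natural logarithm (consistent with log 5 ≈ 1.609) and treats each cluster's share of the 5 samples as its probability. The helper name `cluster_entropy` is hypothetical, and the NLI clustering step itself is not shown.

```python
import math

def cluster_entropy(sizes):
    """Shannon entropy (in nats) of the distribution over semantic clusters."""
    total = sum(sizes)
    return -sum((s / total) * math.log(s / total) for s in sizes if s > 0)

sizes = [2, 3]                      # c1=2, c2=3 from the panel above
h = cluster_entropy(sizes)
h_max = math.log(sum(sizes))        # every sample in its own cluster
print(round(h, 3))                  # 0.673
print(round(h_max, 3))              # 1.609
print(round(h / h_max, 3))          # 0.418
```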

How to read these: self-consistency = does the judge agree with itself when resampled? Narrower CI means more consistent. Semantic entropy = do the judge's rationales cluster into one meaning, or several? Higher entropy means the judge is internally conflicted about why. Decomposition = how much of the total variance comes from within-judge sampling noise (aleatoric, hard to fix) vs between-judge disagreement (epistemic, fixable by calling more judges).
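
One common way to compute such a decomposition is the law of total variance, sketched below. This is an assumption about the method, not a description of the pipeline: the function name, the judge names, and the scores are all hypothetical, and the identity is exact only when every judge has the same number of draws.

```python
from statistics import fmean, pvariance

def decompose_variance(draws_by_judge):
    """Split total score variance into within-judge and between-judge parts."""
    groups = list(draws_by_judge.values())
    # Within-judge (sampling noise): average of each judge's own variance.
    within = fmean([pvariance(g) for g in groups])
    # Between-judge (disagreement): variance of the per-judge means.
    between = pvariance([fmean(g) for g in groups])
    return within, between

# Hypothetical draws for two hypothetical judges.
draws = {
    "judge_a": [0.8, 0.7, 0.9, 0.8, 0.8],
    "judge_b": [0.5, 0.6, 0.6, 0.4, 0.5],
}
within, between = decompose_variance(draws)
pooled = pvariance([s for g in draws.values() for s in g])
print(within, between, pooled)
```

With a single judge the between-judge term is zero by construction, so the split is only informative once at least two judges are polled.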

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same forward pass.

gpt-4o-mini

score 0.800 · 854 tokens · 2848ms

The solution correctly identifies the largest divisor for the example provided (15), returning 5. However, it does not handle the edge case where n is 1, which would lead to a division by zero error. The logic is sound, but the implementation could be improved for edge cases and efficiency by starting the loop from n//2 instead of n-1.

missing_tests
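
The edge case the judge describes can be checked directly. The sketch below reproduces the candidate under a different name and adds a guarded variant along the lines the rationale suggests. The task docstring does not say what should happen for n < 2, so raising `ValueError` there is an assumption, as are both function names.

```python
def largest_divisor_original(n: int) -> int:
    # The candidate as judged: range(n) ends at 0, so n = 1 evaluates 1 % 0.
    for i in reversed(range(n)):
        if n % i == 0:
            return i

def largest_divisor_guarded(n: int) -> int:
    # Assumption: reject inputs that have no positive divisor smaller than n.
    if n < 2:
        raise ValueError("n must be at least 2")
    # No proper divisor can exceed n // 2, so the search starts there.
    for i in range(n // 2, 1, -1):
        if n % i == 0:
            return i
    return 1  # 1 divides every integer

print(largest_divisor_guarded(15))  # 5
try:
    largest_divisor_original(1)
except ZeroDivisionError as exc:
    print("n = 1 fails:", exc)
```

For n ≥ 2 the original loop always returns before reaching 0, because 1 divides every n, so the two versions agree on those inputs.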