calib·HumanEval/15

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

1.000

inter-judge spread

0.000

judges polled

sampling draws

1. the task

The function signature + docstring presented to both the model under test and to every judge. The entry-point name was obfuscated from string_sequence to fn_9768079406 so the judges can't pattern-match a memorized HumanEval solution.
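The renaming step can be sketched as a whole-word substitution over the prompt text. This is a minimal illustration, not the pipeline's actual code: the helper `obfuscate_entry_point` is hypothetical, and how the alias `fn_9768079406` was generated is not stated in this report, so it is passed in as a literal.

```python
import re


def obfuscate_entry_point(source: str, original: str, alias: str) -> str:
    """Replace whole-word occurrences of `original` with `alias` (hypothetical helper)."""
    return re.sub(rf"\b{re.escape(original)}\b", alias, source)


# The prompt as it would look under its original HumanEval name.
prompt = '''def string_sequence(n: int) -> str:
    """ Return a string containing space-delimited numbers starting from 0 upto n inclusive.
    >>> string_sequence(0)
    '0'
    >>> string_sequence(5)
    '0 1 2 3 4 5'
    """
'''

renamed = obfuscate_entry_point(prompt, "string_sequence", "fn_9768079406")
print(renamed)
```

The substitution covers the doctest calls as well as the `def` line, so the docstring examples stay consistent with the new name.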



def fn_9768079406(n: int) -> str:
    """ Return a string containing space-delimited numbers starting from 0 upto n inclusive.
    >>> fn_9768079406(0)
    '0'
    >>> fn_9768079406(5)
    '0 1 2 3 4 5'
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution with the task signature prepended, so it parses as a complete program.



def fn_9768079406(n: int) -> str:
    """ Return a string containing space-delimited numbers starting from 0 upto n inclusive.
    >>> fn_9768079406(0)
    '0'
    >>> fn_9768079406(5)
    '0 1 2 3 4 5'
    """
    return ' '.join([str(x) for x in range(n + 1)])
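As a sanity check independent of the judges, the two docstring examples can be executed against the candidate with the standard-library `doctest` module. This is an illustrative sketch and is not described as part of the pipeline; the candidate is copied from above.

```python
import doctest


def fn_9768079406(n: int) -> str:
    """ Return a string containing space-delimited numbers starting from 0 upto n inclusive.
    >>> fn_9768079406(0)
    '0'
    >>> fn_9768079406(5)
    '0 1 2 3 4 5'
    """
    return ' '.join([str(x) for x in range(n + 1)])


# Parse the examples out of the docstring and run them against the function.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    fn_9768079406.__doc__,
    {"fn_9768079406": fn_9768079406},  # names visible to the examples
    "fn_9768079406",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(f"attempted={results.attempted} failed={results.failed}")
```

Both examples pass, which agrees with the unanimous 1.000 scores reported below.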

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is each judge's "best single guess". Disagreement here is structural rather than sampling noise: it means the judges see this candidate differently.

claude-sonnet

1.000

$0.0071

gpt-4o

1.000

$0.0029
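The headline numbers at the top of the page can be reproduced from these two point scores. The report does not say how "inter-judge spread" is defined; the sketch below assumes a population standard deviation across judges, which is an assumption of this example. With identical scores, a max-minus-min range would also give 0.000.

```python
from statistics import mean, pstdev

# Point-pass scores copied from the cards above.
point_scores = {"claude-sonnet": 1.000, "gpt-4o": 1.000}

scores = list(point_scores.values())
mean_score = mean(scores)
# Assumed definition of spread: population standard deviation across judges.
inter_judge_spread = pstdev(scores)

print(f"mean score         {mean_score:.3f}")
print(f"inter-judge spread {inter_judge_spread:.3f}")
```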

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

4. uncertainty quantification

Three statistical methods computed on top of the raw scores above. Each one tells you something different about how much to trust the headline number.

semantic-entropy

claude-sonnet

H = 0.000 (maximum: log 10 ≈ 2.303)

1 semantic cluster via bidirectional NLI

entropy as fraction of max

cluster sizes

c1=10

semantic-entropy

gpt-4o

H = 0.000 (maximum: log 10 ≈ 2.303)

1 semantic cluster via bidirectional NLI

entropy as fraction of max

cluster sizes

c1=10
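The entropy values above follow directly from the cluster sizes. A minimal sketch, assuming H is the Shannon entropy (natural log) of the cluster-size distribution and that the maximum, log 10, corresponds to 10 draws each landing in its own cluster. The function name `semantic_entropy` is illustrative, and the NLI clustering step itself is not shown.

```python
import math


def semantic_entropy(cluster_sizes: list[int]) -> float:
    """Shannon entropy (nats) of the distribution over semantic clusters."""
    total = sum(cluster_sizes)
    # p * log(1/p) keeps every term non-negative.
    return sum((c / total) * math.log(total / c) for c in cluster_sizes if c > 0)


cluster_sizes = [10]                    # c1=10, as reported for both judges
h = semantic_entropy(cluster_sizes)
h_max = math.log(sum(cluster_sizes))    # every draw in its own cluster
print(f"H = {h:.3f} / {h_max:.3f}, fraction of max = {h / h_max:.3f}")
```

A single cluster holding all 10 rationales gives p = 1 and therefore H = 0. An even split into two clusters of 5 would give log 2 ≈ 0.693.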

How to read these:

Self-consistency = does the judge agree with itself when resampled? A narrower CI means more consistent.

Semantic entropy = do the judge's rationales cluster into one meaning, or several? Higher entropy means the judge is internally conflicted about why.

Decomposition = how much of the total variance comes from within-judge sampling noise (aleatoric, hard to fix) vs between-judge disagreement (epistemic, fixable by calling more judges).
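The decomposition can be illustrated with the law of total variance. The sketch below assumes equal numbers of draws per judge and uses population variances. The draw values are hypothetical, chosen only to show a non-zero split, because this report does not list individual draws; the function name `decompose_variance` is likewise illustrative.

```python
from statistics import mean, pvariance


def decompose_variance(draws: dict[str, list[float]]) -> tuple[float, float]:
    """Return (within-judge, between-judge) variance for equal-sized groups."""
    within = mean(pvariance(d) for d in draws.values())      # aleatoric part
    between = pvariance([mean(d) for d in draws.values()])   # epistemic part
    return within, between


# Hypothetical draws, NOT taken from this run.
draws = {
    "judge_a": [1.0, 1.0, 0.5, 1.0],
    "judge_b": [0.5, 0.5, 0.5, 0.5],
}
within, between = decompose_variance(draws)
pooled = [score for d in draws.values() for score in d]
total = pvariance(pooled)
print(f"within={within:.4f} between={between:.4f} total={total:.4f}")
```

With equal group sizes, the within and between terms sum to the pooled variance. If every draw from every judge scored 1.0, both terms would be zero.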

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale come from the same model response.

claude-sonnet

score 1.000 · 1751 tokens · 2941ms

The solution correctly joins numbers from 0 to n inclusive as space-delimited strings. It handles the base case (n=0 → '0'), the example case (n=5 → '0 1 2 3 4 5'), and edge cases like n=1. The implementation is idiomatic Python using a list comprehension with str() and ' '.join().

gpt-4o

score 1.000 · 903 tokens · 1191ms

The candidate solution correctly implements the function to return a space-delimited string of numbers from 0 to n inclusive. It uses a list comprehension to convert each number in the range to a string and then joins them with spaces. The solution is idiomatic, clear, and handles edge cases such as n=0 correctly, as demonstrated by the provided examples.