calib·HumanEval/17

humaneval-calibprobe · code

This page follows one (task, candidate response) pair through the full PANOPTES pipeline. Each section below is one stage of the analysis: the task itself, the candidate solution being evaluated, each judge's score and rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.875

inter-judge spread

0.150

judges polled

2

sampling draws

10 per judge

1. the task

The function signature and docstring presented both to the model under test and to every judge. The entry-point name was obfuscated from parse_music to fn_2bcdef9bb0 so the judges can't pattern-match a memorized HumanEval solution.
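A minimal sketch of how such a rename could be done, assuming a plain whole-word substitution over the prompt text. The helper names `make_alias` and `obfuscate_entry_point` and the hash-based naming scheme are illustrative assumptions; how the pipeline actually derives fn_2bcdef9bb0 is not stated on this page.

```python
import hashlib
import re


def make_alias(entry_point: str, salt: str = "") -> str:
    """Hypothetical naming scheme: 'fn_' plus 10 hex chars of a SHA-256 digest."""
    digest = hashlib.sha256((salt + entry_point).encode("utf-8")).hexdigest()
    return "fn_" + digest[:10]


def obfuscate_entry_point(prompt: str, entry_point: str, alias: str) -> str:
    """Replace every whole-word occurrence of entry_point with alias."""
    pattern = r"\b" + re.escape(entry_point) + r"\b"
    # A callable replacement avoids any escape processing of the alias.
    return re.sub(pattern, lambda _match: alias, prompt)


prompt = (
    "def parse_music(music_string: str) -> List[int]:\n"
    "    \"\"\"\n"
    "    >>> parse_music('o o| .| o| o| .| .| .| .| o o')\n"
    "    [4, 2, 1, 2, 2, 1, 1, 1, 1, 4, 4]\n"
    "    \"\"\"\n"
)

# The alias is passed in explicitly here, since the real derivation is unknown.
renamed = obfuscate_entry_point(prompt, "parse_music", "fn_2bcdef9bb0")
print(renamed)
```

The rename has to cover the doctest lines as well as the `def` line, otherwise the original name leaks through the docstring.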

from typing import List


def fn_2bcdef9bb0(music_string: str) -> List[int]:
    """ Input to this function is a string representing musical notes in a special ASCII format.
    Your task is to parse this string and return list of integers corresponding to how many beats does each
    not last.

    Here is a legend:
    'o' - whole note, lasts four beats
    'o|' - half note, lasts two beats
    '.|' - quater note, lasts one beat

    >>> fn_2bcdef9bb0('o o| .| o| o| .| .| .| .| o o')
    [4, 2, 1, 2, 2, 1, 1, 1, 1, 4, 4]
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.
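As a rough sketch of that assembly step, the candidate can be built by concatenating the prompt with the reference body and then checked with `ast.parse`. The variable names are illustrative and the docstring is shortened here; this is not the pipeline's own code.

```python
import ast

# Prompt (signature + docstring) followed by the reference body.
# The docstring is abbreviated for this sketch.
prompt = '''from typing import List


def fn_2bcdef9bb0(music_string: str) -> List[int]:
    """ Parse the note string and return the beats of each note. """
'''
reference_body = '''    note_map = {'o': 4, 'o|': 2, '.|': 1}
    return [note_map[x] for x in music_string.split(' ') if x]
'''

candidate = prompt + reference_body

# ast.parse raises SyntaxError if the concatenation is not a complete program.
tree = ast.parse(candidate)
function_names = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
print(function_names)
```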

from typing import List


def fn_2bcdef9bb0(music_string: str) -> List[int]:
    """ Input to this function is a string representing musical notes in a special ASCII format.
    Your task is to parse this string and return list of integers corresponding to how many beats does each
    not last.

    Here is a legend:
    'o' - whole note, lasts four beats
    'o|' - half note, lasts two beats
    '.|' - quater note, lasts one beat

    >>> fn_2bcdef9bb0('o o| .| o| o| .| .| .| .| o o')
    [4, 2, 1, 2, 2, 1, 1, 1, 1, 4, 4]
    """
    note_map = {'o': 4, 'o|': 2, '.|': 1}
    return [note_map[x] for x in music_string.split(' ') if x]

3a. point-pass scores (temperature 0)

One call per judge at temperature=0, giving each judge's "best single guess". Disagreement here is structural rather than sampling noise: the judges assess this candidate differently.

claude-sonnet

score 0.950 · $0.0087

gpt-4o

score 0.800 · $0.0039
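The headline numbers at the top of the page can be reproduced from these two point scores. The sketch below assumes that "inter-judge spread" means the range (max minus min); that reading matches the reported 0.150, but the page does not define it.

```python
from statistics import mean

point_scores = {"claude-sonnet": 0.950, "gpt-4o": 0.800}
values = list(point_scores.values())

mean_score = mean(values)
# Assumption: spread is the range of the point scores across judges.
spread = max(values) - min(values)

print(f"mean score         {mean_score:.3f}")  # 0.875
print(f"inter-judge spread {spread:.3f}")      # 0.150
```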

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

4. uncertainty quantification

Three statistical methods computed on top of the raw scores above. Each one tells you something different about how much to trust the headline number.

semantic-entropy · claude-sonnet

H = 0.000 (max: log 10 ≈ 2.303)

1 semantic cluster via bidirectional NLI

entropy as fraction of max: 0.000

cluster sizes: c1=10

semantic-entropy · gpt-4o

H = 1.168 (max: log 10 ≈ 2.303)

4 semantic clusters via bidirectional NLI

entropy as fraction of max: 0.507

cluster sizes: c1=3 · c2=1 · c3=5 · c4=1
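The reported entropies are consistent with Shannon entropy (natural log) over the cluster frequencies, with log 10 as the maximum when each of the 10 draws sits in its own cluster. The sketch below assumes that definition; the function name `cluster_entropy` is illustrative.

```python
import math


def cluster_entropy(cluster_sizes):
    """Shannon entropy (natural log) of the empirical cluster distribution."""
    total = sum(cluster_sizes)
    # p * log(1/p), written as log(total/n) so a single cluster gives exactly 0.0
    return sum((n / total) * math.log(total / n) for n in cluster_sizes if n > 0)


judges = {
    "claude-sonnet": [10],
    "gpt-4o": [3, 1, 5, 1],
}

for name, sizes in judges.items():
    h = cluster_entropy(sizes)
    h_max = math.log(sum(sizes))  # maximum: every draw in its own cluster
    print(f"{name}: H = {h:.3f}, max = {h_max:.3f}, fraction = {h / h_max:.3f}")
# claude-sonnet: H = 0.000, max = 2.303, fraction = 0.000
# gpt-4o: H = 1.168, max = 2.303, fraction = 0.507
```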

How to read these: self-consistency = does the judge agree with itself when resampled? Narrower CI means more consistent. Semantic entropy = do the judge's rationales cluster into one meaning, or several? Higher entropy means the judge is internally conflicted about why. Decomposition = how much of the total variance comes from within-judge sampling noise (aleatoric, hard to fix) vs between-judge disagreement (epistemic, fixable by calling more judges).
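One way to sketch the decomposition is the law of total variance: average the per-judge variances for the within-judge part, and take the variance of the per-judge means for the between-judge part. The draws below are hypothetical, since the page does not list the per-draw scores, and the pipeline's exact estimator is not stated.

```python
from statistics import mean, pvariance

# Hypothetical temperature-1 draws; NOT the real scores from this run.
draws = {
    "judge_a": [0.95, 0.95, 0.90, 1.00, 0.95],
    "judge_b": [0.80, 0.70, 0.90, 0.85, 0.75],
}

judge_means = {judge: mean(xs) for judge, xs in draws.items()}

# Within-judge (aleatoric): average of each judge's own population variance.
within = mean([pvariance(xs) for xs in draws.values()])

# Between-judge (epistemic): population variance of the per-judge means.
between = pvariance(list(judge_means.values()))

# With equal draw counts per judge, the two parts sum to the pooled variance.
total = pvariance([x for xs in draws.values() for x in xs])

print(f"within  = {within:.5f}")
print(f"between = {between:.5f}")
print(f"total   = {total:.5f}")
```

The identity `within + between == total` holds here because population variances are used and both judges have the same number of draws; with unequal counts the parts would need weighting.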

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same model call.

claude-sonnet

score 0.950 · 2051 tokens · 4717ms

The solution correctly implements the note parsing logic. It splits the string by spaces, filters empty strings (handling edge cases like leading/trailing spaces or multiple spaces), and maps each note to its beat count using a dictionary. The canonical example produces the correct output [4, 2, 1, 2, 2, 1, 1, 1, 1, 4, 4]. The `if x` guard handles empty strings from split. The only minor concern is that an unrecognized note token would raise a KeyError rather than being handled gracefully, but the task doesn't specify error handling for invalid input. Overall, this is a clean, idiomatic solution.

gpt-4o

score 0.800 · 1175 tokens · 2090ms

The candidate solution correctly implements the task by mapping musical notes to their respective beats using a dictionary and list comprehension. It handles the example provided in the docstring correctly. However, there is a minor issue: the solution does not handle extra spaces between notes, which could lead to incorrect results if the input string contains multiple spaces. For example, 'o o|' would result in [4] instead of [4, 2]. This is a minor edge case that could be addressed by filtering out empty strings after splitting, but it does not affect the correctness for the given example.
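The two rationales make opposite claims about repeated spaces, and the candidate can simply be run to check. The sketch below assumes the example input in the gpt-4o rationale contained two consecutive spaces (the page rendering may have collapsed them); the function body is copied from section 2 with the docstring omitted.

```python
from typing import List


def fn_2bcdef9bb0(music_string: str) -> List[int]:
    note_map = {'o': 4, 'o|': 2, '.|': 1}
    return [note_map[x] for x in music_string.split(' ') if x]


# Two consecutive spaces: split(' ') yields an empty token, which `if x` drops.
print('o  o|'.split(' '))          # ['o', '', 'o|']
print(fn_2bcdef9bb0('o  o|'))      # [4, 2]
print(fn_2bcdef9bb0(' o o| .| '))  # [4, 2, 1]

# An unrecognized token is not handled and raises KeyError.
try:
    fn_2bcdef9bb0('o x')
except KeyError as exc:
    print('KeyError:', exc)        # KeyError: 'x'
```

Under that assumption, the candidate already filters empty tokens, which matches the claude-sonnet rationale; the KeyError on unknown tokens is the one gap both the code and that rationale agree on.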