calib·HumanEval/35

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.650

inter-judge spread

0.300

judges polled

sampling draws

1. the task

The function signature + docstring presented to both the model under test and to every judge. The entry-point name was obfuscated from max_element to fn_41f4405780 so the judges can't pattern-match a memorized HumanEval solution.



def fn_41f4405780(l: list):
    """Return maximum element in the list.
    >>> fn_41f4405780([1, 2, 3])
    3
    >>> fn_41f4405780([5, 3, -5, 2, -3, 3, 9, 0, 123, 1, -10])
    123
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.



def fn_41f4405780(l: list):
    """Return maximum element in the list.
    >>> fn_41f4405780([1, 2, 3])
    3
    >>> fn_41f4405780([5, 3, -5, 2, -3, 3, 9, 0, 123, 1, -10])
    123
    """
    m = l[0]
    for e in l:
        if e > m:
            m = e
    return m

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is the "best single guess" each judge has. Disagreement here is structural. The judges literally see this candidate differently.

claude-sonnet

0.800

$0.0075

gpt-4o

0.500

$0.0032

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

no sampling pass for this run

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same forward pass.

claude-sonnet

score 0.800 · 1801 tokens · 3964ms

The solution correctly implements a linear scan to find the maximum element, passing both provided examples. The logic is sound: it initializes `m` to the first element and updates it whenever a larger element is found. However, it fails on an edge case: an empty list will raise an `IndexError` at `l[0]` rather than handling it gracefully (e.g., raising a `ValueError` with a meaningful message or returning `None`). This is a plausible edge case for a "find maximum" function. Otherwise the solution is clean and idiomatic.

gpt-4o

score 0.500 · 966 tokens · 1706ms

The candidate solution correctly implements a function to find the maximum element in a list by iterating through the list and updating the maximum value found. It passes the provided examples. However, it fails to handle edge cases such as an empty list, which would raise an IndexError when trying to access l[0]. This is a significant oversight, as the function should handle such cases gracefully, either by returning None or raising a more informative exception.

missing_tests