calib·HumanEval/43

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.975

inter-judge spread

0.050

judges polled

sampling draws

1. the task

The function signature + docstring presented to both the model under test and to every judge. The entry-point name was obfuscated from pairs_sum_to_zero to fn_c751c38b6a so the judges can't pattern-match a memorized HumanEval solution.



def fn_c751c38b6a(l):
    """
    fn_c751c38b6a takes a list of integers as an input.
    it returns True if there are two distinct elements in the list that
    sum to zero, and False otherwise.
    >>> fn_c751c38b6a([1, 3, 5, 0])
    False
    >>> fn_c751c38b6a([1, 3, -2, 1])
    False
    >>> fn_c751c38b6a([1, 2, 3, 7])
    False
    >>> fn_c751c38b6a([2, 4, -5, 3, 5, 7])
    True
    >>> fn_c751c38b6a([1])
    False
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.



def fn_c751c38b6a(l):
    """
    fn_c751c38b6a takes a list of integers as an input.
    it returns True if there are two distinct elements in the list that
    sum to zero, and False otherwise.
    >>> fn_c751c38b6a([1, 3, 5, 0])
    False
    >>> fn_c751c38b6a([1, 3, -2, 1])
    False
    >>> fn_c751c38b6a([1, 2, 3, 7])
    False
    >>> fn_c751c38b6a([2, 4, -5, 3, 5, 7])
    True
    >>> fn_c751c38b6a([1])
    False
    """
    for i, l1 in enumerate(l):
        for j in range(i + 1, len(l)):
            if l1 + l[j] == 0:
                return True
    return False

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is the "best single guess" each judge has. Disagreement here is structural. The judges literally see this candidate differently.

claude-sonnet

0.950

$0.0082

gpt-4o

1.000

$0.0037

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

no sampling pass for this run

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same forward pass.

claude-sonnet

score 0.950 · 2028 tokens · 3904ms

The solution correctly implements the task: it checks all pairs of distinct elements (by index) and returns True if any pair sums to zero. It passes all provided doctests. Edge cases like [0, 0] would correctly return True (two distinct positions with value 0 summing to 0), and [0] returns False. The only minor issue is O(n²) time complexity when a set-based O(n) solution would be more idiomatic, but correctness is solid.

gpt-4o

score 1.000 · 1164 tokens · 3505ms

The candidate solution correctly implements the function to check if there are two distinct elements in the list that sum to zero. It uses a nested loop to iterate over all pairs of elements, ensuring they are distinct by starting the inner loop from the next element. The solution correctly returns True if such a pair is found and False otherwise. The function passes all provided test cases, including edge cases like a single-element list and lists without any zero-sum pairs. The implementation is clear and idiomatic.