calib·HumanEval/48

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score and rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.900

inter-judge spread

0.200

judges polled

2

sampling draws

0

1. the task

The function signature and docstring presented to both the model under test and every judge. The entry-point name was obfuscated from is_palindrome to fn_316279a43c so that the judges can't pattern-match a memorized HumanEval solution.



def fn_316279a43c(text: str):
    """
    Checks if given string is a palindrome
    >>> fn_316279a43c('')
    True
    >>> fn_316279a43c('aba')
    True
    >>> fn_316279a43c('aaaaa')
    True
    >>> fn_316279a43c('zbcd')
    False
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution with the task signature and docstring prepended, so that it parses as a complete program.



def fn_316279a43c(text: str):
    """
    Checks if given string is a palindrome
    >>> fn_316279a43c('')
    True
    >>> fn_316279a43c('aba')
    True
    >>> fn_316279a43c('aaaaa')
    True
    >>> fn_316279a43c('zbcd')
    False
    """
    for i in range(len(text)):
        if text[i] != text[len(text) - 1 - i]:
            return False
    return True
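
The preparation described in sections 1 and 2 (renaming the entry point, then prepending the task signature to the reference body) might look roughly like the sketch below. The hashing scheme, the salt, and the helper names obfuscated_name and build_candidate are assumptions made for illustration; the source does not say how fn_316279a43c was actually derived, and the docstring is abbreviated to two examples.

```python
import ast
import hashlib
import re

ORIGINAL_NAME = "is_palindrome"

# Task prompt before renaming (docstring abbreviated for the sketch).
PROMPT = '''def is_palindrome(text: str):
    """
    Checks if given string is a palindrome
    >>> is_palindrome('aba')
    True
    >>> is_palindrome('zbcd')
    False
    """
'''

# Reference body, exactly as shown in the candidate above.
BODY = '''    for i in range(len(text)):
        if text[i] != text[len(text) - 1 - i]:
            return False
    return True
'''


def obfuscated_name(name: str, salt: str = "demo-salt") -> str:
    """Hypothetical scheme: 'fn_' plus 10 hex chars of a salted SHA-256."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return "fn_" + digest[:10]


def build_candidate(prompt: str, body: str, old_name: str):
    """Rename the entry point everywhere in the prompt, then append the body."""
    new_name = obfuscated_name(old_name)
    renamed = re.sub(rf"\b{re.escape(old_name)}\b", new_name, prompt)
    source = renamed + body
    ast.parse(source)  # raises SyntaxError if the result is not a complete program
    return new_name, source


new_name, source = build_candidate(PROMPT, BODY, ORIGINAL_NAME)

# Execute the assembled program and try the renamed entry point.
namespace = {}
exec(source, namespace)
fn = namespace[new_name]
print(new_name, fn("aba"), fn("zbcd"))
```

Renaming with a word-boundary pattern also rewrites the doctest lines, so the docstring stays consistent with the new entry-point name.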

3a. point-pass scores (temperature 0)

One call per judge at temperature=0, giving each judge's "best single guess". Disagreement here is structural rather than sampling noise: the judges assess this candidate differently.

claude-sonnet

1.000

$0.0073

gpt-4o

0.800

$0.0032
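
The headline figures at the top of the page can be reproduced from these two point-pass scores. The sketch below assumes that "inter-judge spread" means the range (maximum minus minimum); that reading is consistent with the reported 0.200, but the source does not define the statistic.

```python
# Point-pass scores copied from the listing above.
point_scores = {"claude-sonnet": 1.000, "gpt-4o": 0.800}

values = list(point_scores.values())
mean_score = sum(values) / len(values)

# Assumption: spread is the range across judges, not a standard deviation.
spread = max(values) - min(values)

print(f"mean score: {mean_score:.3f}")          # mean score: 0.900
print(f"inter-judge spread: {spread:.3f}")      # inter-judge spread: 0.200
```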

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; the vertical bar is the mean; the shaded band is ±1σ. A wide band means the judge disagrees with itself across draws; a narrow band means it is consistent.

no sampling pass for this run
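
For runs that do include a sampling pass, the mean and the ±1σ band could be computed along the lines below. This run has no draws, so the values are invented purely for illustration. The function name dispersion and the choice of the population standard deviation are also assumptions, since the source does not say which estimator the plot uses.

```python
import statistics


def dispersion(draws):
    """Return (mean, sigma, band) for one judge's temperature=1 draws."""
    mean = statistics.fmean(draws)
    # Assumption: population standard deviation; a sample estimator
    # (statistics.stdev) would give a slightly wider band.
    sigma = statistics.pstdev(draws)
    return mean, sigma, (mean - sigma, mean + sigma)


# Hypothetical draws for a single judge; not data from this run.
example_draws = [0.8, 0.9, 0.8, 1.0, 0.9]

mean, sigma, band = dispersion(example_draws)
print(f"mean={mean:.3f} sigma={sigma:.3f} band=({band[0]:.3f}, {band[1]:.3f})")
```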

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same model response.

claude-sonnet

score 1.000 · 1797 tokens · 2930ms

The solution correctly checks if a string is a palindrome by comparing characters from both ends. It handles all provided test cases: empty string returns True, 'aba' returns True, 'aaaaa' returns True, 'zbcd' returns False. The logic is sound for all edge cases including single characters, even-length strings, and odd-length strings. The implementation is clear and idiomatic.

gpt-4o

score 0.800 · 960 tokens · 1906ms

The candidate solution correctly checks if a given string is a palindrome by comparing characters from the start and end of the string moving towards the center. It passes the provided examples, including the edge case of an empty string. However, the loop iterates over the entire length of the string, which is unnecessary. It should only iterate up to the midpoint, as checking beyond that is redundant. This is a minor efficiency issue, hence the score of 0.80.
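
The efficiency point in the gpt-4o rationale can be illustrated with a short comparison. The sketch below is not part of the pipeline; the names full_scan and half_scan are made up here. It places the candidate's full-length loop next to a variant that stops at the midpoint and checks that the two agree on a handful of inputs.

```python
def full_scan(text: str) -> bool:
    """The candidate's loop: compares every index with its mirror."""
    for i in range(len(text)):
        if text[i] != text[len(text) - 1 - i]:
            return False
    return True


def half_scan(text: str) -> bool:
    """Variant that stops at the midpoint, as the rationale suggests."""
    n = len(text)
    for i in range(n // 2):
        if text[i] != text[n - 1 - i]:
            return False
    return True


# Docstring examples plus a few extra even- and odd-length cases.
samples = ["", "a", "aba", "aaaaa", "zbcd", "abba", "abca", "ab"]
mismatches = [s for s in samples if full_scan(s) != half_scan(s)]
print("mismatches:", mismatches)
```

Both versions return the same result on these inputs. The full scan repeats each comparison from the other side, so for a palindrome it does roughly twice the comparisons, while for a non-palindrome both versions exit at the same first mismatch.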