calib·HumanEval/38

humaneval-calibprobe · code

This page follows one (task, candidate response) pair through the full PANOPTES pipeline. Each section below is one stage of the analysis: the task itself, the candidate solution being evaluated, each judge's score and rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.750

inter-judge spread

0.500

judges polled

2

sampling draws

0

1. the task

The function signature and docstring presented to both the model under test and every judge. The entry-point name was obfuscated from decode_cyclic to fn_ac049a10d4 so the judges can't pattern-match a memorized HumanEval solution.



def encode_cyclic(s: str):
    """
    returns encoded string by cycling groups of three characters.
    """
    # split string to groups. Each of length 3.
    groups = [s[(3 * i):min((3 * i + 3), len(s))] for i in range((len(s) + 2) // 3)]
    # cycle elements in each group. Unless group has fewer elements than 3.
    groups = [(group[1:] + group[0]) if len(group) == 3 else group for group in groups]
    return "".join(groups)


def fn_ac049a10d4(s: str):
    """
    takes as input string encoded with encode_cyclic function. Returns decoded string.
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution with the task signature prepended, so that it parses as a complete program.



def encode_cyclic(s: str):
    """
    returns encoded string by cycling groups of three characters.
    """
    # split string to groups. Each of length 3.
    groups = [s[(3 * i):min((3 * i + 3), len(s))] for i in range((len(s) + 2) // 3)]
    # cycle elements in each group. Unless group has fewer elements than 3.
    groups = [(group[1:] + group[0]) if len(group) == 3 else group for group in groups]
    return "".join(groups)


def fn_ac049a10d4(s: str):
    """
    takes as input string encoded with encode_cyclic function. Returns decoded string.
    """
    return encode_cyclic(encode_cyclic(s))
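
A quick way to sanity-check this candidate is a round-trip test: encode a string, decode it, and compare with the original. The sketch below is not part of the pipeline; the sample strings are illustrative assumptions chosen to cover lengths 0 through 7, and the encoder is restated in a slightly shorter but equivalent form.

```python
def encode_cyclic(s: str) -> str:
    """Rotate each full group of three characters left by one."""
    # Slicing clamps at the end of the string, so no explicit min() is needed.
    groups = [s[3 * i:3 * i + 3] for i in range((len(s) + 2) // 3)]
    return "".join(g[1:] + g[0] if len(g) == 3 else g for g in groups)


def fn_ac049a10d4(s: str) -> str:
    """Candidate decoder: two more left rotations complete the 3-cycle."""
    return encode_cyclic(encode_cyclic(s))


# Illustrative inputs (assumed, not taken from the benchmark's test suite).
samples = ["", "a", "ab", "abc", "abcd", "abcde", "abcdef", "abcdefg"]
round_trip_ok = all(fn_ac049a10d4(encode_cyclic(s)) == s for s in samples)
print(round_trip_ok)
```

Trailing groups of one or two characters pass through the encoder unchanged, so the decoder leaves them unchanged as well.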

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is each judge's "best single guess". Disagreement here is structural rather than sampling noise: the judges genuinely assess this candidate differently.

claude-sonnet

1.000

$0.0083

gpt-4o

0.500

$0.0037
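
The headline numbers at the top of the page can be reproduced from these two point-pass scores. The sketch below assumes that "inter-judge spread" means the range (maximum minus minimum); that definition is an assumption, although it is consistent with the reported 0.500.

```python
from statistics import mean

# Point-pass scores as reported in this section.
point_scores = {"claude-sonnet": 1.0, "gpt-4o": 0.5}

mean_score = mean(point_scores.values())
# Assumption: spread is the range across judges (max minus min).
spread = max(point_scores.values()) - min(point_scores.values())

print(f"{mean_score:.3f} {spread:.3f}")
```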

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; the vertical bar is the mean; the shaded band is ±1σ. A wide band means the judge is inconsistent even with itself; a narrow band means it is consistent.

no sampling pass for this run
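
For reference, the mean and ±1σ band described above could be computed along the following lines. This run recorded no draws, so the values below are purely hypothetical placeholders, and the choice of population standard deviation is an assumption (the sample standard deviation is an equally plausible choice).

```python
from statistics import mean, pstdev

# Hypothetical draws for a single judge; this run has no sampling pass.
draws = [1.0, 0.5, 1.0, 1.0, 0.5]

mu = mean(draws)
# Assumption: population standard deviation; statistics.stdev would give the sample version.
sigma = pstdev(draws)
band = (mu - sigma, mu + sigma)

print(f"mean={mu:.3f} band=({band[0]:.3f}, {band[1]:.3f})")
```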

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same forward pass.
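
As a rough illustration of why structured output keeps the score and rationale paired, the sketch below parses both fields from a single payload. The `JudgeVerdict` record, the `parse_verdict` helper, the field names, and the 0 to 1 score range are all hypothetical; the pipeline's actual schema is not shown on this page.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical record: one score and its rationale from a single judge call."""
    score: float
    rationale: str


def parse_verdict(payload: str) -> JudgeVerdict:
    """Parse one structured-output payload; both fields must be present together."""
    data = json.loads(payload)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:  # assumed score range
        raise ValueError(f"score out of range: {score}")
    return JudgeVerdict(score=score, rationale=str(data["rationale"]))


# Illustrative payload, not a real judge response.
verdict = parse_verdict('{"score": 0.5, "rationale": "partially correct"}')
print(verdict.score, verdict.rationale)
```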

claude-sonnet

score 1.000 · 1988 tokens · 4933ms

The encode_cyclic function performs a left rotation by 1 on each group of 3 characters (ABC → BCA). To decode, you need to reverse this: apply the rotation twice more (since 3 rotations = identity, so 2 more rotations = inverse). Calling encode_cyclic twice on the encoded string achieves exactly this: BCA → CAB → ABC. This is correct for all cases: full groups of 3 (rotated twice more = original), partial groups of 1 or 2 (left unchanged by encode_cyclic, so applying it twice still leaves them unchanged). The solution handles all edge cases correctly including empty strings.

gpt-4o

score 0.500 · 1103 tokens · 2266ms

The candidate solution attempts to decode a string encoded with the `encode_cyclic` function by applying the `encode_cyclic` function twice. This approach works because applying the cyclic shift twice on a group of three characters returns the group to its original order. However, the solution lacks clarity and explicitness in explaining why this works. Additionally, it does not handle edge cases explicitly, such as when the input string length is not a multiple of three, although the logic inherently handles it correctly. The solution is partially correct but lacks clarity and explicit handling of edge cases.