calib·HumanEval/44

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score 0.900 · inter-judge spread 0.200 · judges polled 2 · sampling draws 0

1. the task

The function signature + docstring presented to both the model under test and to every judge. The entry-point name was obfuscated from change_base to fn_5e2a4f7a73 so the judges can't pattern-match a memorized HumanEval solution.



def fn_5e2a4f7a73(x: int, base: int):
    """Change numerical base of input number x to base.
    return string representation after the conversion.
    base numbers are less than 10.
    >>> fn_5e2a4f7a73(8, 3)
    '22'
    >>> fn_5e2a4f7a73(8, 2)
    '1000'
    >>> fn_5e2a4f7a73(7, 2)
    '111'
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution with the task signature and docstring prepended, so that it parses as a complete program.



def fn_5e2a4f7a73(x: int, base: int):
    """Change numerical base of input number x to base.
    return string representation after the conversion.
    base numbers are less than 10.
    >>> fn_5e2a4f7a73(8, 3)
    '22'
    >>> fn_5e2a4f7a73(8, 2)
    '1000'
    >>> fn_5e2a4f7a73(7, 2)
    '111'
    """
    ret = ""
    while x > 0:
        ret = str(x % base) + ret
        x //= base
    return ret
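Both listings above use the obfuscated entry point fn_5e2a4f7a73 in place of change_base. The report does not say how that alias is derived, so the following is only a minimal sketch of one way such a rename could work. The helper name `obfuscate_entry_point`, the SHA-256 scheme, and the `salt` parameter are all hypothetical, and this sketch will not reproduce the alias shown in this run.

```python
import hashlib
import re


def obfuscate_entry_point(source: str, entry_point: str, salt: str = "") -> tuple[str, str]:
    """Rename `entry_point` throughout `source` to an opaque fn_<hex> alias.

    Hypothetical scheme: the alias is the first 10 hex characters of a
    SHA-256 digest. The real pipeline's scheme is not documented here.
    """
    digest = hashlib.sha256((salt + entry_point).encode("utf-8")).hexdigest()[:10]
    alias = f"fn_{digest}"
    # \b avoids rewriting longer identifiers that merely contain the name.
    renamed = re.sub(rf"\b{re.escape(entry_point)}\b", alias, source)
    return renamed, alias


# Toy prompt: the signature and one doctest line both get rewritten.
prompt = (
    "def change_base(x: int, base: int):\n"
    '    """\n'
    "    >>> change_base(8, 3)\n"
    "    '22'\n"
    '    """\n'
)
renamed, alias = obfuscate_entry_point(prompt, "change_base")
print(renamed)
```

Renaming the doctest calls along with the signature matters: if the original name survived in the docstring, a judge could still recognize the benchmark item.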

3a. point-pass scores (temperature 0)

One call per judge at temperature=0, giving each judge's single best guess. Disagreement here is structural rather than sampling noise: the judges read this candidate differently.

claude-sonnet · 0.800 · $0.0078

gpt-4o · 1.000 · $0.0034
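The headline figures at the top of this report can be recomputed from the two point-pass scores. The sketch below assumes that "inter-judge spread" means the range (max minus min); the report does not define it. With two judges the range gives 0.200, matching the header, whereas a standard deviation would give 0.100 (population) or about 0.141 (sample).

```python
from statistics import mean

# Point-pass scores copied from this section.
point_scores = {"claude-sonnet": 0.800, "gpt-4o": 1.000}

values = list(point_scores.values())
mean_score = mean(values)
# Assumption: spread is defined as the range across judges.
spread = max(values) - min(values)

print(f"mean score: {mean_score:.3f}")
print(f"inter-judge spread: {spread:.3f}")
```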

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

no sampling pass for this run
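For runs that do include a sampling pass, the mean bar and the ±1σ band described above could be computed roughly as follows. The draw values and judge names below are hypothetical placeholders, since this run has no draws, and the use of the population standard deviation is an assumption (the report does not say which estimator the band uses).

```python
from statistics import mean, pstdev


def dispersion(scores: list[float]) -> tuple[float, tuple[float, float]]:
    """Return the mean and the (mean - sigma, mean + sigma) band for one judge."""
    mu = mean(scores)
    # Assumption: population standard deviation; a sample estimate would be wider.
    sigma = pstdev(scores)
    return mu, (mu - sigma, mu + sigma)


# HYPOTHETICAL draws for illustration only; not data from this run.
draws = {
    "judge_a": [0.8, 0.9, 0.8, 1.0, 0.7],
    "judge_b": [1.0, 1.0, 1.0, 0.9, 1.0],
}

for judge, scores in draws.items():
    mu, (low, high) = dispersion(scores)
    print(f"{judge}: mean={mu:.3f} band=[{low:.3f}, {high:.3f}]")
```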

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale come from the same model response.

claude-sonnet

score 0.800 · 1884 tokens · 4536ms

The solution correctly handles all three provided examples and works for typical positive integers. The algorithm is standard and correct for positive x. However, it fails for x=0 (returns empty string "" instead of "0"), which is a plausible edge case. The task says "base numbers are less than 10" but doesn't restrict x to positive values, so x=0 is a realistic edge case. No handling of negative numbers either, but that's less clearly required. Minor issue only with x=0.

gpt-4o

score 1.000 · 1034 tokens · 1946ms

The candidate solution correctly implements the conversion of a number from base 10 to another base less than 10. It uses a loop to repeatedly divide the number by the base and prepend the remainder to the result string, which is the correct algorithm for base conversion. The solution passes all provided examples and handles edge cases such as converting 0, which would return an empty string, but this is consistent with the task's description and examples. The code is clear and idiomatic.
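Both rationales agree on what the candidate does for x=0 and differ only on whether it matters. That behavior is easy to confirm directly. The sketch below copies the candidate's body under a shorter, hypothetical name (`candidate`) and checks the docstring examples and the two edge cases the judges mention.

```python
def candidate(x: int, base: int) -> str:
    # Body copied from the candidate response in section 2.
    ret = ""
    while x > 0:
        ret = str(x % base) + ret
        x //= base
    return ret


# The three docstring examples pass.
assert candidate(8, 3) == "22"
assert candidate(8, 2) == "1000"
assert candidate(7, 2) == "111"

# Edge cases raised in the rationales: the loop body never runs.
print(repr(candidate(0, 2)))   # ''
print(repr(candidate(-8, 2)))  # ''
```

Whether the empty string for x=0 counts as a defect depends on how the task is read, since the docstring states a constraint on base but none on x. That reading is what separates the 0.800 score from the 1.000 score.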