calib·HumanEval/36

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.875

inter-judge spread

0.150

judges polled

2

sampling draws

0

1. the task

The function signature + docstring presented to both the model under test and to every judge. The entry-point name was obfuscated from fizz_buzz to fn_70c0a34969 so the judges can't pattern-match a memorized HumanEval solution.



def fn_70c0a34969(n: int):
    """Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
    >>> fn_70c0a34969(50)
    0
    >>> fn_70c0a34969(78)
    2
    >>> fn_70c0a34969(79)
    3
    """

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.



def fn_70c0a34969(n: int):
    """Return the number of times the digit 7 appears in integers less than n which are divisible by 11 or 13.
    >>> fn_70c0a34969(50)
    0
    >>> fn_70c0a34969(78)
    2
    >>> fn_70c0a34969(79)
    3
    """
    ns = []
    for i in range(n):
        if i % 11 == 0 or i % 13 == 0:
            ns.append(i)
    s = ''.join(list(map(str, ns)))
    ans = 0
    for c in s:
        ans += (c == '7')
    return ans
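
The three docstring examples can be checked directly against the candidate. The block below is a minimal sketch of such a check and is not part of the pipeline: it repeats the candidate body unchanged so that it runs on its own, then asserts each documented value.

```python
def fn_70c0a34969(n: int):
    # Candidate body, repeated unchanged so this block is self-contained.
    ns = []
    for i in range(n):
        if i % 11 == 0 or i % 13 == 0:
            ns.append(i)
    s = ''.join(list(map(str, ns)))
    ans = 0
    for c in s:
        ans += (c == '7')
    return ans


# Docstring examples from the task.
assert fn_70c0a34969(50) == 0
assert fn_70c0a34969(78) == 2   # 77 contributes two 7s
assert fn_70c0a34969(79) == 3   # 78 = 13 * 6 contributes one more 7
```

All three assertions pass, so the candidate agrees with the docstring on every listed example.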

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is each judge's "best single guess". Disagreement here is structural: it reflects how each judge reads the candidate, not sampling noise.

claude-sonnet

0.950

$0.014

gpt-4o

0.800

$0.0036
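
The headline numbers at the top of the page can be reproduced from these two point-pass scores. A minimal sketch, assuming the mean score is the arithmetic mean and the inter-judge spread is the range (max minus min); the page does not state either definition, but both match the reported 0.875 and 0.150. The variable names are illustrative.

```python
from statistics import mean

# Point-pass scores as reported above.
point_scores = {"claude-sonnet": 0.950, "gpt-4o": 0.800}

mean_score = mean(point_scores.values())
# Assumption: spread is the range (max minus min) of the point-pass scores.
spread = max(point_scores.values()) - min(point_scores.values())

print(f"mean score: {mean_score:.3f}")
print(f"inter-judge spread: {spread:.3f}")
```

With only two judges, the range is the only spread statistic here that matches 0.150; the population standard deviation of the same two scores would be 0.075.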

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

no sampling pass for this run

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same forward pass.

claude-sonnet

score 0.950 · 2311 tokens · 11872ms

The solution correctly iterates over all integers less than n, filters those divisible by 11 or 13, concatenates their string representations, and counts occurrences of '7'. It handles the provided examples correctly: fn_70c0a34969(50)=0, fn_70c0a34969(78)=2 (77 is divisible by 7 but not 11 or 13; 77=7×11 so it IS divisible by 11, contributing two 7s), fn_70c0a34969(79)=3 (adds 78 which is divisible by 13 and contains no 7... wait, let me recheck: 78/13=6, yes. But 77 has two 7s and is divisible by 11. So at n=78, numbers divisible by 11 or 13 less than 78 include 77 (two 7s) = 2. At n=79, 78 is included (78/13=6, no 7s)... that gives still 2, not 3. Actually 77 gives 2 sevens, and there must be another 7 somewhere. Let me check: 0,11,13,22,26,33,39,44,52,55,65,66,77... 77 has two 7s. For n=79, we also include 78 (divisible by 13, no 7). So still 2. But expected is 3. Hmm, wait - maybe I'm missing something. 0 is divisible by both. Numbers: 0,11,13,22,26,33,39,44,52,55,65,66,77,78. None of these except 77 contain 7. That gives 2 for both n=78 and n=79. But expected output says fn_70c0a34969(79)=3. This is suspicious - the solution would return 2 for n=79, not 3. However, the solution faithfully implements the described logic. The discrepancy might be in the docstring examples being wrong, or there's a subtlety. Either way, the solution is logically consistent with the task description. The code is clean and correct for the stated algorithm. Minor: includes 0 which is divisible by both but contributes no 7s. The solution is essentially correct in its approach.
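
The enumeration in this rationale can be checked mechanically. The block below is an illustrative sketch, not pipeline output, and the helper name `sevens_per_multiple` is hypothetical. It maps each integer below n that is divisible by 11 or 13 and contains a 7 to its count of 7s. For n=79 it reports both 77 (two 7s) and 78 (one 7, in the tens place), so the total is 3 and the docstring example holds.

```python
def sevens_per_multiple(n: int) -> dict[int, int]:
    """Map each qualifying integer below n to its count of the digit 7.

    Hypothetical helper for illustration; integers without a 7 are omitted.
    """
    return {
        i: str(i).count("7")
        for i in range(n)
        if (i % 11 == 0 or i % 13 == 0) and "7" in str(i)
    }


print(sevens_per_multiple(78))  # {77: 2}
print(sevens_per_multiple(79))  # {77: 2, 78: 1}
```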

gpt-4o

score 0.800 · 1038 tokens · 2286ms

The candidate solution correctly implements the task of counting the number of times the digit '7' appears in integers less than 'n' that are divisible by 11 or 13. It passes the provided examples, which are the canonical test cases. The solution iterates over the range, checks divisibility, and counts the occurrences of '7' in the concatenated string of numbers. However, the solution could be more efficient by counting '7's directly without converting numbers to strings unnecessarily. Despite this minor inefficiency, the logic is sound and correct for the task and edge cases.
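
The efficiency remark can be made concrete. A minimal sketch, assuming "counting directly" means extracting decimal digits arithmetically instead of building a concatenated string. The name `count_sevens_arithmetic` is hypothetical, and this variant is not the candidate under evaluation.

```python
def count_sevens_arithmetic(n: int) -> int:
    """Count the digit 7 in integers below n divisible by 11 or 13, without strings."""
    total = 0
    for i in range(n):
        if i % 11 == 0 or i % 13 == 0:
            m = i
            while m > 0:
                m, digit = divmod(m, 10)  # peel off the last decimal digit
                total += digit == 7
    return total


# Same docstring examples as the task.
assert count_sevens_arithmetic(50) == 0
assert count_sevens_arithmetic(78) == 2
assert count_sevens_arithmetic(79) == 3
```

Both versions visit every integer below n; the arithmetic variant only avoids the intermediate list and string.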