calib·HumanEval/46

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, each judge's score and rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.975

inter-judge spread

0.050

judges polled

2

sampling draws

0

1. the task

The function signature and docstring presented to both the model under test and every judge. The entry-point name was obfuscated from fib4 to fn_14ccd9bcc3 so that the judges cannot pattern-match a memorized HumanEval solution.
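The renaming step described above can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the helper name obfuscate_entry_point, the salt parameter, and the choice of a truncated SHA-256 digest are all assumptions, and the sketch is not expected to reproduce the alias fn_14ccd9bcc3.

```python
import hashlib
import re
from typing import Tuple


def obfuscate_entry_point(source: str, entry_point: str, salt: str = "") -> Tuple[str, str]:
    """Replace every whole-word occurrence of entry_point with an opaque alias.

    Hypothetical helper: the hashing scheme is an assumption, not PANOPTES's own.
    """
    digest = hashlib.sha256((salt + entry_point).encode("utf-8")).hexdigest()[:10]
    alias = f"fn_{digest}"
    # \b keeps the match to whole identifiers; the match is case-sensitive,
    # so a capitalized mention such as "Fib4" in prose is left alone.
    pattern = re.compile(rf"\b{re.escape(entry_point)}\b")
    return pattern.sub(alias, source), alias


source = 'def fib4(n: int):\n    """The Fib4 sequence: fib4(0) -> 0"""\n'
obfuscated, alias = obfuscate_entry_point(source, "fib4")
print(obfuscated)
```

The case-sensitive match is consistent with the docstring shown in this section, where "Fib4 number sequence" keeps its original spelling while every call-style reference is renamed.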



def fn_14ccd9bcc3(n: int):
    """The Fib4 number sequence is a sequence similar to the Fibbonacci sequnece that's defined as follows:
    fn_14ccd9bcc3(0) -> 0
    fn_14ccd9bcc3(1) -> 0
    fn_14ccd9bcc3(2) -> 2
    fn_14ccd9bcc3(3) -> 0
    fn_14ccd9bcc3(n) -> fn_14ccd9bcc3(n-1) + fn_14ccd9bcc3(n-2) + fn_14ccd9bcc3(n-3) + fn_14ccd9bcc3(n-4).
    Please write a function to efficiently compute the n-th element of the fn_14ccd9bcc3 number sequence.  Do not use recursion.
    >>> fn_14ccd9bcc3(5)
    4
    >>> fn_14ccd9bcc3(6)
    8
    >>> fn_14ccd9bcc3(7)
    14
    """

2. the candidate response judged

The full solution that every judge grades. For these runs, the candidate is the reference solution with the task signature and docstring prepended, so that it parses as a complete program.
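A minimal sketch of that assembly step, assuming the task prompt and the reference body are available as two strings. The helper name build_candidate and the toy prompt are hypothetical and stand in for the real task data.

```python
import ast


def build_candidate(prompt: str, body: str) -> str:
    """Join a signature/docstring prompt with a solution body and check that it parses."""
    candidate = prompt.rstrip("\n") + "\n" + body
    ast.parse(candidate)  # raises SyntaxError if the result is not a complete program
    return candidate


# Toy stand-ins for the task prompt and the reference body (illustrative only).
prompt = 'def double(n: int):\n    """Return n doubled."""\n'
body = "    return 2 * n\n"

candidate = build_candidate(prompt, body)
namespace = {}
exec(candidate, namespace)  # defines double() inside namespace
```

Parsing before judging catches the case where a body alone would be rejected as an indentation error rather than graded on its merits.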



def fn_14ccd9bcc3(n: int):
    """The Fib4 number sequence is a sequence similar to the Fibbonacci sequnece that's defined as follows:
    fn_14ccd9bcc3(0) -> 0
    fn_14ccd9bcc3(1) -> 0
    fn_14ccd9bcc3(2) -> 2
    fn_14ccd9bcc3(3) -> 0
    fn_14ccd9bcc3(n) -> fn_14ccd9bcc3(n-1) + fn_14ccd9bcc3(n-2) + fn_14ccd9bcc3(n-3) + fn_14ccd9bcc3(n-4).
    Please write a function to efficiently compute the n-th element of the fn_14ccd9bcc3 number sequence.  Do not use recursion.
    >>> fn_14ccd9bcc3(5)
    4
    >>> fn_14ccd9bcc3(6)
    8
    >>> fn_14ccd9bcc3(7)
    14
    """
    results = [0, 0, 2, 0]
    if n < 4:
        return results[n]

    for _ in range(4, n + 1):
        results.append(results[-1] + results[-2] + results[-3] + results[-4])
        results.pop(0)

    return results[-1]

3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is each judge's "best single guess". Disagreement here is structural: the judges see this candidate differently.

claude-sonnet

0.950

$0.010

gpt-4o

1.000

$0.0040
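The headline numbers at the top of the page can be reproduced from these two point scores. The sketch below assumes that "inter-judge spread" means the maximum score minus the minimum score. That definition is an assumption, but it is the one that yields 0.050 here; a standard deviation over the two scores would give 0.025 (population) or about 0.035 (sample).

```python
from statistics import mean

# Point-pass scores reported above.
point_scores = {"claude-sonnet": 0.950, "gpt-4o": 1.000}

values = list(point_scores.values())
mean_score = mean(values)
spread = max(values) - min(values)  # assumed definition: range across judges

print(f"mean score         {mean_score:.3f}")
print(f"inter-judge spread {spread:.3f}")
```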

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; the vertical bar is the mean; the shaded band is ±1σ. A wide band means the judge is inconsistent with itself; a narrow band means it is consistent.

no sampling pass for this run
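For runs that do include a sampling pass, the mean and ±1σ band could be computed along these lines. This is a sketch under assumptions: the draws are made-up placeholder values (this run has none), the helper name dispersion_band is hypothetical, and σ is taken to be the sample standard deviation, which the page does not specify.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def dispersion_band(draws: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper), where the band is mean ± one sample std dev."""
    mu = mean(draws)
    # stdev needs at least two points; treat a single draw as zero dispersion.
    sigma = stdev(draws) if len(draws) > 1 else 0.0
    return mu, mu - sigma, mu + sigma


# Placeholder draws for one judge; not data from this run.
hypothetical_draws = [0.90, 0.95, 1.00, 0.95]
mu, lower, upper = dispersion_band(hypothetical_draws)
print(f"mean={mu:.3f} band=[{lower:.3f}, {upper:.3f}]")
```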

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale come from the same model response.

claude-sonnet

score 0.950 · 2322 tokens · 5145ms

The solution correctly implements the Fib4 sequence iteratively (no recursion). Base cases [0,0,2,0] are correct. The sliding window approach (append + pop(0)) correctly computes each next value. Verified: fib4(4)=0+2+0+0=2, fib4(5)=2+0+2+0=4✓, fib4(6)=4+2+0+2=8✓, fib4(7)=8+4+2+0=14✓. Edge cases n=0,1,2,3 handled correctly. Minor inefficiency: pop(0) on a list is O(n), but functionally correct. Could use a deque or index-based approach for O(1) pops, but this is a minor style/efficiency concern, not a correctness issue.
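The deque alternative mentioned in this rationale might look like the sketch below. The name fib4_deque is hypothetical and this is not part of the judged candidate. Because the candidate's list never holds more than five elements, the cost of its pop(0) is bounded by a small constant, so the change is mainly one of idiom.

```python
from collections import deque


def fib4_deque(n: int) -> int:
    """Iterative Fib4 using a fixed-size sliding window (illustrative variant)."""
    base = [0, 0, 2, 0]
    if n < 4:
        return base[n]

    # maxlen=4 makes append() drop the oldest value automatically.
    window = deque(base, maxlen=4)
    for _ in range(4, n + 1):
        window.append(sum(window))  # sum is taken before the oldest value is evicted
    return window[-1]


print([fib4_deque(i) for i in range(8)])
```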

gpt-4o

score 1.000 · 1295 tokens · 1554ms

The candidate solution correctly implements the Fib4 sequence as described in the task. It initializes the sequence with the correct base cases and uses an iterative approach to compute the n-th element without recursion, as required. The solution efficiently maintains only the last four elements of the sequence, ensuring optimal space usage. It correctly handles edge cases for n < 4 by returning precomputed values. The solution passes all provided examples and is idiomatic and clear.