calib·HumanEval/39

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, each judge's score and rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.825

inter-judge spread

0.050

judges polled

2

sampling draws

0

1. the task

The function signature and docstring presented to both the model under test and every judge. The entry-point name was obfuscated from prime_fib to fn_1b36ebe371 so that the judges cannot pattern-match a memorized HumanEval solution.



def fn_1b36ebe371(n: int):
    """
    fn_1b36ebe371 returns n-th number that is a Fibonacci number and it's also prime.
    >>> fn_1b36ebe371(1)
    2
    >>> fn_1b36ebe371(2)
    3
    >>> fn_1b36ebe371(3)
    5
    >>> fn_1b36ebe371(4)
    13
    >>> fn_1b36ebe371(5)
    89
    """

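The renaming step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the helper name `obfuscate_entry_point`, the `salt` parameter, and the choice of a truncated SHA-1 digest are all assumptions, and the alias it produces is not expected to equal fn_1b36ebe371.

```python
import hashlib
import re


def obfuscate_entry_point(source: str, name: str, salt: str = "") -> tuple[str, str]:
    """Replace every whole-word occurrence of `name` with a hashed alias.

    Hypothetical helper: the real hashing scheme is not stated in this report.
    """
    digest = hashlib.sha1((salt + name).encode("utf-8")).hexdigest()[:10]
    alias = f"fn_{digest}"
    # \b leaves longer identifiers that merely contain `name` untouched.
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return pattern.sub(alias, source), alias


prompt = (
    "def prime_fib(n: int):\n"
    '    """\n'
    "    prime_fib returns n-th number that is a Fibonacci number and it's also prime.\n"
    "    >>> prime_fib(1)\n"
    "    2\n"
    '    """\n'
)
obfuscated, alias = obfuscate_entry_point(prompt, "prime_fib")
print(obfuscated)
```

Rewriting the docstring and doctest lines along with the signature matters here, since a leftover prime_fib in the examples would give the original name away.
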
2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.



def fn_1b36ebe371(n: int):
    """
    fn_1b36ebe371 returns n-th number that is a Fibonacci number and it's also prime.
    >>> fn_1b36ebe371(1)
    2
    >>> fn_1b36ebe371(2)
    3
    >>> fn_1b36ebe371(3)
    5
    >>> fn_1b36ebe371(4)
    13
    >>> fn_1b36ebe371(5)
    89
    """
    import math

    def is_prime(p):
        if p < 2:
            return False
        for k in range(2, min(int(math.sqrt(p)) + 1, p - 1)):
            if p % k == 0:
                return False
        return True
    f = [0, 1]
    while True:
        f.append(f[-1] + f[-2])
        if is_prime(f[-1]):
            n -= 1
        if n == 0:
            return f[-1]
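
How a candidate like the one above might be assembled is sketched below, assuming the prompt and the reference body are stored as separate strings (as in the HumanEval `prompt` and `canonical_solution` fields). The helper name `build_candidate` and the one-line stand-in body are hypothetical.

```python
import ast


def build_candidate(prompt: str, reference_body: str) -> str:
    """Concatenate the task prompt with the reference body and check it parses."""
    candidate = prompt + reference_body
    ast.parse(candidate)  # raises SyntaxError if the pieces do not form valid Python
    return candidate


prompt = (
    "def fn_1b36ebe371(n: int):\n"
    '    """Return the n-th Fibonacci number that is also prime."""\n'
)
# Stand-in body for illustration only; not the reference solution.
reference_body = "    return n\n"

candidate = build_candidate(prompt, reference_body)
tree = ast.parse(candidate)
defined = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
print(defined)
```

The indented body is not valid Python at module level on its own, which is why the signature has to be prepended before the judges or a test harness can treat it as a complete program.
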

3a. point-pass scores (temperature 0)

One call per judge at temperature=0, giving each judge's single best guess. Because decoding at this setting is close to deterministic, disagreement here is structural rather than sampling noise: the judges assess this candidate differently.

claude-sonnet

score 0.850 · $0.0092

gpt-4o

score 0.800 · $0.0039
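
The headline numbers can be reproduced from the two point-pass scores. The sketch below assumes that "inter-judge spread" means the maximum score minus the minimum score. The report does not state the definition, but this reading matches the reported 0.050, whereas a population standard deviation of the same two scores would be 0.025.

```python
from statistics import mean

# Point-pass scores as reported in this section.
point_scores = {"claude-sonnet": 0.850, "gpt-4o": 0.800}

values = list(point_scores.values())
mean_score = mean(values)
# Assumption: spread is the range (max - min) across judges.
spread = max(values) - min(values)

print(f"mean score: {mean_score:.3f}")
print(f"inter-judge spread: {spread:.3f}")
```
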

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. Wide band = the judge is uncertain even with itself; narrow band = it's consistent.

no sampling pass for this run
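
Although this run recorded no draws, the mean and ±1σ band described above could be computed along these lines. The helper name `dispersion_band`, the use of the sample standard deviation, and the example draws are assumptions made for illustration.

```python
from statistics import mean, stdev


def dispersion_band(draws: list[float]) -> tuple[float, float, float]:
    """Return (lower, mean, upper) for a mean ± 1σ band over one judge's draws."""
    if len(draws) < 2:
        raise ValueError("need at least two draws to estimate a standard deviation")
    mu = mean(draws)
    # Sample standard deviation (n - 1 denominator); the pipeline may use another estimator.
    sigma = stdev(draws)
    return mu - sigma, mu, mu + sigma


# Hypothetical draws for illustration only; this run recorded none.
example_draws = [0.80, 0.85, 0.90]
lower, mu, upper = dispersion_band(example_draws)
print(f"mean={mu:.3f} band=[{lower:.3f}, {upper:.3f}]")
```
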

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same model response.
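
A structured verdict of this kind might look like the sketch below, in which one JSON object carries both fields. The schema, the name `parse_verdict`, and the 0 to 1 score range are assumptions; the pipeline's actual tool definition is not shown in this report.

```python
import json

# Hypothetical JSON Schema for a single tool call that carries both fields.
VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "rationale": {"type": "string"},
    },
    "required": ["score", "rationale"],
}


def parse_verdict(tool_arguments: str) -> tuple[float, str]:
    """Parse the JSON arguments of one tool call into (score, rationale)."""
    payload = json.loads(tool_arguments)
    missing = [key for key in VERDICT_SCHEMA["required"] if key not in payload]
    if missing:
        raise ValueError(f"verdict is missing fields: {missing}")
    score = float(payload["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, str(payload["rationale"])


score, rationale = parse_verdict('{"score": 0.85, "rationale": "Passes all examples."}')
print(score, rationale)
```

Requiring both fields in one object is what ties the rationale to the score: a response that supplies only one of them is rejected instead of being paired with text from a separate call.
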

claude-sonnet

score 0.850 · 2063 tokens · 5140ms

The solution correctly generates Fibonacci numbers and checks primality, returning the n-th Fibonacci prime. It passes all five provided examples. The `is_prime` function has a minor issue: `range(2, min(int(math.sqrt(p)) + 1, p - 1))` — for p=2, this becomes `range(2, min(2, 1))` = `range(2, 1)` which is empty, so it correctly returns True. For p=3, `range(2, min(3, 2))` = `range(2, 2)` which is empty, correctly returns True. The logic works for all small primes. The `min(..., p-1)` guard is unnecessary but harmless for primes ≥ 2. The solution is functionally correct for all realistic inputs, with only a minor style concern about the redundant `min` bound.
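
The range arithmetic in this rationale can be checked directly. The sketch below reuses the exact bound expression from the candidate's `is_prime`; the helper name `trial_divisor_range` is introduced here for illustration.

```python
import math


def trial_divisor_range(p: int) -> range:
    """The divisor range used by the candidate's is_prime for a given p."""
    return range(2, min(int(math.sqrt(p)) + 1, p - 1))


for p in (2, 3, 4, 9, 13):
    r = trial_divisor_range(p)
    print(f"p={p}: stop={r.stop}, divisors tried={list(r)}")
```

For p = 2 and p = 3 the range is empty, so both are reported prime. Note that `int(math.sqrt(3)) + 1` evaluates to 2, so the stop value for p = 3 is min(2, 2) = 2.
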

gpt-4o

score 0.800 · 1140 tokens · 2236ms

The candidate solution correctly implements the task of finding the n-th Fibonacci number that is also prime. It uses a helper function `is_prime` to check for primality and generates Fibonacci numbers in a loop until the n-th prime Fibonacci number is found. The solution passes the provided examples and handles the task correctly. However, there is a minor inefficiency in the `is_prime` function, where the loop could be optimized to iterate only up to the square root of `p`, inclusive, without the `min` function. This is a minor style issue and does not affect correctness, hence the score of 0.80.
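
Whether dropping the `min` changes anything can be tested empirically. The sketch below compares the candidate's primality check with a simplified variant over a small range; both function names are introduced here for illustration, and the 10,000 upper limit is arbitrary.

```python
import math


def is_prime_candidate(p: int) -> bool:
    """Primality check as written in the candidate, including the min(...) bound."""
    if p < 2:
        return False
    for k in range(2, min(int(math.sqrt(p)) + 1, p - 1)):
        if p % k == 0:
            return False
    return True


def is_prime_simplified(p: int) -> bool:
    """Same check with the min(...) guard removed."""
    if p < 2:
        return False
    for k in range(2, int(math.sqrt(p)) + 1):
        if p % k == 0:
            return False
    return True


mismatches = [
    p for p in range(10_000) if is_prime_candidate(p) != is_prime_simplified(p)
]
print(mismatches)
```

Both versions already try divisors up to and including the integer part of the square root, so the two functions agree on every input in this range and the difference is one of readability, not of work done.
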