calib·HumanEval/18

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score

0.975

inter-judge spread

0.050

judges polled

sampling draws

1. the task

The function signature + docstring presented both to the model under test and to every judge. The entry-point name was obfuscated from how_many_times to fn_f1820e35c0 so the judges can't pattern-match a memorized HumanEval solution.



def fn_f1820e35c0(string: str, substring: str) -> int:
    """ Find how many times a given substring can be found in the original string. Count overlaping cases.
    >>> fn_f1820e35c0('', 'a')
    0
    >>> fn_f1820e35c0('aaa', 'a')
    3
    >>> fn_f1820e35c0('aaaa', 'aa')
    3
    """
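The renaming step described above can be sketched as a whole-word substitution. This is only an illustration: the helper name `obfuscate_entry_point`, the `salt` parameter, and the SHA-256-based naming scheme are assumptions, and the pipeline's real scheme for producing fn_f1820e35c0 is not stated here.

```python
import hashlib
import re


def obfuscate_entry_point(source: str, original_name: str, salt: str = "") -> tuple[str, str]:
    """Rename every whole-word occurrence of original_name in source.

    The new name is "fn_" plus the first 10 hex digits of a SHA-256 digest.
    That naming scheme is an assumption made for this sketch.
    """
    digest = hashlib.sha256((salt + original_name).encode("utf-8")).hexdigest()
    new_name = f"fn_{digest[:10]}"
    # \b keeps longer identifiers that merely contain the name untouched.
    pattern = re.compile(rf"\b{re.escape(original_name)}\b")
    return pattern.sub(new_name, source), new_name


task = (
    "def how_many_times(string: str, substring: str) -> int:\n"
    '    """\n'
    "    >>> how_many_times('aaa', 'a')\n"
    "    3\n"
    '    """\n'
)
obfuscated, new_name = obfuscate_entry_point(task, "how_many_times")
```

Renaming the doctest calls along with the signature matters: a leftover how_many_times in the docstring would give the original task away.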

2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.



def fn_f1820e35c0(string: str, substring: str) -> int:
    """ Find how many times a given substring can be found in the original string. Count overlaping cases.
    >>> fn_f1820e35c0('', 'a')
    0
    >>> fn_f1820e35c0('aaa', 'a')
    3
    >>> fn_f1820e35c0('aaaa', 'aa')
    3
    """
    times = 0

    for i in range(len(string) - len(substring) + 1):
        if string[i:i+len(substring)] == substring:
            times += 1

    return times
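As a quick sanity check, the candidate can be run against the three doctest cases, and compared with the built-in `str.count`, which counts non-overlapping matches only. The sketch below repeats the candidate's logic so that it runs on its own.

```python
def fn_f1820e35c0(string: str, substring: str) -> int:
    # Same sliding-window logic as the candidate above.
    times = 0
    for i in range(len(string) - len(substring) + 1):
        if string[i:i + len(substring)] == substring:
            times += 1
    return times


# The three doctest cases from the task.
assert fn_f1820e35c0('', 'a') == 0
assert fn_f1820e35c0('aaa', 'a') == 3
assert fn_f1820e35c0('aaaa', 'aa') == 3

# str.count resumes searching after the end of each match,
# so it misses the overlapping occurrence in the middle.
assert 'aaaa'.count('aa') == 2
```

The gap between 3 and 2 on 'aaaa' / 'aa' is why the task says to count overlapping cases: a solution that just returns `string.count(substring)` fails the third doctest.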

3a. point-pass scores (temperature 0)

One call per judge at temperature=0, giving each judge's "best single guess". Disagreement here is structural rather than sampling noise: the judges read this candidate differently.

claude-sonnet

0.950

$0.0087

gpt-4o

1.000

$0.0031
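The two headline numbers at the top of the page can be reproduced from these point-pass scores. The sketch assumes that "inter-judge spread" means the highest score minus the lowest; that reading matches the reported 0.050, but the pipeline's exact definition is not stated here.

```python
from statistics import mean

# Point-pass scores as reported above.
point_scores = {"claude-sonnet": 0.950, "gpt-4o": 1.000}

values = list(point_scores.values())
mean_score = round(mean(values), 3)           # 0.975
spread = round(max(values) - min(values), 3)  # 0.05, assuming spread = max - min
```

With only two judges, a standard deviation would give a different figure (0.025 with the population formula), so the range is the reading that fits the numbers shown.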

3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1. Dots are individual draws; vertical bar is the mean; shaded band is ±1σ. A wide band means the judge is inconsistent with itself across draws; a narrow band means it's consistent.
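For reference, the mean and the ±1σ band described above could be computed per judge as in this sketch. The draws listed are made up for illustration and are not the pipeline's data, and using the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def summarize_draws(draws: list[float]) -> dict:
    """Return the mean, sigma, and the (mean - sigma, mean + sigma) band."""
    m = mean(draws)
    s = stdev(draws)  # sample standard deviation; an assumed choice
    return {"mean": m, "sigma": s, "band": (m - s, m + s)}


# Hypothetical draws for one judge, NOT taken from this run.
hypothetical_draws = [0.95, 0.95, 1.0, 0.9, 0.95, 1.0, 0.95, 0.95, 0.9, 0.95]
summary = summarize_draws(hypothetical_draws)
```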

4. uncertainty quantification

Three statistical methods computed on top of the raw scores above. Each one tells you something different about how much to trust the headline number.

semantic-entropy

claude-sonnet

H = 0.325 nats (maximum: log 10 ≈ 2.303)

2 semantic clusters via bidirectional NLI

entropy as fraction of max

cluster sizes

c1=9 · c2=1

semantic-entropy

gpt-4o

H = 0.325 nats (maximum: log 10 ≈ 2.303)

2 semantic clusters via bidirectional NLI

entropy as fraction of max

cluster sizes

c1=9 · c2=1
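The reported entropy follows directly from the cluster sizes. A minimal sketch, assuming natural logarithms and a maximum of log(number of draws), i.e. every draw in its own cluster; both assumptions are consistent with the 0.325 and 2.303 shown above. The NLI clustering step itself is not reproduced here.

```python
import math


def cluster_entropy(cluster_sizes: list[int]) -> float:
    """Shannon entropy (in nats) of the empirical cluster distribution."""
    total = sum(cluster_sizes)
    probs = [size / total for size in cluster_sizes if size > 0]
    return -sum(p * math.log(p) for p in probs)


sizes = [9, 1]                 # c1=9, c2=1 as reported
h = cluster_entropy(sizes)     # ≈ 0.325
h_max = math.log(sum(sizes))   # log 10 ≈ 2.303
fraction_of_max = h / h_max    # ≈ 0.141
```

A 9-to-1 split sits close to the single-cluster case (H = 0), so the judge's rationales are mostly saying the same thing.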

How to read these: Self-consistency asks whether the judge agrees with itself when resampled; a narrower CI means more consistent. Semantic entropy asks whether the judge's rationales cluster into one meaning or several; higher entropy means the judge is internally conflicted about why it gave its score. Decomposition splits the total variance into within-judge sampling noise (aleatoric, hard to fix) and between-judge disagreement (epistemic, reducible by polling more judges).
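The decomposition can be illustrated with the law of total variance: total variance = mean of the within-judge variances + variance of the judge means. This is a sketch with made-up draws and equal draw counts per judge; the estimator the pipeline actually uses is not stated here, and `decompose_variance` is a hypothetical helper.

```python
from statistics import mean, pvariance


def decompose_variance(draws_by_judge: dict[str, list[float]]) -> dict:
    """Split pooled variance into within-judge and between-judge parts.

    Assumes every judge has the same number of draws, so the two parts
    add up exactly to the population variance of the pooled draws.
    """
    judge_means = [mean(d) for d in draws_by_judge.values()]
    within = mean([pvariance(d) for d in draws_by_judge.values()])  # aleatoric
    between = pvariance(judge_means)                                # epistemic
    return {"within": within, "between": between, "total": within + between}


# Hypothetical draws, NOT taken from this run.
example = {"judge_a": [0.9, 1.0], "judge_b": [0.8, 0.8]}
parts = decompose_variance(example)
```

In this toy case most of the variance is between judges, which is the situation where polling additional judges helps most.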

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same model call.
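To make the structured-output idea concrete, here is a sketch of what a judge's tool schema and its parsing might look like. The schema, the field names `score` and `rationale`, and the helper `parse_judgement` are all hypothetical; the pipeline's actual tool definition is not shown on this page.

```python
import json

# Hypothetical JSON Schema for the judge's single tool call.
JUDGE_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "score": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "rationale": {"type": "string"},
    },
    "required": ["score", "rationale"],
}


def parse_judgement(raw_arguments: str) -> tuple[float, str]:
    """Parse one tool-call payload into (score, rationale).

    Both fields arrive in the same payload, so they cannot come from
    two separate calls.
    """
    payload = json.loads(raw_arguments)
    score = float(payload["score"])
    rationale = str(payload["rationale"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score, rationale


score, rationale = parse_judgement('{"score": 0.95, "rationale": "Correct and clean."}')
```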

claude-sonnet

score 0.950 · 1938 tokens · 5614ms

The solution correctly implements overlapping substring counting using a sliding window approach. It passes all three provided doctests: empty string returns 0, 'aaa'/'a' returns 3, 'aaaa'/'aa' returns 3. Edge cases like empty substring could be problematic (infinite loop or ZeroDivisionError? No — `range(len(string) - 0 + 1)` = `range(len(string)+1)` which is valid, and `string[i:i+0] == ''` is always True, so it returns `len(string)+1` which is a reasonable/debatable behavior). The logic is sound for all realistic inputs. Minor style note: no handling of empty substring edge case, but the task doesn't specify it. Overall correct and clean.
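The empty-substring behavior this rationale reasons through is easy to confirm. The sketch below repeats the candidate's logic so that it runs standalone; it also checks a substring longer than the string, a case the rationale does not mention.

```python
def fn_f1820e35c0(string: str, substring: str) -> int:
    # Same sliding-window logic as the candidate in section 2.
    times = 0
    for i in range(len(string) - len(substring) + 1):
        if string[i:i + len(substring)] == substring:
            times += 1
    return times


# Empty substring: every position 0..len(string) matches, so len(string) + 1.
assert fn_f1820e35c0('abc', '') == 4
# Python's own str.count follows the same convention for the empty string.
assert 'abc'.count('') == 4
# Substring longer than the string: the range is empty, so the result is 0.
assert fn_f1820e35c0('ab', 'abc') == 0
```

Since the candidate agrees with `str.count` on the empty substring, the behavior the judge calls debatable is at least consistent with the standard library.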

gpt-4o

score 1.000 · 987 tokens · 1580ms

The candidate solution correctly implements the function to count overlapping occurrences of a substring within a string. It iterates over the string, checking each possible starting position for the substring and increments the count when a match is found. The solution passes all provided test cases, including edge cases such as an empty string and overlapping substrings. The implementation is clear and idiomatic.