PANOPTES
panoptes-e86ef9e3

calib·HumanEval/22

humaneval-calibprobe · code

This is one (task, candidate response) pair flowing through the full PANOPTES pipeline. Each section below is a stage of the analysis: the task itself, the candidate solution being evaluated, every judge's score + rationale, the sampling-pass dispersion that captures within-judge noise, and the uncertainty-quantification metrics computed on top.

mean score: 0.975
inter-judge spread: 0.050
judges polled: 2
sampling draws: 20
1. the task

The function signature + docstring presented to both the model under test and to every judge. The entry-point name was obfuscated from filter_integers to fn_cc301d73c6 so the judges can't pattern-match a memorized HumanEval solution.

from typing import List, Any


def fn_cc301d73c6(values: List[Any]) -> List[int]:
    """ Filter given list of any python values only for integers
    >>> fn_cc301d73c6(['a', 3.14, 5])
    [5]
    >>> fn_cc301d73c6([1, 2, 3, 'abc', {}, []])
    [1, 2, 3]
    """
2. the candidate response judged

The full solution every judge is grading. For these runs the candidate is the reference solution prepended with the task signature so it parses as a complete program.

from typing import List, Any


def fn_cc301d73c6(values: List[Any]) -> List[int]:
    """ Filter given list of any python values only for integers
    >>> fn_cc301d73c6(['a', 3.14, 5])
    [5]
    >>> fn_cc301d73c6([1, 2, 3, 'abc', {}, []])
    [1, 2, 3]
    """
    return [x for x in values if isinstance(x, int)]
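As a rough illustration of "prepended with the task signature so it parses as a complete program", the sketch below joins the task prompt and the one-line reference body, checks that the result parses, and runs the two docstring examples. The names `PROMPT`, `BODY`, and `candidate` are hypothetical; how PANOPTES actually assembles the candidate is not shown on this page.

```python
import ast

# Task prompt exactly as shown in section 1 (signature + docstring, no body).
PROMPT = '''from typing import List, Any


def fn_cc301d73c6(values: List[Any]) -> List[int]:
    """ Filter given list of any python values only for integers
    >>> fn_cc301d73c6(['a', 3.14, 5])
    [5]
    >>> fn_cc301d73c6([1, 2, 3, 'abc', {}, []])
    [1, 2, 3]
    """
'''

# Reference solution body (hypothetical variable name).
BODY = "    return [x for x in values if isinstance(x, int)]\n"

# Prepending the prompt makes the body a complete, parseable program.
candidate = PROMPT + BODY
ast.parse(candidate)  # raises SyntaxError if the pieces do not fit together

# Execute in an isolated namespace and check the two docstring examples.
namespace = {}
exec(candidate, namespace)
fn = namespace["fn_cc301d73c6"]
assert fn(['a', 3.14, 5]) == [5]
assert fn([1, 2, 3, 'abc', {}, []]) == [1, 2, 3]
```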
3a. point-pass scores (temperature 0)

One call per judge at temperature=0. This is each judge's "best single guess". With sampling noise largely removed, disagreement here is structural: the judges assess this candidate differently.

judge          score   cost
claude-sonnet  0.950   $0.0083
gpt-4o         1.000   $0.0027
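The headline numbers at the top of the page can be reproduced from these two point scores. The sketch below assumes that "inter-judge spread" means max minus min; the page does not define it, but that reading is consistent with the reported 0.050. The variable names are illustrative.

```python
# Point-pass scores copied from the table above.
point_scores = {"claude-sonnet": 0.950, "gpt-4o": 1.000}

scores = list(point_scores.values())

# Headline mean: plain arithmetic mean over judges.
mean_score = sum(scores) / len(scores)

# Assumption: spread is the range (max - min) across judges.
spread = max(scores) - min(scores)

print(f"mean score: {mean_score:.3f}")
print(f"inter-judge spread: {spread:.3f}")
```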
3b. sampling-pass dispersion (temperature 1)

n draws per judge at temperature=1 (here n = 10, i.e. 20 draws across 2 judges). In the plot, dots are individual draws, the vertical bar is the mean, and the shaded band is ±1σ. A wide band means the judge is uncertain even with itself; a narrow band means it is consistent.
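A minimal sketch of how the mean and the ±1σ band could be computed for one judge. The individual draws are not listed on this page, so `hypothetical_draws` is made-up data. Using the sample standard deviation is also an assumption; the page does not say which estimator it uses.

```python
import statistics


def dispersion_band(draws):
    """Return (mean, mean - sigma, mean + sigma) for one judge's draws."""
    mean = statistics.mean(draws)
    # Sample standard deviation; a single draw has no dispersion to estimate.
    sigma = statistics.stdev(draws) if len(draws) > 1 else 0.0
    return mean, mean - sigma, mean + sigma


# Made-up draws for illustration only (NOT the values behind the plot).
hypothetical_draws = [0.95, 0.95, 1.0, 0.9, 0.95, 0.95, 1.0, 0.95, 0.95, 0.95]

mean, low, high = dispersion_band(hypothetical_draws)
print(f"mean={mean:.3f} band=[{low:.3f}, {high:.3f}]")
```

A judge that returns the same score on every draw gets a band of zero width, which is the "narrow band" case described above.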

4. uncertainty quantification

Three statistical methods computed on top of the raw scores above. Each one tells you something different about how much to trust the headline number.

semantic-entropy · claude-sonnet
H = 0.325 / log 10 = 2.303 (entropy as fraction of max)
2 semantic clusters via bidirectional NLI
cluster sizes: c1=9 · c2=1

semantic-entropy · gpt-4o
H = 0.000 / log 10 = 2.303 (entropy as fraction of max)
1 semantic cluster via bidirectional NLI
cluster sizes: c1=10
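The entropy values can be checked from the cluster sizes alone. The sketch below computes Shannon entropy with the natural logarithm, which matches the stated maximum of log 10 = 2.303 for 10 draws. It does not reproduce the bidirectional-NLI clustering step; it only takes the resulting cluster sizes as input, and the function name is illustrative.

```python
import math


def semantic_entropy(cluster_sizes):
    """Shannon entropy (natural log) of the distribution over clusters."""
    total = sum(cluster_sizes)
    probs = [n / total for n in cluster_sizes if n > 0]
    # p * log(1/p) is the same as -p * log(p), and avoids returning -0.0.
    return sum(p * math.log(1.0 / p) for p in probs)


# Cluster sizes copied from the report above.
for judge, sizes in {"claude-sonnet": [9, 1], "gpt-4o": [10]}.items():
    h = semantic_entropy(sizes)
    h_max = math.log(sum(sizes))  # maximum: every draw in its own cluster
    print(f"{judge}: H = {h:.3f} / {h_max:.3f}")
```

With sizes 9 and 1 this gives roughly 0.325, and a single cluster gives exactly 0, in line with the figures reported for the two judges.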

How to read these: self-consistency = does the judge agree with itself when resampled? Narrower CI means more consistent. Semantic entropy = do the judge's rationales cluster into one meaning, or several? Higher entropy means the judge is internally conflicted about why. Decomposition = how much of the total variance comes from within-judge sampling noise (aleatoric, hard to fix) vs between-judge disagreement (epistemic, fixable by calling more judges).
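One common way to obtain such a decomposition is the law of total variance: average the per-judge variances to get the within-judge part, and take the variance of the per-judge means to get the between-judge part. The sketch below assumes that formulation and equal draw counts per judge. The page does not state which formula PANOPTES uses, and the draws shown are made up.

```python
import statistics


def decompose_variance(draws_by_judge):
    """Split total variance into (within-judge, between-judge) parts.

    Uses population variances, so with equal draws per judge the two
    parts sum to the variance of all draws pooled together.
    """
    groups = list(draws_by_judge.values())
    within = statistics.mean([statistics.pvariance(g) for g in groups])
    between = statistics.pvariance([statistics.mean(g) for g in groups])
    return within, between


# Made-up draws for illustration only.
hypothetical = {
    "judge_a": [0.90, 1.00, 0.95, 0.95],
    "judge_b": [1.00, 1.00, 1.00, 1.00],
}

within, between = decompose_variance(hypothetical)
print(f"within-judge (aleatoric): {within:.6f}")
print(f"between-judge (epistemic): {between:.6f}")
```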

5. judge rationales

The natural-language explanation each judge gave alongside its score. These are produced via tool-use structured output, so the score and the rationale are guaranteed to come from the same model call.

claude-sonnet
score 0.950 · 1868 tokens · 4951ms

The solution correctly filters a list to only include integers using `isinstance(x, int)`. It passes both provided doctests. One subtle edge case: in Python, `bool` is a subclass of `int`, so `True` and `False` would be included in the output (e.g., `fn_cc301d73c6([True, False, 1])` returns `[True, False, 1]`). Whether this is correct depends on interpretation — the task says "filter for integers" and booleans are technically integers in Python. This is a minor ambiguity rather than a clear bug. The solution is idiomatic, concise, and correct for all realistic inputs described in the task.
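The edge case raised in this rationale is easy to demonstrate. The sketch below contrasts the candidate's behavior with a stricter variant that rejects booleans. Both function names are hypothetical, and the strict variant illustrates the alternative interpretation; it is not something the task requires.

```python
from typing import Any, List


def filter_integers_lenient(values: List[Any]) -> List[int]:
    # Same logic as the candidate: bool is a subclass of int, so it passes.
    return [x for x in values if isinstance(x, int)]


def filter_integers_strict(values: List[Any]) -> List[int]:
    # Hypothetical stricter reading: keep ints but exclude True/False.
    return [x for x in values if isinstance(x, int) and not isinstance(x, bool)]


mixed = [True, False, 1, "a", 2.0]
print(filter_integers_lenient(mixed))  # booleans are kept
print(filter_integers_strict(mixed))   # booleans are dropped
```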

gpt-4o
score 1.000 · 920 tokens · 1270ms

The candidate solution correctly filters a list to include only integer values using a list comprehension with isinstance. It passes the provided examples and handles edge cases such as empty lists and lists with no integers. The implementation is idiomatic and clear.