run panoptes-44c4e9b3
2026-06-04 18:45:08 · strategy: all · demo_calibration.duckdb
items: 16
judge calls: 32
UQ results: 0
cost: $0.196 (46.8k tokens)
cost by judge
total: $0.196
claude-sonnet: $0.141
gpt-4o: $0.055
score distribution (point pass, by judge)
items
| item | family | claude-sonnet | gpt-4o | UQ |
|---|---|---|---|---|
| calib·HumanEval/34 | code | 1.000 | 1.000 | — |
| calib·HumanEval/35 | code | 0.800 | 0.500 | — |
| calib·HumanEval/36 | code | 0.950 | 0.800 | — |
| calib·HumanEval/37 | code | 1.000 | 0.800 | — |
| calib·HumanEval/38 | code | 1.000 | 0.500 | — |
| calib·HumanEval/39 | code | 0.850 | 0.800 | — |
| calib·HumanEval/40 | code | 0.800 | 1.000 | — |
| calib·HumanEval/41 | code | 1.000 | 1.000 | — |
| calib·HumanEval/42 | code | 1.000 | 1.000 | — |
| calib·HumanEval/43 | code | 0.950 | 1.000 | — |
| calib·HumanEval/44 | code | 0.800 | 1.000 | — |
| calib·HumanEval/45 | code | 1.000 | 1.000 | — |
| calib·HumanEval/46 | code | 0.950 | 1.000 | — |
| calib·HumanEval/47 | code | 1.000 | 1.000 | — |
| calib·HumanEval/48 | code | 1.000 | 0.800 | — |
| calib·HumanEval/49 | code | 0.800 | 0.800 | — |
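
The per-item scores in the table can be summarized with a short script. The sketch below is illustrative only: the `scores` dictionary is transcribed by hand from the table, and `summarize`, the per-judge mean, and the absolute-gap ranking are hypothetical choices made for this example, not something the report itself computes.

```python
from statistics import mean

# Scores transcribed from the items table: item -> (claude-sonnet, gpt-4o).
scores = {
    "calib·HumanEval/34": (1.000, 1.000),
    "calib·HumanEval/35": (0.800, 0.500),
    "calib·HumanEval/36": (0.950, 0.800),
    "calib·HumanEval/37": (1.000, 0.800),
    "calib·HumanEval/38": (1.000, 0.500),
    "calib·HumanEval/39": (0.850, 0.800),
    "calib·HumanEval/40": (0.800, 1.000),
    "calib·HumanEval/41": (1.000, 1.000),
    "calib·HumanEval/42": (1.000, 1.000),
    "calib·HumanEval/43": (0.950, 1.000),
    "calib·HumanEval/44": (0.800, 1.000),
    "calib·HumanEval/45": (1.000, 1.000),
    "calib·HumanEval/46": (0.950, 1.000),
    "calib·HumanEval/47": (1.000, 1.000),
    "calib·HumanEval/48": (1.000, 0.800),
    "calib·HumanEval/49": (0.800, 0.800),
}


def summarize(scores):
    """Return (claude-sonnet mean, gpt-4o mean, items sorted by judge gap)."""
    sonnet = [s for s, _ in scores.values()]
    gpt = [g for _, g in scores.values()]
    # Largest absolute disagreement between the two judges comes first.
    by_gap = sorted(
        scores,
        key=lambda item: abs(scores[item][0] - scores[item][1]),
        reverse=True,
    )
    return mean(sonnet), mean(gpt), by_gap


sonnet_mean, gpt_mean, by_gap = summarize(scores)
print(f"claude-sonnet mean: {sonnet_mean:.4f}")
print(f"gpt-4o mean:        {gpt_mean:.4f}")
print(f"largest judge gap:  {by_gap[0]}")
```

Ranking by gap is one simple way to pick which items to inspect first when the two judges disagree.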