run panoptes-e86ef9e3
2026-06-04 18:12:07 · strategy: all · demo_calibration.duckdb
| items | judge calls | UQ results | cost | tokens |
|---|---|---|---|---|
| 34 | 764 | 69 | $4.74 | 497.5k |
cost by judge
| total | claude-sonnet | gpt-4o |
|---|---|---|
| $4.74 | $3.49 | $1.26 |
[figure: score distribution (point pass, by judge)]
items
| item | family | claude-sonnet | gpt-4o | UQ |
|---|---|---|---|---|
| calib·HumanEval/0 | code | 0.950 | 0.800 | 2 blob(s) |
| calib·HumanEval/1 | code | 1.000 | 0.800 | 2 blob(s) |
| calib·HumanEval/10 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/11 | code | 0.800 | 1.000 | 2 blob(s) |
| calib·HumanEval/12 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/13 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/14 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/15 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/16 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/17 | code | 0.950 | 0.800 | 2 blob(s) |
| calib·HumanEval/18 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/19 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/2 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/20 | code | 0.950 | 0.500 | 2 blob(s) |
| calib·HumanEval/21 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/22 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/23 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/24 | code | 0.800 | 0.800 | 2 blob(s) |
| calib·HumanEval/25 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/26 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/27 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/28 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/29 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/3 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/30 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/31 | code | 0.800 | 0.500 | 2 blob(s) |
| calib·HumanEval/32 | code | 0.850 | 0.800 | 2 blob(s) |
| calib·HumanEval/33 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/4 | code | 0.950 | 1.000 | 2 blob(s) |
| calib·HumanEval/5 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/6 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/7 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/8 | code | 1.000 | 1.000 | 2 blob(s) |
| calib·HumanEval/9 | code | 1.000 | 1.000 | 2 blob(s) |
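
One way to read the table is to look at per-judge means and at the items where the two judges disagree most. The sketch below is illustrative only: it is not part of the panoptes tooling, the helper names `judge_means` and `largest_gaps` are hypothetical, and it uses four rows copied by hand from the table rather than reading `demo_calibration.duckdb`.

```python
from statistics import mean

# Four rows copied by hand from the items table: (item, {judge: score}).
# This is a sample, not the full run.
rows = [
    ("calib·HumanEval/0", {"claude-sonnet": 0.950, "gpt-4o": 0.800}),
    ("calib·HumanEval/10", {"claude-sonnet": 1.000, "gpt-4o": 1.000}),
    ("calib·HumanEval/20", {"claude-sonnet": 0.950, "gpt-4o": 0.500}),
    ("calib·HumanEval/31", {"claude-sonnet": 0.800, "gpt-4o": 0.500}),
]


def judge_means(rows):
    """Return the mean score per judge over the given rows."""
    judges = sorted({judge for _, scores in rows for judge in scores})
    return {
        judge: mean(scores[judge] for _, scores in rows if judge in scores)
        for judge in judges
    }


def largest_gaps(rows, top=3):
    """Rank items by the spread (max - min) between judge scores, largest first."""
    gaps = [
        (item, round(max(scores.values()) - min(scores.values()), 3))
        for item, scores in rows
    ]
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)[:top]


if __name__ == "__main__":
    print(judge_means(rows))
    print(largest_gaps(rows))
```

Extending `rows` to all 34 items would give run-level figures. Spread between judges is only a rough disagreement signal and is not the same thing as the UQ results reported by the run.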