PANOPTES
judge agreement

Do the judges agree with each other?

PANOPTES routinely runs three different LLM judges (Claude, GPT-4o, Gemini) on the same task. If they're all rating the same latent quality on the same scale, their scores should be tightly correlated. If they're not, the framework's job is to give us a principled way to combine their disagreement into a posterior. But it's worth knowing how much disagreement there is to begin with.

Spearman ρ

Rank-correlation: do the judges agree on which items are better than which? Insensitive to systematic bias (one judge rating 0.7 where another rates 0.5 is fine, as long as they preserve the ordering).
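
To make the "insensitive to systematic bias" point concrete, here is a minimal stdlib-only sketch of Spearman ρ computed as the Pearson correlation of ranks. The function names and the two score lists are made up for illustration; this is not the code PANOPTES runs.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(a, b):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical scores: judge_b rates every item 0.2 lower than judge_a,
# but the ordering of the items is identical.
judge_a = [0.7, 0.5, 0.9, 0.6]
judge_b = [0.5, 0.3, 0.7, 0.4]
rho = spearman_rho(judge_a, judge_b)
```

Because only the ranks enter the calculation, the constant 0.2 offset between the two hypothetical judges has no effect on ρ.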

Kendall τ

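Same idea as ρ, but it counts concordant vs. discordant pairs of items instead of comparing ranks directly. It is more conservative (its values are typically smaller in magnitude than ρ on the same data) and less sensitive to a few badly misranked items. A different lens on the same "do they rank things the same way" question.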
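
The pair-counting idea can be sketched in a few lines. This is the simplest variant, usually called tau-a, which treats tied pairs as neither concordant nor discordant; which variant PANOPTES uses is not stated here, so treat the choice (and the function name) as an assumption. SciPy's `kendalltau`, for comparison, defaults to the tie-adjusted tau-b.

```python
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall tau-a: (concordant - discordant) / total number of item pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        # Positive product: both judges order items i and j the same way.
        direction = (a[i] - a[j]) * (b[i] - b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie in at least one judge; counted as neither.
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: the judges agree except on the order of the last two items.
judge_a = [1, 2, 3, 4]
judge_b = [1, 2, 4, 3]
tau = kendall_tau_a(judge_a, judge_b)  # 5 concordant, 1 discordant, 6 pairs
```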

Permutation p

Probability the observed mean disagreement |a − b| would arise if we randomly shuffled which score belongs to which judge. Small p → disagreement is structural, not noise.
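
One way such a test could look is sketched below. It rests on two assumptions that the description above does not settle: the shuffle permutes one judge's scores across items (swapping the two scores within a single item would leave |a − b| unchanged), and p is the share of shuffles whose mean disagreement is at least as large as the observed one. All names are illustrative.

```python
import random


def mean_abs_diff(a, b):
    """Mean absolute disagreement between two paired score lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def permutation_p(a, b, n_perm=2000, seed=0):
    """Return (observed, p), where p is the share of shuffles that disagree
    at least as much as the real pairing (assumed direction, see lead-in)."""
    rng = random.Random(seed)
    observed = mean_abs_diff(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the item-level pairing between judges
        if mean_abs_diff(a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps p strictly above zero for a finite number of shuffles.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical scores: judge_b ranks the items in exactly the opposite order.
judge_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
judge_b = list(reversed(judge_a))
obs, p = permutation_p(judge_a, judge_b)
```

Under this reading, two judges who give identical scores get p = 1 (every shuffle disagrees at least as much as the real pairing), while the reversed pairing above gets a small p.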

at a glance

Pairwise Spearman ρ heatmap

Each cell is the rank-correlation between two judges. Greener = more agreement on ordering. Diagonal is by definition 1.0 (judge vs itself). Empty cells = the two judges didn't both score any common items in this run.

from run panoptes-696da4d5 · strategy all

               claude-haiku  claude-sonnet  gpt-4o-mini
claude-haiku           1.00           0.49        -0.13
claude-sonnet          0.49           1.00         0.11
gpt-4o-mini           -0.13           0.11         1.00

Legend: high agreement · partial · low / no signal · disagreement

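
A matrix like this can be built by restricting each judge pair to the items both of them scored. The sketch below assumes a hypothetical `scores` layout (judge name to item id to score) and uses the textbook no-ties Spearman formula to stay short; the `min_common` cutoff is also an assumption, needed because the formula is undefined for fewer than two shared items.

```python
def ranks_no_ties(values):
    """1-based ranks for data without tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_no_ties(a, b):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks_no_ties(a), ranks_no_ties(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def pairwise_matrix(scores, min_common=2):
    """Map (judge_p, judge_q) to rho over shared items, or None for an empty cell."""
    matrix = {}
    for p in scores:
        for q in scores:
            common = sorted(set(scores[p]) & set(scores[q]))
            if len(common) < min_common:
                matrix[p, q] = None  # too few common items to correlate
            else:
                matrix[p, q] = spearman_no_ties(
                    [scores[p][item] for item in common],
                    [scores[q][item] for item in common],
                )
    return matrix


# Hypothetical data: judge_c scored a disjoint set of items.
scores = {
    "judge_a": {"item1": 0.2, "item2": 0.5, "item3": 0.9},
    "judge_b": {"item1": 0.3, "item2": 0.4, "item3": 0.8},
    "judge_c": {"item4": 0.6, "item5": 0.1},
}
matrix = pairwise_matrix(scores)
```

Here the judge_a / judge_c cell comes out as None, which corresponds to an empty cell in the heatmap.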
per pair

Scatter + bootstrap CIs

One scatter per judge pair. Each dot is one item; the dashed line is "perfect agreement." The Spearman ρ and Kendall τ next to it come with 90% paired-bootstrap CIs. The framework never reports rank correlation as a point estimate.

claude-haiku vs claude-sonnet (n=22)
  Spearman ρ      0.485  [n/a, n/a]
  Kendall τ       0.458  [n/a, n/a]
  permutation p   1.000  (obs 0.089)

claude-haiku vs gpt-4o-mini (n=22)
  Spearman ρ     -0.131  [n/a, n/a]
  Kendall τ      -0.126  [n/a, n/a]
  permutation p   1.000  (obs 0.130)

claude-sonnet vs gpt-4o-mini (n=22)
  Spearman ρ      0.115  [n/a, n/a]
  Kendall τ       0.111  [n/a, n/a]
  permutation p   1.000  (obs 0.055)

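
For reference, a 90% paired-bootstrap interval of the kind described at the top of this section can be sketched as follows: resample whole items with replacement so each item's two scores stay together, recompute the statistic, and read off the 5th and 95th percentiles. This is a generic percentile bootstrap under assumed names and defaults, not the PANOPTES implementation, and the scores are invented.

```python
import random
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall tau-a; tied pairs count as neither concordant nor discordant."""
    net = 0
    for i, j in combinations(range(len(a)), 2):
        direction = (a[i] - a[j]) * (b[i] - b[j])
        net += (direction > 0) - (direction < 0)
    return net / (len(a) * (len(a) - 1) // 2)


def paired_bootstrap_ci(a, b, stat, n_boot=2000, level=0.90, seed=0):
    """Percentile bootstrap CI for stat(a, b), resampling items, not single scores."""
    rng = random.Random(seed)
    n = len(a)
    draws = []
    for _ in range(n_boot):
        # One index per draw keeps both judges' scores for an item paired.
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(stat([a[i] for i in idx], [b[i] for i in idx]))
    draws.sort()
    tail = (1 - level) / 2
    lower = draws[int(tail * (n_boot - 1))]
    upper = draws[int((1 - tail) * (n_boot - 1))]
    return lower, upper


# Hypothetical scores for ten items from two judges.
judge_a = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9, 0.2, 0.7, 0.5, 0.3]
judge_b = [0.2, 0.3, 0.5, 0.7, 0.4, 0.8, 0.1, 0.9, 0.6, 0.35]
low, high = paired_bootstrap_ci(judge_a, judge_b, kendall_tau_a)
```

Resamples contain duplicated items and therefore ties, which tau-a counts as zero; a tie-adjusted statistic may be a better fit if the intervals look pulled toward zero.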
by run

How agreement changes by strategy

The same judges may agree more or less depending on which items they were asked to rate. Runs using bandit routing tend to select harder items more often, which can suppress correlation; the all-judges runs see every item.

panoptes-696da4d5 (all)
  claude-haiku vs claude-sonnet   ρ = 0.485
  claude-haiku vs gpt-4o-mini     ρ = -0.131
  claude-sonnet vs gpt-4o-mini    ρ = 0.115

panoptes-d636a93f (bandit)
  claude-haiku vs gpt-4o-mini     ρ = 0.124
  claude-sonnet vs gpt-4o-mini    ρ = -0.188

panoptes-e86ef9e3 (all)
  claude-sonnet vs gpt-4o         ρ = 0.503

panoptes-44c4e9b3 (all)
  claude-sonnet vs gpt-4o         ρ = 0.150