Do the judges agree with each other?
PANOPTES routinely runs three different LLM judges (Claude, GPT-4o, Gemini) on the same task. If they're all rating the same latent quality on the same scale, their scores should be tightly correlated. If they're not, the framework's job is to give us a principled way to combine their disagreement into a posterior. But it's worth knowing how much disagreement there is to begin with.
Spearman ρ: rank correlation. Do the judges agree on which items are better than which? Insensitive to systematic bias (one judge rating 0.7 where another rates 0.5 is fine, as long as both preserve the ordering).
Kendall τ: same idea as ρ, but it counts concordant vs. discordant pairs of items. More conservative and less sensitive to outliers. A different lens on the same "do they rank things the same way" question.
Permutation-test p-value: the probability that a mean disagreement |a − b| at least as large as the observed one would arise if we randomly shuffled which score belongs to which judge. Small p → disagreement is structural, not noise.
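To make the two rank statistics concrete, here is a minimal stdlib-only sketch. It is an illustration, not PANOPTES's own code: the function names and the toy scores are hypothetical, and Kendall τ is computed in its simplest tau-a form (ties count as neither concordant nor discordant).

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of item pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five items.
judge_a = [0.9, 0.7, 0.5, 0.3, 0.1]
judge_b = [0.7, 0.5, 0.3, 0.1, 0.0]  # harsher than judge_a, same ordering
judge_c = [0.9, 0.5, 0.7, 0.3, 0.1]  # items 2 and 3 swapped

print(spearman_rho(judge_a, judge_b), kendall_tau(judge_a, judge_b))  # 1.0 1.0
print(spearman_rho(judge_a, judge_c), kendall_tau(judge_a, judge_c))  # 0.9 0.8
```

The harsher judge still gets ρ = τ = 1.0 because only the ordering matters. A single swapped pair costs τ more than ρ, which is the sense in which τ is the more conservative of the two.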
Pairwise Spearman ρ heatmap
Each cell is the rank-correlation between two judges. Greener = more agreement on ordering. Diagonal is by definition 1.0 (judge vs itself). Empty cells = the two judges didn't both score any common items in this run.
| | claude-haiku | claude-sonnet | gpt-4o-mini |
|---|---|---|---|
| claude-haiku | 1.00 | 0.49 | -0.13 |
| claude-sonnet | 0.49 | 1.00 | 0.11 |
| gpt-4o-mini | -0.13 | 0.11 | 1.00 |
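
A matrix like the one above could be assembled roughly as follows. This is a sketch under stated assumptions, not the framework's implementation: the `{judge: {item_id: score}}` layout, the function names, and the judge and item labels are all hypothetical. It also treats a pair with fewer than two shared items, or with constant scores on either side, as an empty cell, since ρ is undefined there.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks; None if either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)

def pairwise_spearman(scores):
    """scores: {judge: {item_id: score}} -> {(judge_a, judge_b): rho or None}."""
    matrix = {}
    for a in scores:
        for b in scores:
            # Only items that both judges scored can be compared.
            common = sorted(set(scores[a]) & set(scores[b]))
            if len(common) < 2:
                matrix[a, b] = None  # empty cell in the heatmap
            else:
                matrix[a, b] = spearman_rho(
                    [scores[a][item] for item in common],
                    [scores[b][item] for item in common],
                )
    return matrix

# Hypothetical run: judge_z was routed to different items than the other two.
scores = {
    "judge_x": {"q1": 0.9, "q2": 0.6, "q3": 0.4, "q4": 0.2},
    "judge_y": {"q1": 0.8, "q2": 0.3, "q3": 0.5, "q4": 0.1},
    "judge_z": {"q5": 0.7, "q6": 0.2},
}
matrix = pairwise_spearman(scores)
print(matrix["judge_x", "judge_y"])  # 0.8
print(matrix["judge_x", "judge_z"])  # None
```

Restricting each cell to the shared items means different cells can rest on different item sets, so cells from sparse runs are not directly comparable to cells where both judges saw everything.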
Scatter + bootstrap CIs
One scatter plot per judge pair. Each dot is one item; the dashed line marks perfect agreement. The Spearman ρ and Kendall τ shown next to each plot come with 90% paired-bootstrap CIs: the framework never reports a rank correlation as a bare point estimate.
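A paired bootstrap resamples items, keeping both judges' scores for an item together, and recomputes the statistic on each resample. The sketch below uses the percentile method for the 90% interval. That choice, the resample count, the fixed seed, the function names, and the toy scores are all assumptions for illustration; the framework may construct its intervals differently.

```python
import random
from math import sqrt
from statistics import quantiles

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks; None if either side is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)

def paired_bootstrap_ci(x, y, stat, n_boot=2000, seed=0):
    """90% percentile interval for stat(x, y), resampling items as pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # Draw item indices with replacement; each index carries both scores.
        idx = [rng.randrange(n) for _ in range(n)]
        value = stat([x[i] for i in idx], [y[i] for i in idx])
        if value is not None:  # skip degenerate resamples
            estimates.append(value)
    cuts = quantiles(estimates, n=20)  # 19 cut points at 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]

# Hypothetical scores from two judges on the same eight items.
judge_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
judge_b = [0.8, 0.9, 0.5, 0.7, 0.6, 0.2, 0.4, 0.1]

point = spearman_rho(judge_a, judge_b)
lo, hi = paired_bootstrap_ci(judge_a, judge_b, spearman_rho)
print(f"rho = {point:.2f}, 90% CI [{lo:.2f}, {hi:.2f}]")
```

Passing `stat` as a parameter lets the same routine produce an interval for Kendall τ. With only a handful of items the interval will be wide, which is exactly the information a bare point estimate hides.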
How agreement changes by strategy
The same judges may agree more or less depending on which items they were asked to rate. Runs using bandit routing tend to select harder items more often, which can suppress correlation; the all-judges runs see every item.