judge agreement

Do the judges agree with each other?

PANOPTES routinely runs three different LLM judges (Claude, GPT-4o, Gemini) on the same task. If they're all rating the same latent quality on the same scale, their scores should be tightly correlated. If the scores aren't tightly correlated, the framework's job is to give us a principled way to combine the judges' disagreement into a posterior. Either way, it's worth knowing how much disagreement there is to begin with.

Spearman ρ

Rank-correlation: do the judges agree on which items are better than which? Insensitive to systematic bias (one judge rating 0.7 where another rates 0.5 is fine, as long as they preserve the ordering).
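To make the "insensitive to systematic bias" point concrete, here is a minimal sketch of Spearman ρ as the Pearson correlation of rank vectors. The scores and the names `judge_a`, `judge_b`, and `judge_c` are hypothetical, and this is not necessarily how PANOPTES computes the statistic.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / spread ** 0.5

# Hypothetical scores for five items.
judge_a = [0.9, 0.7, 0.5, 0.8, 0.3]
judge_b = [0.7, 0.5, 0.3, 0.6, 0.1]   # always lower, same ordering
judge_c = [0.1, 0.3, 0.5, 0.2, 0.7]   # exactly the opposite ordering

rho_biased = spearman_rho(judge_a, judge_b)    # 1.0: the offset does not matter
rho_reversed = spearman_rho(judge_a, judge_c)  # -1.0: ordering fully inverted
```

The constant gap between `judge_a` and `judge_b` leaves ρ at 1.0 because only the ordering of items enters the calculation.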

Kendall τ

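Same idea as ρ, but it counts concordant vs. discordant pairs of items. It is more conservative: its values are usually smaller in magnitude than ρ on the same data, and it is less sensitive to a few badly misranked items. A different lens on the same "do they rank things the same way" question.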
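The pair counting can be sketched directly. This example implements the tau-b variant, which adjusts for ties; which variant the dashboard uses is not stated, so treat that choice and the sample scores as assumptions.

```python
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall tau-b from counts of concordant and discordant item pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied on both judges: ignored
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1           # both judges order the pair the same way
        else:
            discordant += 1           # the judges order the pair oppositely
    denom = ((concordant + discordant + tied_x_only)
             * (concordant + discordant + tied_y_only)) ** 0.5
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical scores: the judges disagree only on the order of the last two items.
judge_a = [0.1, 0.2, 0.3, 0.4, 0.5]
judge_b = [0.1, 0.2, 0.3, 0.5, 0.4]

tau = kendall_tau_b(judge_a, judge_b)  # 9 concordant, 1 discordant -> 0.8
```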

Permutation p

Probability that a mean disagreement |a − b| at least as large as the observed one would arise if we randomly shuffled which score belongs to which judge. Small p → disagreement is structural, not noise.
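A minimal sketch of such a permutation test follows. It assumes the shuffle pools both judges' scores and deals them back out at random, because swapping the two scores within a single item leaves |a − b| unchanged. The function names, the sample scores, and the add-one correction are illustrative choices, not the framework's confirmed procedure.

```python
import random

def mean_abs_diff(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def permutation_p(a, b, n_perm=2000, seed=0):
    """One-sided p: share of shuffles whose mean |a - b| is >= the observed one."""
    rng = random.Random(seed)
    observed = mean_abs_diff(a, b)
    pooled = list(a) + list(b)
    n = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # redeal every score to a random judge and item
        if mean_abs_diff(pooled[:n], pooled[n:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical judges: judge_a scores everything high, judge_b everything low.
judge_a = [0.9, 0.8, 0.85, 0.95, 0.9, 0.8, 0.75, 0.9, 0.85, 0.95]
judge_b = [0.1, 0.2, 0.15, 0.05, 0.1, 0.2, 0.25, 0.1, 0.15, 0.05]

obs_biased, p_biased = permutation_p(judge_a, judge_b)    # large gap, small p
obs_same, p_same = permutation_p(judge_a, judge_a)        # no gap, p = 1.0
```

When the observed disagreement is zero, every shuffle matches or exceeds it, so p is exactly 1.0. A p of 1.000 therefore says only that shuffling never produced less disagreement than was observed.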

at a glance

Pairwise Spearman ρ heatmap

Each cell is the rank-correlation between two judges. Greener = more agreement on ordering. Diagonal is by definition 1.0 (judge vs itself). Empty cells = the two judges didn't both score any common items in this run.

from run panoptes-696da4d5 · strategy all

	claude-haiku	claude-sonnet	gpt-4o-mini
claude-haiku	1.00	0.49	-0.13
claude-sonnet	0.49	1.00	0.11
gpt-4o-mini	-0.13	0.11	1.00

Legend: high agreement · partial · low / no signal · disagreement
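A matrix like the one above can be assembled from per-judge score maps. The sketch below is an assumption about the data layout (a dict of judge name to `{item_id: score}`), with hypothetical judge and item names; a cell is left empty (`None`) when two judges share too few items.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / spread ** 0.5 if spread > 0 else None

def pairwise_rho(scores, min_items=2):
    """Map (judge, judge) -> rho over the items both judges scored."""
    matrix = {}
    for a in scores:
        for b in scores:
            common = sorted(set(scores[a]) & set(scores[b]))
            if a == b:
                matrix[a, b] = 1.0      # diagonal: judge vs itself
            elif len(common) < min_items:
                matrix[a, b] = None     # empty cell: not enough shared items
            else:
                matrix[a, b] = spearman_rho(
                    [scores[a][i] for i in common],
                    [scores[b][i] for i in common],
                )
    return matrix

# Hypothetical run: judge_z scored a disjoint set of items.
scores = {
    "judge_x": {"t1": 0.9, "t2": 0.4, "t3": 0.6},
    "judge_y": {"t1": 0.8, "t2": 0.5, "t3": 0.7},
    "judge_z": {"t4": 0.3, "t5": 0.2},
}
matrix = pairwise_rho(scores)
```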

per pair

Scatter + bootstrap CIs

One scatter per judge pair. Each dot is one item; the dashed line is "perfect agreement." The Spearman ρ and Kendall τ next to it come with 90% paired-bootstrap CIs. The framework never reports rank correlation as a point estimate.
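A paired bootstrap resamples items, keeping each item's two scores together, and recomputes the statistic on every resample. The sketch below uses a percentile interval at the 90% level; the percentile method, the resample count, and the sample scores are assumptions, since the page does not say how its intervals are built.

```python
import random
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / spread ** 0.5 if spread > 0 else None

def paired_bootstrap_ci(x, y, stat, level=0.90, n_boot=2000, seed=0):
    """Percentile CI for stat(x, y), resampling items with replacement."""
    rng = random.Random(seed)
    n = len(x)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both judges
        value = stat([x[i] for i in idx], [y[i] for i in idx])
        if value is not None:                       # skip degenerate resamples
            draws.append(value)
    draws.sort()
    alpha = (1 - level) / 2
    lo = draws[int(alpha * (len(draws) - 1))]
    hi = draws[int((1 - alpha) * (len(draws) - 1))]
    return lo, hi

# Hypothetical scores from two judges on eight items.
judge_a = [0.2, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9]
judge_b = [0.3, 0.35, 0.6, 0.5, 0.65, 0.6, 0.9, 0.85]

rho = spearman_rho(judge_a, judge_b)
ci_lo, ci_hi = paired_bootstrap_ci(judge_a, judge_b, spearman_rho)
```

The same `paired_bootstrap_ci` helper works for any paired statistic, since it only needs a function of two equal-length score lists.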

claude-haiku vs claude-sonnet · n=22

Spearman ρ 0.485 [n/a, n/a]

Kendall τ 0.458 [n/a, n/a]

permutation p 1.000 · obs 0.089

claude-haiku vs gpt-4o-mini · n=22

Spearman ρ -0.131 [n/a, n/a]

Kendall τ -0.126 [n/a, n/a]

permutation p 1.000 · obs 0.130

claude-sonnet vs gpt-4o-mini · n=22

Spearman ρ 0.115 [n/a, n/a]

Kendall τ 0.111 [n/a, n/a]

permutation p 1.000 · obs 0.055

by run

How agreement changes by strategy

The same judges may agree more or less depending on which items they were asked to rate. Runs using bandit routing tend to select harder items more often, which can suppress correlation; the all-judges runs see every item.

panoptes-696da4d5 · strategy all

claude-haiku ↔ claude-sonnet: ρ = 0.485

claude-haiku ↔ gpt-4o-mini: ρ = -0.131

claude-sonnet ↔ gpt-4o-mini: ρ = 0.115

panoptes-d636a93f · strategy bandit

claude-haiku ↔ gpt-4o-mini: ρ = 0.124

claude-sonnet ↔ gpt-4o-mini: ρ = -0.188

panoptes-e86ef9e3 · strategy all

claude-sonnet ↔ gpt-4o: ρ = 0.503

panoptes-44c4e9b3 · strategy all

claude-sonnet ↔ gpt-4o: ρ = 0.150