methods

Math & citations

One section per implemented method, with the paper(s) it cites and the Python module the implementation lives in. The framework is built around the principle that every statistical claim is paper-grounded and the residuals are auditable.

Split conformal prediction

panoptes.uq.conformal_split

shipped

Finite-sample marginal coverage ≥ 1 − α under exchangeability, using the ⌈(n+1)(1−α)⌉/n empirical quantile of the calibration scores. Intervals are clipped to [0, 1], the range of the rubric scale.
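A minimal sketch of the quantile correction and the clip, assuming the conformity score is the absolute residual |y − ŷ| on a held-out calibration set. The function name and signature are illustrative and are not taken from panoptes.uq.conformal_split.

```python
import math


def split_conformal_interval(cal_residuals, prediction, alpha=0.1, lo=0.0, hi=1.0):
    """Return a clipped (lower, upper) interval around one point prediction.

    cal_residuals: absolute residuals |y - y_hat| from the calibration set.
    """
    n = len(cal_residuals)
    # Rank of the corrected quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: fall back to the whole scale.
        return lo, hi
    q = sorted(cal_residuals)[k - 1]  # k-th smallest residual (1-indexed rank)
    # Clip to the rubric scale.
    return max(lo, prediction - q), min(hi, prediction + q)
```

With n = 19 calibration residuals and α = 0.1 the rank is ⌈20 × 0.9⌉ = 18, so the half-width is the 18th smallest residual.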

references

Papadopoulos, Proedrou, Vovk, Gammerman (2002). Inductive Confidence Machines for Regression. ECML.
Vovk, Gammerman, Shafer (2005). Algorithmic Learning in a Random World. Springer.
Angelopoulos, Bates (2023). A Gentle Introduction to Conformal Prediction. arXiv:2107.07511.

Conformalized Quantile Regression (CQR)

panoptes.uq.conformal_adaptive

shipped

Input-adaptive intervals from lower and upper quantile regressors (sklearn GradientBoostingRegressor(loss='quantile')) fit on judge-output features, then conformalized on a calibration set. Width shrinks where the predicted quantiles lie close together.
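The conformalization step from Romano et al. (2019) can be sketched as follows. The sketch takes the lower and upper quantile predictions as given (in the module they would come from the quantile regressors, which are not fit here). Both function names are hypothetical.

```python
import math


def cqr_correction(cal_lower, cal_upper, cal_y, alpha=0.1):
    """Conformal correction for CQR.

    Score per calibration point: max(q_lo(x) - y, y - q_hi(x)).
    It is negative when y falls strictly inside the predicted band.
    """
    scores = sorted(
        max(q_lo - y, y - q_hi) for q_lo, q_hi, y in zip(cal_lower, cal_upper, cal_y)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the correction is unbounded.
    return scores[k - 1] if k <= n else math.inf


def cqr_interval(q_lo, q_hi, correction, lo=0.0, hi=1.0):
    """Widen (or shrink, if the correction is negative) the band, then clip."""
    return max(lo, q_lo - correction), min(hi, q_hi + correction)
```

The same correction is added to every test point, so the adaptivity comes entirely from the predicted quantiles. A strongly negative correction can make the lower end exceed the upper end, which this sketch does not guard against.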

references

Romano, Patterson, Candès (2019). Conformalized Quantile Regression. NeurIPS.

Mondrian / group-conditional conformal

panoptes.uq.conformal_mondrian

shipped

Per-task-family quantiles. Conditional coverage P(Y ∈ C(X) | g(X) = g) ≥ 1 − α within each group. Falls back to the pooled marginal quantile when n_group < 50, in which case only marginal coverage holds for that group.
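A sketch of the per-group quantile with the pooled fallback, assuming absolute-residual scores and a group label per calibration point. The names conformal_quantile and mondrian_quantiles are hypothetical, and min_group=50 mirrors the threshold stated above.

```python
import math
from collections import defaultdict


def conformal_quantile(residuals, alpha):
    """k-th smallest residual with k = ceil((n + 1) * (1 - alpha))."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(residuals)[k - 1] if k <= n else math.inf


def mondrian_quantiles(residuals, groups, alpha=0.1, min_group=50):
    """Map each group label to its own quantile, or to the pooled one if small."""
    by_group = defaultdict(list)
    for r, g in zip(residuals, groups):
        by_group[g].append(r)
    pooled = conformal_quantile(residuals, alpha)
    return {
        g: conformal_quantile(rs, alpha) if len(rs) >= min_group else pooled
        for g, rs in by_group.items()
    }
```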

references

Vovk, Lindsay, Nouretdinov, Gammerman (2003). Mondrian Confidence Machine.

Semantic entropy

panoptes.uq.semantic_entropy

shipped

Bidirectional NLI clustering of temperature samples, then Shannon entropy over cluster proportions, bounded in [0, log N] for N samples. Two backends: local DeBERTa-v3-large-mnli (HF) and LLM-as-NLI.
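A sketch of the clustering and entropy steps with the NLI backend abstracted away. The `entails(a, b)` callable is a hypothetical stand-in for either backend, and the greedy comparison against each cluster's first member is one common choice, not necessarily what the module does.

```python
import math


def cluster_by_entailment(samples, entails):
    """Greedy clustering: join a cluster if entailment holds in both directions."""
    clusters = []
    for s in samples:
        for cluster in clusters:
            representative = cluster[0]
            if entails(s, representative) and entails(representative, s):
                cluster.append(s)
                break
        else:
            # No existing cluster matched: start a new one.
            clusters.append([s])
    return clusters


def semantic_entropy(samples, entails):
    """Shannon entropy (in nats) over cluster proportions."""
    clusters = cluster_by_entailment(samples, entails)
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

The entropy is 0 when all samples fall into one cluster and log N when every sample is its own cluster.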

references

Farquhar, Kossen, Kuhn, Gal (2024). Detecting hallucinations in large language models using semantic entropy. Nature.

Self-consistency variance

panoptes.uq.self_consistency

shipped

Monte Carlo variance, IQR, and a Bayesian bootstrap CI (Dirichlet(1, ..., 1) weights) over n temperature samples per (judge, item) pair.
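A sketch of the three summaries for one (judge, item) pair. It assumes the bootstrapped statistic is the mean score, which the description above does not specify. The function names and the default of 2000 draws are illustrative.

```python
import random
import statistics


def bayesian_bootstrap_ci(scores, level=0.95, draws=2000, seed=0):
    """Credible interval for the mean under Dirichlet(1, ..., 1) weights."""
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        # Normalised Exp(1) draws form a Dirichlet(1, ..., 1) weight vector.
        g = [rng.expovariate(1.0) for _ in scores]
        total = sum(g)
        means.append(sum(w * s for w, s in zip(g, scores)) / total)
    means.sort()
    tail = (1 - level) / 2
    return means[round(tail * draws)], means[round((1 - tail) * draws) - 1]


def self_consistency_summary(scores):
    """Sample variance, interquartile range and Bayesian bootstrap CI."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return {
        "variance": statistics.variance(scores),
        "iqr": q3 - q1,
        "mean_ci": bayesian_bootstrap_ci(scores),
    }
```

Because each weighted mean is a convex combination of the observed scores, the interval can never leave the range of the samples.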

references

Wang, Wei, Schuurmans, Le, Chi, et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR.
Rubin (1981). The Bayesian Bootstrap. Annals of Statistics 9(1).

Hierarchical-Gaussian jury aggregation

panoptes.uq.disagreement

shipped

Closed-form EM for score_ij = θ_i + bias_j + ε_ij with ε_ij ~ N(0, σ_j²). Recovers per-item posterior over latent quality θ plus per-judge bias and precision. Identifiability via Σ_j bias_j = 0.
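A sketch of one way to run this EM on a complete item × judge score matrix. The Gaussian prior θ_i ~ N(μ, τ²) is an assumption added here so that a per-item posterior exists; the source does not state the prior. The function name, the iteration count and the variance floor are also illustrative.

```python
import random


def fit_jury_em(scores, iters=200, floor=1e-8):
    """EM for score_ij = theta_i + bias_j + eps_ij, eps_ij ~ N(0, var_j).

    scores: list of rows, one row per item, one column per judge (no gaps).
    Assumed prior (not from the source): theta_i ~ N(mu, tau2).
    """
    n_items, n_judges = len(scores), len(scores[0])
    judges = range(n_judges)
    bias, var = [0.0] * n_judges, [1.0] * n_judges
    mu = sum(map(sum, scores)) / (n_items * n_judges)
    tau2 = 1.0

    def e_step():
        # Gaussian posterior over each theta_i. The posterior variance is
        # shared across items because every judge scores every item.
        prec = 1.0 / tau2 + sum(1.0 / v for v in var)
        means = [
            (mu / tau2 + sum((row[j] - bias[j]) / var[j] for j in judges)) / prec
            for row in scores
        ]
        return means, 1.0 / prec

    for _ in range(iters):
        theta, theta_var = e_step()
        # M-step: closed-form updates given the posterior moments.
        bias = [sum(row[j] - t for row, t in zip(scores, theta)) / n_items
                for j in judges]
        var = [max(floor, theta_var + sum((row[j] - t - bias[j]) ** 2
                                          for row, t in zip(scores, theta)) / n_items)
               for j in judges]
        mu = sum(theta) / n_items
        tau2 = max(floor, theta_var + sum((t - mu) ** 2 for t in theta) / n_items)
        # Identifiability: enforce sum_j bias_j = 0 by moving the shift into mu.
        shift = sum(bias) / n_judges
        bias = [b - shift for b in bias]
        mu += shift

    theta, theta_var = e_step()
    return {"theta_mean": theta, "theta_var": theta_var,
            "bias": bias, "noise_var": var, "mu": mu, "tau2": tau2}


# Toy usage on synthetic data with three judges of different bias and noise.
rng = random.Random(0)
true_bias, true_sd = [0.1, -0.1, 0.0], [0.02, 0.05, 0.1]
true_theta = [rng.gauss(0.5, 0.15) for _ in range(200)]
data = [[t + b + rng.gauss(0, s) for b, s in zip(true_bias, true_sd)]
        for t in true_theta]
fit = fit_jury_em(data)
```

Shifting all biases by a constant and μ by the opposite amount leaves the likelihood unchanged, which is why the re-centring step does not disturb the EM updates.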

references

Dawid, Skene (1979). Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. JRSS-C.
Hovy, Berg-Kirkpatrick, Vaswani, Hovy (2013). Learning Whom to Trust with MACE. NAACL.

Aleatoric / epistemic decomposition

panoptes.uq.decomposition

shipped

Var_total = E_j[Var(score | judge=j)] + Var_j[E(score | judge=j)]. Nested resampling: outer over judges (epistemic), inner over temperature samples (aleatoric).
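The identity can be checked with a plug-in sketch. It assumes every judge contributes the same number of temperature samples and uses population variances, so the two terms sum exactly to the pooled variance. The nested resampling is not shown, and the function name is hypothetical.

```python
import statistics


def decompose_variance(samples_by_judge):
    """Split score variance for one item into aleatoric and epistemic parts.

    samples_by_judge: dict mapping judge name to a list of sampled scores,
    with the same number of samples for every judge.
    """
    within = [statistics.pvariance(s) for s in samples_by_judge.values()]
    means = [statistics.fmean(s) for s in samples_by_judge.values()]
    aleatoric = statistics.fmean(within)   # E_j[Var(score | judge=j)]
    epistemic = statistics.pvariance(means)  # Var_j[E(score | judge=j)]
    return {"aleatoric": aleatoric, "epistemic": epistemic,
            "total": aleatoric + epistemic}
```

For two judges with samples [0.4, 0.6] and [0.7, 0.9], the within-judge variance is 0.01 for each, the judge means are 0.5 and 0.8 with variance 0.0225, and the pooled variance of all four scores is 0.0325.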

references

Kendall, Gal (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS.
Depeweg, Hernández-Lobato, Doshi-Velez, Udluft (2018). Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. ICML.

Thompson-sampling jury routing

panoptes.routing.bandit

shipped

Beta(α, β) posterior per (judge, task_family) arm. Reward = epistemic-variance reduction per dollar spent. Updated online after each item. State is serializable for warm-start across runs.
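A sketch of Beta Thompson sampling with JSON warm-start. It assumes the reward has already been normalised to [0, 1] and uses a fractional update (α += r, β += 1 − r); the source states neither the normalisation nor the update rule. The class name, method names and JSON layout are hypothetical.

```python
import json
import random


class JuryBandit:
    """One Beta(alpha, beta) posterior per (judge, task_family) arm."""

    def __init__(self, judges, seed=None):
        self.judges = list(judges)
        self.state = {}  # task_family -> judge -> [alpha, beta]
        self.rng = random.Random(seed)

    def _arms(self, family):
        # Lazily create uniform Beta(1, 1) priors for a new task family.
        if family not in self.state:
            self.state[family] = {j: [1.0, 1.0] for j in self.judges}
        return self.state[family]

    def select(self, family):
        """Sample once from each judge's posterior and pick the largest draw."""
        draws = {j: self.rng.betavariate(a, b)
                 for j, (a, b) in self._arms(family).items()}
        return max(draws, key=draws.get)

    def update(self, family, judge, reward):
        """Online update with a reward assumed to lie in [0, 1]."""
        r = min(1.0, max(0.0, reward))
        arm = self._arms(family)[judge]
        arm[0] += r
        arm[1] += 1.0 - r

    def dumps(self):
        return json.dumps({"judges": self.judges, "state": self.state})

    @classmethod
    def loads(cls, text, seed=None):
        data = json.loads(text)
        bandit = cls(data["judges"], seed=seed)
        bandit.state = data["state"]
        return bandit
```

Keying the state by task family first keeps it a plain nested dict, which serialises to JSON without converting tuple keys.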

references

Russo, Van Roy, Kazerouni, Osband, Wen (2018). A Tutorial on Thompson Sampling. arXiv:1707.02038.
Chapelle, Li (2011). An Empirical Evaluation of Thompson Sampling. NeurIPS.

Coverage / calibration diagnostics

panoptes.stats

shipped

Marginal coverage with Clopper-Pearson CIs; conditional coverage per task family; reliability diagram with bootstrap bands; ECE / MCE / Brier; paired-bootstrap Spearman/Kendall + permutation test for judge disagreement.
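A sketch of empirical coverage with an exact Clopper-Pearson interval, computed by bisection on the binomial CDF so that it needs only the standard library. The function names are hypothetical, and `ci_alpha` is the error level of the confidence interval, not the conformal miscoverage α.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(1.0, total)


def _root(increasing_fn, iters=50):
    """Bisection on (0, 1) for a function that crosses zero from below."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if increasing_fn(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, ci_alpha=0.05):
    """Exact two-sided CI for a binomial proportion with k successes in n."""
    half = ci_alpha / 2
    # Lower bound: P(X >= k | p) = alpha / 2. Upper bound: P(X <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _root(lambda p: (1 - binom_cdf(k - 1, n, p)) - half)
    upper = 1.0 if k == n else _root(lambda p: half - binom_cdf(k, n, p))
    return lower, upper


def coverage_with_ci(y, lowers, uppers, ci_alpha=0.05):
    """Fraction of targets inside their intervals, with a Clopper-Pearson CI."""
    hits = sum(lo <= yi <= hi for yi, lo, hi in zip(y, lowers, uppers))
    return hits / len(y), clopper_pearson(hits, len(y), ci_alpha)
```

The same function gives conditional coverage per task family if it is called once on each family's subset.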

references

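Naeini, Cooper, Hauskrecht (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning. AAAI. Cited for ECE / MCE.
Gneiting, Raftery (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. JASA. Cited for the sharpness vs calibration framing.
Bröcker, Smith (2007). Increasing the Reliability of Reliability Diagrams. Weather and Forecasting. Cited for reliability bootstrap bands.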