PANOPTES
methods

Math & citations

One section per implemented method, listing the paper(s) it cites and the Python module where the implementation lives. The framework rests on the principle that every statistical claim is grounded in a published paper and that the residuals are auditable.

Split conformal prediction
panoptes.uq.conformal_split
shipped

Finite-sample marginal coverage ≥ 1 − α under exchangeability, using the ceil((n+1)(1−α))/n empirical quantile of the n calibration scores. Intervals are clipped to [0, 1] on the rubric scale.
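
The quantile correction can be sketched as follows. This is a minimal illustration, assuming absolute-residual conformity scores and plain NumPy arrays of predictions and reference scores on the rubric scale. The function name `split_conformal_interval` and its arguments are hypothetical and are not the API of `panoptes.uq.conformal_split`.

```python
import math
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Return (lower, upper) arrays for test_pred, clipped to [0, 1]."""
    cal_pred = np.asarray(cal_pred, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)

    # Assumed conformity score: absolute residual on the calibration split.
    scores = np.sort(np.abs(cal_true - cal_pred))
    n = scores.size

    # Rank of the corrected quantile: ceil((n + 1)(1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    # If k > n the calibration set is too small for this alpha,
    # so the unclipped interval is unbounded.
    q_hat = scores[k - 1] if k <= n else np.inf

    lower = np.clip(test_pred - q_hat, 0.0, 1.0)
    upper = np.clip(test_pred + q_hat, 0.0, 1.0)
    return lower, upper
```

With very small calibration sets the rank exceeds n, and after clipping the interval becomes the whole rubric range [0, 1].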

references
  • Papadopoulos, Proedrou, Vovk, Gammerman (2002). Inductive Confidence Machines for Regression. ECML.
  • Vovk, Gammerman, Shafer (2005). Algorithmic Learning in a Random World. Springer.
  • Angelopoulos, Bates (2023). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. arXiv:2107.07511.

Conformalized Quantile Regression (CQR)
panoptes.uq.conformal_adaptive
shipped

Input-adaptive intervals via scikit-learn's GradientBoostingRegressor(loss='quantile') fitted on judge-output features. Intervals narrow where the lower and upper quantile estimates are close together.
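
A hedged sketch of the CQR recipe from Romano et al., assuming a separate training and calibration split and a generic feature matrix in place of the judge-output features. The helper name `fit_cqr` is hypothetical, and the hyperparameters are scikit-learn defaults rather than whatever `panoptes.uq.conformal_adaptive` uses.

```python
import math
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_cqr(X_train, y_train, X_cal, y_cal, alpha=0.1, seed=0):
    """Fit lower/upper quantile regressors, then conformalize them on a calibration split."""
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2, random_state=seed)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2, random_state=seed)
    lo.fit(X_train, y_train)
    hi.fit(X_train, y_train)

    y_cal = np.asarray(y_cal, dtype=float)
    # CQR conformity score: how far y falls outside [q_lo, q_hi]
    # (negative when y lies strictly inside the band).
    scores = np.sort(np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal)))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    q_hat = scores[k - 1] if k <= n else np.inf

    def predict_interval(X):
        # Widen (or shrink, if q_hat < 0) both quantile estimates by the same margin.
        return lo.predict(X) - q_hat, hi.predict(X) + q_hat

    return predict_interval
```

The interval width at each input is the gap between the two fitted quantiles plus a constant 2·q_hat, which is what makes it input-adaptive.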

references
  • Romano, Patterson, Candès (2019). Conformalized Quantile Regression. NeurIPS.

Mondrian / group-conditional conformal
panoptes.uq.conformal_mondrian
shipped

Per-task-family quantiles. Conditional coverage P(Y ∈ C(X) | g(X) = g) ≥ 1 − α within each group. Falls back to the pooled marginal quantile when n_group < 50.
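
The per-group quantile with pooled fallback might look like the sketch below. It assumes conformity scores have already been computed on a calibration split and that each score carries a task-family label. The names `mondrian_quantiles` and `MIN_GROUP_SIZE` are illustrative only; the threshold of 50 is taken from the description above.

```python
import math
import numpy as np

MIN_GROUP_SIZE = 50  # below this, use the pooled marginal quantile

def _conformal_quantile(scores, alpha):
    """Corrected empirical quantile at rank ceil((n + 1)(1 - alpha))."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    return float(scores[k - 1]) if k <= n else float("inf")

def mondrian_quantiles(scores, groups, alpha=0.1, min_group_size=MIN_GROUP_SIZE):
    """Return ({group: q_hat}, pooled_q_hat) from calibration scores and group labels."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    pooled = _conformal_quantile(scores, alpha)
    table = {}
    for g in sorted(set(groups.tolist())):
        member_scores = scores[groups == g]
        if member_scores.size >= min_group_size:
            table[g] = _conformal_quantile(member_scores, alpha)
        else:
            # Too few calibration points: only marginal coverage is claimed here.
            table[g] = pooled
    return table, pooled
```

Groups that fall back to the pooled quantile inherit the marginal guarantee only, not the group-conditional one.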

references
  • Vovk, Lindsay, Nouretdinov, Gammerman (2003). Mondrian Confidence Machine.

Semantic entropy
panoptes.uq.semantic_entropy
shipped

Temperature samples are clustered by bidirectional NLI entailment; Shannon entropy is then computed over the cluster proportions and lies in [0, log N] for N samples. Two backends: local DeBERTa-v3-large-mnli (HF) and LLM-as-NLI.
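
A small sketch of the clustering and entropy steps, with the NLI backend abstracted behind a hypothetical `entails(premise, hypothesis) -> bool` callable. The greedy comparison against the first member of each cluster follows the procedure described by Farquhar et al.; the function names are illustrative and not part of `panoptes.uq.semantic_entropy`.

```python
import math
from typing import Callable, List, Sequence

def cluster_by_bidirectional_entailment(
    samples: Sequence[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group samples that entail each other in both directions."""
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, sample) and entails(sample, representative):
                cluster.append(sample)
                break
        else:
            # No existing cluster matched: start a new meaning cluster.
            clusters.append([sample])
    return clusters

def semantic_entropy(samples: Sequence[str], entails: Callable[[str, str], bool]) -> float:
    """Shannon entropy (in nats) over cluster proportions; lies in [0, log N]."""
    clusters = cluster_by_bidirectional_entailment(samples, entails)
    n = len(samples)
    return sum((len(c) / n) * math.log(n / len(c)) for c in clusters)
```

The entropy is 0 when every sample lands in one cluster and log N when all N samples are mutually non-equivalent.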

references
  • Farquhar, Kossen, Kuhn, Gal (2024). Detecting hallucinations in large language models using semantic entropy. Nature.

Self-consistency variance
panoptes.uq.self_consistency
shipped

MC variance + IQR + Bayesian bootstrap CI (Dirichlet(1,...,1) weights) over n temperature samples per (judge, item) pair.
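
As an illustration, the three summaries for one (judge, item) pair could be computed as below. This is a sketch assuming the credible interval is taken over the mean score; the function name, the number of posterior draws, and the 95% level are assumptions, not settings from `panoptes.uq.self_consistency`.

```python
import numpy as np

def self_consistency_summary(scores, n_draws=2000, ci=0.95, seed=0):
    """Sample variance, IQR, and a Bayesian-bootstrap interval for the mean score."""
    x = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)

    # Each row is one Dirichlet(1, ..., 1) weight vector over the n samples.
    weights = rng.dirichlet(np.ones(x.size), size=n_draws)
    posterior_means = weights @ x

    tail = (1 - ci) / 2
    lo, hi = np.quantile(posterior_means, [tail, 1 - tail])
    q25, q75 = np.percentile(x, [25, 75])
    return {
        "variance": float(x.var(ddof=1)),  # unbiased sample variance, needs n >= 2
        "iqr": float(q75 - q25),
        "mean_ci": (float(lo), float(hi)),
    }
```

Unlike the classical bootstrap, the Dirichlet weights are continuous, so every draw uses all n samples and the interval stays smooth even for small n.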

references
  • Wang, Wei, Schuurmans, Le, Chi, et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR.
  • Rubin (1981). The Bayesian Bootstrap. Annals of Statistics 9(1).

Hierarchical-Gaussian jury aggregation
panoptes.uq.disagreement
shipped

EM with closed-form updates for score_ij = θ_i + bias_j + ε_ij, where ε_ij ~ N(0, σ_j²). Recovers a per-item posterior over latent quality θ_i plus per-judge bias and precision 1/σ_j². Identifiability via Σ_j bias_j = 0.
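
One way to realize these updates is sketched below. It assumes a complete items × judges score matrix and a flat prior on each θ_i, neither of which is stated in the description, and it adds a small variance floor for numerical safety. `fit_jury_em` is a hypothetical name, not the interface of `panoptes.uq.disagreement`.

```python
import numpy as np

def fit_jury_em(scores, n_iter=200, var_floor=1e-8):
    """EM for score_ij = theta_i + bias_j + eps_ij, eps_ij ~ N(0, var_j).

    Returns (theta_mean, theta_var, bias, var); theta_var is shared by all
    items because every item is assumed to be scored by every judge.
    """
    s = np.asarray(scores, dtype=float)  # shape (n_items, n_judges)
    n_judges = s.shape[1]
    bias = np.zeros(n_judges)
    var = np.ones(n_judges)
    for _ in range(n_iter):
        # E-step: Gaussian posterior over theta_i (precision-weighted mean).
        prec = 1.0 / var
        theta_var = 1.0 / prec.sum()
        theta_mean = ((s - bias) * prec).sum(axis=1) * theta_var

        # M-step: per-judge bias, then enforce sum_j bias_j = 0 by moving
        # the common offset into theta.
        bias = (s - theta_mean[:, None]).mean(axis=0)
        shift = bias.mean()
        bias = bias - shift
        theta_mean = theta_mean + shift

        # M-step: per-judge noise variance, including posterior uncertainty in theta.
        resid = s - theta_mean[:, None] - bias
        var = np.maximum((resid ** 2).mean(axis=0) + theta_var, var_floor)
    return theta_mean, theta_var, bias, var
```

The per-judge precision is `1.0 / var`, so noisier judges contribute less to the posterior mean of each θ_i.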

references
  • Dawid, Skene (1979). Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. JRSS-C.
  • Hovy, Berg-Kirkpatrick, Vaswani, Hovy (2013). Learning Whom to Trust with MACE. NAACL.

Aleatoric / epistemic decomposition
panoptes.uq.decomposition
shipped

Var_total = E_j[Var(score | judge=j)] + Var_j[E(score | judge=j)]. Nested resampling: outer over judges (epistemic), inner over temperature samples (aleatoric).
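
The identity can be checked numerically with a plug-in estimate. The sketch below assumes a judges × samples score array for a single item with the same number of temperature samples per judge, and it leaves out the nested resampling step; `decompose_variance` is an illustrative name.

```python
import numpy as np

def decompose_variance(scores):
    """Split total score variance for one item into aleatoric and epistemic parts.

    scores: array of shape (n_judges, n_samples).
    """
    s = np.asarray(scores, dtype=float)
    # Aleatoric: average over judges of the within-judge (temperature) variance.
    aleatoric = s.var(axis=1).mean()
    # Epistemic: variance across judges of the per-judge mean score.
    epistemic = s.mean(axis=1).var()
    return float(aleatoric), float(epistemic), float(aleatoric + epistemic)
```

With population variances (ddof=0) and equal sample counts per judge, the two parts sum exactly to the variance of all scores pooled together.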

references
  • Kendall, Gal (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS.
  • Depeweg, Hernández-Lobato, Doshi-Velez, Udluft (2018). Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. ICML.

Thompson-sampling jury routing
panoptes.routing.bandit
shipped

Beta(α, β) per (judge, task_family) arm. Reward = epistemic-variance reduction / dollars. Online updates after each item. State serializable for warm-start across runs.
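
A compact sketch of such a router follows. It assumes the reward has already been rescaled to [0, 1] and applies a fractional Beta update (α += r, β += 1 − r); how the variance reduction per dollar is normalized is not specified above, so that mapping is left to the caller. The class `BetaBandit` and its JSON layout are hypothetical, not the format used by `panoptes.routing.bandit`.

```python
import json
import numpy as np

class BetaBandit:
    """Thompson sampling with one Beta(alpha, beta) posterior per (judge, task_family) arm."""

    def __init__(self, judges, seed=0):
        self.judges = list(judges)
        self.params = {}  # (judge, task_family) -> [alpha, beta]
        self.rng = np.random.default_rng(seed)

    def _arm(self, judge, family):
        # Unseen arms start from a uniform Beta(1, 1) prior.
        return self.params.setdefault((judge, family), [1.0, 1.0])

    def select(self, family):
        """Draw once from each arm's posterior and route to the best draw."""
        draws = [self.rng.beta(*self._arm(j, family)) for j in self.judges]
        return self.judges[int(np.argmax(draws))]

    def update(self, judge, family, reward):
        """Online update after one item; reward is assumed to lie in [0, 1]."""
        r = float(np.clip(reward, 0.0, 1.0))
        arm = self._arm(judge, family)
        arm[0] += r
        arm[1] += 1.0 - r

    def to_json(self):
        rows = [{"judge": j, "family": f, "alpha": a, "beta": b}
                for (j, f), (a, b) in self.params.items()]
        return json.dumps(rows)

    @classmethod
    def from_json(cls, judges, payload, seed=0):
        """Warm-start a bandit from a previously serialized state."""
        bandit = cls(judges, seed=seed)
        for row in json.loads(payload):
            bandit.params[(row["judge"], row["family"])] = [row["alpha"], row["beta"]]
        return bandit
```

A typical loop calls `select(family)`, scores the item with the chosen judge, and passes the normalized reward to `update`; `to_json` and `from_json` carry the posteriors across runs.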

references
  • Russo, Van Roy, Kazerouni, Osband, Wen (2018). A Tutorial on Thompson Sampling. arXiv:1707.02038.
  • Chapelle, Li (2011). An Empirical Evaluation of Thompson Sampling. NeurIPS.

Coverage / calibration diagnostics
panoptes.stats
shipped

Marginal coverage with Clopper-Pearson CIs; conditional coverage per task family; reliability diagram with bootstrap bands; ECE / MCE / Brier; paired-bootstrap Spearman/Kendall + permutation test for judge disagreement.
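
Two of these diagnostics are sketched below: an exact Clopper-Pearson interval for an observed coverage count, and ECE with equal-width bins. The sketch assumes SciPy is available and that ECE uses 10 bins; both function names are illustrative and are not taken from `panoptes.stats`.

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact binomial interval for k covered items out of n."""
    a = 1.0 - conf
    lo = 0.0 if k == 0 else float(beta.ppf(a / 2, k, n - k + 1))
    hi = 1.0 if k == n else float(beta.ppf(1 - a / 2, k + 1, n - k))
    return lo, hi

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-frequency-weighted mean of |accuracy - confidence| over equal-width bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Map each probability to a bin index; p == 1.0 goes into the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)
```

For example, `clopper_pearson(covered, n)` gives the interval to compare against the nominal 1 − α when checking marginal coverage on a held-out set.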

references
  • Naeini, Cooper, Hauskrecht (2015). ECE / MCE.
  • Gneiting, Raftery (2007). Sharpness vs calibration framing.
  • Bröcker, Smith (2007). Reliability bootstrap bands.