Math & citations
One section per implemented method, with the paper(s) it cites and the Python module the implementation lives in. The framework is built around the principle that every statistical claim is paper-grounded and the residuals are auditable.
Finite-sample marginal coverage ≥ 1 − α, using the ceil((n+1)(1−α))/n empirical quantile of the calibration scores as the quantile correction. Intervals are clipped to [0, 1] on the rubric scale.
- Papadopoulos, Proedrou, Vovk, Gammerman (2002). Inductive Confidence Machines for Regression. ECML.
- Vovk, Gammerman, Shafer (2005). Algorithmic Learning in a Random World. Springer.
- Angelopoulos, Bates (2023). A Gentle Introduction to Conformal Prediction. arXiv:2107.07511.
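The quantile correction above can be sketched as follows. This is a minimal illustration, not the framework's implementation: the function name `split_conformal_interval` is hypothetical, and the absolute-residual nonconformity score is an assumption, since the text does not say which score is used.

```python
import math
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Return (lower, upper) arrays for test_pred, clipped to [0, 1]."""
    cal_pred = np.asarray(cal_pred, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)
    n = cal_pred.size
    # Assumed nonconformity score: absolute residual on the calibration set.
    scores = np.sort(np.abs(cal_true - cal_pred))
    # Rank of the corrected quantile, i.e. the ceil((n+1)(1-alpha))/n quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    # If k > n the calibration set is too small for this alpha,
    # so only the trivial (full-range) interval is valid.
    q_hat = scores[k - 1] if k <= n else np.inf
    lower = np.clip(test_pred - q_hat, 0.0, 1.0)
    upper = np.clip(test_pred + q_hat, 0.0, 1.0)
    return lower, upper
```

With four calibration residuals of 0.1, 0.2, 0.3, 0.4 and `alpha=0.5`, the rank is ceil(5 × 0.5) = 3, so the half-width is 0.3.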
Input-adaptive intervals via sklearn GradientBoostingRegressor(loss='quantile') on judge-output features. Width shrinks where the quantile regressors are confident.
- Romano, Patterson, Candès (2019). Conformalized Quantile Regression. NeurIPS.
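A sketch of the conformalization step from Romano et al., under the assumption that lower and upper quantile predictions have already been produced by the fitted quantile regressors. The function name `cqr_intervals` is hypothetical, and clipping to [0, 1] is carried over from the rubric scale as an assumption.

```python
import math
import numpy as np

def cqr_intervals(cal_lo, cal_hi, cal_true, test_lo, test_hi, alpha=0.1):
    """Widen (or shrink) predicted quantile bands by a conformal correction."""
    cal_lo, cal_hi, cal_true = (
        np.asarray(a, dtype=float) for a in (cal_lo, cal_hi, cal_true)
    )
    # CQR score: distance by which the truth falls outside [lo, hi];
    # negative when the truth lies strictly inside the band.
    scores = np.sort(np.maximum(cal_lo - cal_true, cal_true - cal_hi))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    q_hat = scores[k - 1] if k <= n else np.inf
    # Assumption: intervals live on a [0, 1] rubric scale.
    lower = np.clip(np.asarray(test_lo, dtype=float) - q_hat, 0.0, 1.0)
    upper = np.clip(np.asarray(test_hi, dtype=float) + q_hat, 0.0, 1.0)
    return lower, upper
```

Because the correction is added to an input-dependent band, the final width still varies with the input, which is what distinguishes this from the fixed-width split conformal interval.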
Per-task-family quantiles. Conditional coverage P(Y ∈ C(X) | g(X) = g) ≥ 1 − α within each group. Falls back to pooled marginal when n_group < 50.
- Vovk, Lindsay, Nouretdinov, Gammerman (2003). Mondrian Confidence Machine.
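The per-group quantiles with pooled fallback might look like the sketch below. The names `conformal_quantile` and `mondrian_quantiles` are hypothetical, and the inputs are assumed to be precomputed nonconformity scores with one task-family label each; only the threshold of 50 comes from the text.

```python
import math
from collections import defaultdict

def conformal_quantile(scores, alpha):
    """Finite-sample corrected quantile of a list of nonconformity scores."""
    ordered = sorted(scores)
    n = len(ordered)
    k = math.ceil((n + 1) * (1 - alpha))
    return ordered[k - 1] if 1 <= k <= n else math.inf

def mondrian_quantiles(scores, groups, alpha=0.1, min_group_size=50):
    """Map each group label to its own quantile, or the pooled one if small."""
    pooled = conformal_quantile(scores, alpha)
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {
        group: conformal_quantile(vals, alpha) if len(vals) >= min_group_size else pooled
        for group, vals in by_group.items()
    }
```

Groups that fall back to the pooled quantile keep only the marginal guarantee, not the per-group one.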
Bidirectional NLI clustering of temperature samples, Shannon entropy over cluster sizes bounded in [0, log N]. Two backends: local DeBERTa-v3-large-mnli (HF) and LLM-as-NLI.
- Farquhar, Kossen, Kuhn, Gal (2024). Detecting hallucinations in large language models using semantic entropy. Nature.
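A minimal sketch of the clustering and entropy steps, assuming a caller-supplied `entails(a, b)` predicate that stands in for either NLI backend. Both function names are hypothetical, and the greedy comparison against each cluster's first member is one common simplification rather than necessarily what this framework does.

```python
import math

def cluster_by_bidirectional_entailment(samples, entails):
    """Greedily group samples that entail each other in both directions."""
    clusters = []
    for sample in samples:
        for cluster in clusters:
            representative = cluster[0]
            if entails(sample, representative) and entails(representative, sample):
                cluster.append(sample)
                break
        else:
            # No existing cluster matched, so this sample starts a new meaning.
            clusters.append([sample])
    return clusters

def semantic_entropy(samples, entails):
    """Shannon entropy (nats) over cluster proportions, in [0, log N]."""
    clusters = cluster_by_bidirectional_entailment(samples, entails)
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

The entropy is 0 when all N samples share one meaning and reaches log N when every sample forms its own cluster.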
MC variance + IQR + Bayesian bootstrap CI (Dirichlet(1,...,1) weights) over n temperature samples per (judge, item) pair.
- Wang, Wei, Schuurmans, Le, Chi, et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR.
- Rubin (1981). The Bayesian Bootstrap. Annals of Statistics 9(1).
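The three summaries for one (judge, item) pair can be sketched as below. The function name `sample_uncertainty`, the choice of the mean as the bootstrapped statistic, and the defaults for `n_boot` and `level` are assumptions made for illustration.

```python
import numpy as np

def sample_uncertainty(scores, n_boot=2000, level=0.95, seed=0):
    """Variance, IQR, and a Bayesian-bootstrap CI for the mean score."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    q25, q75 = np.percentile(scores, [25, 75])
    # Bayesian bootstrap (Rubin 1981): each replicate reweights the observed
    # samples with Dirichlet(1, ..., 1) weights instead of resampling them.
    weights = rng.dirichlet(np.ones(scores.size), size=n_boot)  # (n_boot, n)
    boot_means = weights @ scores
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(boot_means, [tail, 1.0 - tail])
    return {
        "variance": float(scores.var(ddof=1)),  # needs at least 2 samples
        "iqr": float(q75 - q25),
        "mean_ci": (float(lo), float(hi)),
    }
```

Since every replicate is a convex combination of the observed scores, the interval can never extend beyond the observed range.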
Closed-form EM for score_ij = θ_i + bias_j + ε_ij with ε_ij ~ N(0, σ_j²). Recovers per-item posterior over latent quality θ plus per-judge bias and precision. Identifiability via Σ_j bias_j = 0.
- Dawid, Skene (1979). Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. JRSS-C.
- Hovy, Berg-Kirkpatrick, Vaswani, Hovy (2013). Learning Whom to Trust with MACE. NAACL.
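One way the EM updates for score_ij = θ_i + bias_j + ε_ij could look, as a sketch under stated assumptions: a fully observed (items × judges) score matrix, a flat prior on θ_i, and a small variance floor for numerical safety. None of these choices, nor the name `fit_judge_model`, is confirmed by the text.

```python
import numpy as np

def fit_judge_model(scores, n_iter=200, var_floor=1e-8):
    """Return (theta_mean, theta_var, bias, judge_var) for a score matrix."""
    S = np.asarray(scores, dtype=float)  # shape (n_items, n_judges)
    n_judges = S.shape[1]
    bias = np.zeros(n_judges)
    var = np.full(n_judges, S.var() + var_floor)
    for _ in range(n_iter):
        # E-step: Gaussian posterior over theta_i is the precision-weighted
        # mean of the bias-corrected scores (flat prior assumed).
        prec = 1.0 / var
        post_var = 1.0 / prec.sum()
        theta = ((S - bias) * prec).sum(axis=1) * post_var
        # M-step: closed-form updates for per-judge bias and noise variance.
        resid = S - theta[:, None]
        bias = resid.mean(axis=0)
        var = np.maximum(((resid - bias) ** 2).mean(axis=0) + post_var, var_floor)
        # Identifiability: enforce sum_j bias_j = 0.
        bias = bias - bias.mean()
    # Final E-step so theta matches the returned bias and variances.
    prec = 1.0 / var
    post_var = 1.0 / prec.sum()
    theta = ((S - bias) * prec).sum(axis=1) * post_var
    return theta, post_var, bias, var
```

Under the flat prior, shifting every bias by a constant and every θ_i by the opposite constant leaves the likelihood unchanged, which is why the centering step is needed.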
Var_total = E_j[Var(score | judge=j)] + Var_j[E(score | judge=j)]. Nested resampling: outer over judges (epistemic), inner over temperature samples (aleatoric).
- Kendall, Gal (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS.
- Depeweg, Hernández-Lobato, Doshi-Velez, Udluft (2018). Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. ICML.
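The law-of-total-variance split and the nested resampling can be sketched as follows, assuming one item's scores arrive as a (judges × temperature samples) array with the same number of samples per judge. The function names and the use of population variances (`ddof=0`) are illustrative assumptions.

```python
import numpy as np

def decompose_variance(samples):
    """Return (aleatoric, epistemic) for a (n_judges, n_samples) array."""
    X = np.asarray(samples, dtype=float)
    aleatoric = X.var(axis=1).mean()  # E_j[Var(score | judge=j)]
    epistemic = X.mean(axis=1).var()  # Var_j[E(score | judge=j)]
    return float(aleatoric), float(epistemic)

def nested_bootstrap(samples, n_boot=500, seed=0):
    """Bootstrap replicates of (aleatoric, epistemic), shape (n_boot, 2)."""
    X = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    n_judges, n_samples = X.shape
    out = np.empty((n_boot, 2))
    for b in range(n_boot):
        # Outer level: resample judges (epistemic).
        judges = rng.integers(0, n_judges, size=n_judges)
        # Inner level: resample temperature samples within each drawn judge.
        cols = rng.integers(0, n_samples, size=(n_judges, n_samples))
        out[b] = decompose_variance(X[judges[:, None], cols])
    return out
```

With equal sample counts per judge and `ddof=0`, the two components sum exactly to the variance of the pooled scores.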
Beta(α, β) per (judge, task_family) arm. Reward = epistemic-variance reduction / dollars. Online updates after each item. State serializable for warm-start across runs.
- Russo, Van Roy, Kazerouni, Osband, Wen (2018). A Tutorial on Thompson Sampling. arXiv:1707.02038.
- Chapelle, Li (2011). An Empirical Evaluation of Thompson Sampling. NeurIPS.
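A minimal sketch of the Beta-arm router with serializable state. The class name `BetaThompsonRouter`, the uniform Beta(1, 1) prior, and the fractional update for a reward already normalized to [0, 1] are all assumptions; how the framework maps variance reduction per dollar onto that range is not stated.

```python
import json
import random

class BetaThompsonRouter:
    """Thompson sampling over (judge, task_family) arms with Beta posteriors."""

    def __init__(self, arms, seed=None):
        # Assumed prior: Beta(1, 1) for every arm.
        self.params = {arm: [1.0, 1.0] for arm in arms}
        self.rng = random.Random(seed)

    def select(self, task_family):
        """Sample each candidate arm's posterior and pick the largest draw."""
        candidates = [arm for arm in self.params if arm[1] == task_family]
        draws = {arm: self.rng.betavariate(*self.params[arm]) for arm in candidates}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        """Online update; reward is assumed pre-normalized to [0, 1]."""
        r = min(max(reward, 0.0), 1.0)
        self.params[arm][0] += r
        self.params[arm][1] += 1.0 - r

    def to_json(self):
        """Serialize arm state so a later run can warm-start from it."""
        return json.dumps([[j, f, a, b] for (j, f), (a, b) in self.params.items()])

    @classmethod
    def from_json(cls, text, seed=None):
        router = cls([], seed=seed)
        router.params = {(j, f): [a, b] for j, f, a, b in json.loads(text)}
        return router
```

Sampling from the posterior rather than taking its mean is what keeps rarely tried arms in play while their uncertainty is still high.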
Marginal coverage with Clopper-Pearson CIs; conditional coverage per task family; reliability diagram with bootstrap bands; ECE / MCE / Brier; paired-bootstrap Spearman/Kendall + permutation test for judge disagreement.
- Naeini, Cooper, Hauskrecht (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning. AAAI. Source for ECE / MCE.
- Gneiting, Raftery (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. JASA. Sharpness vs calibration framing.
- Bröcker, Smith (2007). Increasing the Reliability of Reliability Diagrams. Weather and Forecasting. Reliability bootstrap bands.
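The ECE / MCE / Brier portion of the diagnostics can be sketched as below, assuming predicted probabilities in [0, 1], binary outcomes, and equal-width bins. The function name `calibration_metrics` and the default of 10 bins are illustrative choices, not taken from the framework.

```python
import numpy as np

def calibration_metrics(confidences, outcomes, n_bins=10):
    """Expected and maximum calibration error plus the Brier score."""
    p = np.asarray(confidences, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1
    # and p == 1.0 lands in the last bin.
    idx = np.digitize(p, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        # Gap between observed frequency and mean confidence in this bin.
        gap = abs(y[mask].mean() - p[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of items in the bin
        mce = max(mce, gap)
    brier = float(np.mean((p - y) ** 2))
    return {"ece": float(ece), "mce": float(mce), "brier": brier}
```

ECE averages the per-bin gaps weighted by bin occupancy, while MCE reports the single worst bin, so the two can diverge sharply when one sparsely populated bin is badly calibrated.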