splita: Methods Reference
Every method in splita falls into one of three categories: an original implementation of a published paper, a wrapper around scipy/sklearn with added validation and A/B-testing-specific logic, or a hybrid that implements the core algorithm from scratch but uses scipy/sklearn for numerical primitives (solvers, distributions).
Legend:
- Original: Algorithm implemented from the paper's equations. No delegation to existing statistical libraries for the core logic.
- Wrapper: Delegates the statistical computation to scipy or sklearn. splita adds validation, auto-detection, error messages, and result formatting.
- Hybrid: Core algorithm from paper + scipy/sklearn for numerical primitives (e.g., norm.ppf, linalg.solve).
Core Analysis (splita.core)

| Class | Type | What it does | Reference |
|---|---|---|---|
| Experiment (z-test) | Hybrid | Pooled SE test statistic, unpooled SE for CI | Newcombe, R.G. (1998). "Two-sided confidence intervals for the single proportion." Statistics in Medicine, 17(8), 857-872. |
| Experiment (t-test) | Wrapper | Welch's t-test via scipy.stats.ttest_ind | Welch, B.L. (1947). "The generalization of Student's problem when several different population variances are involved." Biometrika, 34(1-2), 28-35. |
| Experiment (Mann-Whitney) | Hybrid | P-value via scipy.stats.mannwhitneyu, Hodges-Lehmann estimator + Moses CI implemented from scratch | Hodges, J.L. & Lehmann, E.L. (1963). "Estimates of location based on rank tests." Annals of Mathematical Statistics, 34(2), 598-611. Moses, L.E. (1965). "Confidence limits from rank tests." Technometrics, 7(2), 257-260. |
| Experiment (chi-square) | Wrapper | scipy.stats.chi2_contingency | Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables." Philosophical Magazine, 50(302), 157-175. |
| Experiment (delta method) | Original | Linearized ratio metric, Welch t-test on linearized values | Deng, A., Knoblich, U., & Lu, J. (2018). "Applying the Delta Method in Metric Analytics." KDD '18. |
| Experiment (bootstrap) | Original | Vectorized resampling, shifted bootstrap p-value, percentile CI | Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall. |
| BayesianExperiment | Original | Beta-Binomial and Normal-Inverse-Gamma conjugate posteriors, MC inference | Berry, D.A. (2006). "Bayesian clinical trials." Nature Reviews Drug Discovery, 5(1), 27-36. |
| QuantileExperiment | Original | Bootstrap inference for quantile differences at arbitrary quantiles | Efron, B. (1979). "Bootstrap methods: Another look at the jackknife." Annals of Statistics, 7(1), 1-26. |
| SampleSize (proportion) | Hybrid | Farrington-Manning formula with pooled/unpooled SE split | Farrington, C.P. & Manning, G. (1990). "Test statistics and sample size formulae for comparative binomial trials." Statistics in Medicine, 9(12), 1447-1454. |
| SampleSize (mean) | Hybrid | Two-sample t-test power formula | Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Lawrence Erlbaum. |
| SampleSize (ratio) | Original | Delta method variance for ratio metrics | Deng, A., Knoblich, U., & Lu, J. (2018). "Applying the Delta Method in Metric Analytics." KDD '18. |
| SampleSize (MDE inverse) | Hybrid | Numerical inversion via scipy.optimize.brentq | Brent, R.P. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall. |
| SRMCheck | Wrapper | Chi-square goodness-of-fit via scipy.stats.chi2.sf | Fabijan, A. et al. (2019). "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments." KDD '19. |
| MultipleCorrection (BH) | Original | Step-up procedure with reverse monotonicity enforcement | Benjamini, Y. & Hochberg, Y. (1995). "Controlling the false discovery rate." JRSS-B, 57(1), 289-300. |
| MultipleCorrection (Bonferroni) | Original | p * n, capped at 1 | Bonferroni, C.E. (1936). "Teoria statistica delle classi e calcolo delle probabilita." Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze. |
| MultipleCorrection (Holm) | Original | Step-down procedure with forward monotonicity enforcement | Holm, S. (1979). "A simple sequentially rejective multiple test procedure." Scandinavian Journal of Statistics, 6(2), 65-70. |
| MultipleCorrection (BY) | Original | BH with harmonic number correction for dependent tests | Benjamini, Y. & Yekutieli, D. (2001). "The control of the false discovery rate in multiple testing under dependency." Annals of Statistics, 29(4), 1165-1188. |
| PowerSimulation | Wrapper | Monte Carlo simulation using Experiment internally | None (standard simulation methodology) |
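| HTEEstimator (T-learner) | Wrapper | Two separate sklearn models, CATE = E[Y \| X,T=1] - E[Y \| X,T=0] | Kunzel, S.R. et al. (2019). "Metalearners for estimating heterogeneous treatment effects using machine learning." PNAS, 116(10), 4156-4165. |
| HTEEstimator (S-learner) | Wrapper | Single sklearn model with treatment indicator as feature | Kunzel, S.R. et al. (2019). Same as above. |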
| TriggeredExperiment | Wrapper | ITT and per-protocol analysis via Experiment | Hernan, M.A. & Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall. |
| InteractionTest | Hybrid | Per-segment experiments + Cochran's Q heterogeneity test | Cochran, W.G. (1954). "The combination of estimates from different experiments." Biometrics, 10(1), 101-129. |
| MultiObjectiveExperiment | Wrapper | Runs Experiment per metric + MultipleCorrection + Pareto analysis | None (composite of existing methods) |
| StratifiedExperiment | Original | Neyman-style stratified difference-in-means with weighted variance | Neyman, J. (1923/1990). "On the application of probability theory to agricultural experiments." Statistical Science, 5(4), 465-472. Miratrix, L.W. et al. (2013). "Adjusting treatment effect estimates by post-stratification in randomized experiments." JRSS-B, 75(2), 369-396. |
| CausalForest | Hybrid | T-learner with sklearn.RandomForestRegressor + honest splitting + jackknife CI | Athey, S., Tibshirani, J., & Wager, S. (2019). "Generalized Random Forests." Annals of Statistics, 47(2), 1148-1178. Wager, S. & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." JASA, 113(523), 1228-1242. |
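
To make the MultipleCorrection rows concrete, the sketch below shows one common way to compute Benjamini-Hochberg adjusted p-values (scale each sorted p-value by n/rank, then enforce monotonicity from the largest p-value downward) alongside the Bonferroni rule of p * n capped at 1. It is an illustration of the textbook procedures, not splita's actual code; the function names `bh_adjust` and `bonferroni_adjust` are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (illustrative sketch)."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices that sort the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the ones above it (the monotonicity step).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        scaled = p_values[idx] * m / rank
        running_min = min(running_min, scaled)
        adjusted[idx] = running_min
    return adjusted


def bonferroni_adjust(p_values):
    """Bonferroni adjusted p-values: p * n, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Both helpers return the adjusted values in the original input order, so a hypothesis is rejected at level alpha when its adjusted p-value is at most alpha.
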
Variance Reduction (splita.variance)

| Class | Type | What it does | Reference |
|---|---|---|---|
| CUPED | Original | Y_adj = Y - theta*(X - mean(X)), theta = Cov(Y,X)/Var(X) | Deng, A. et al. (2013). "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data." WSDM '13. |
| CUPAC | Hybrid | Cross-validated ML predictions as CUPED covariate. sklearn for models. | Tang, D. et al. (2020). "An empirical evaluation of CUPAC." DoorDash Engineering Blog. Guo, Y. et al. (2021). "Machine Learning for Variance Reduction in Online Experiments." NeurIPS '21. |
| OutlierHandler (winsorize/trim) | Original | Percentile-based capping on pooled data | Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. (IQR rule) |
| OutlierHandler (clustering) | Wrapper | DBSCAN via sklearn.cluster.DBSCAN for outlier detection | Ester, M. et al. (1996). "A density-based algorithm for discovering clusters." KDD '96. |
| MultivariateCUPED | Original | theta = Cov(Y,X) @ Var(X)^{-1}, multivariate extension | Deng, A. & Shi, X. (2016). "Optimal Variance Reduction for Online Controlled Experiments." Microsoft Technical Report. Poyarkov, A. et al. (2016). "Boosted Decision Tree Regression Adjustment for Variance Reduction." KDD '16. |
| RegressionAdjustment | Original | Fully-interacted OLS with HC2 robust standard errors | Lin, W. (2013). "Agnostic notes on regression adjustments to experimental data." Annals of Applied Statistics, 7(1), 295-318. |
| AdaptiveWinsorizer | Original | Grid-search optimal capping thresholds to minimize effect variance | Gupta, S. et al. (2019). "Top Challenges from the first Practical Online Controlled Experiments Summit." KDD '19. |
| DoubleML | Hybrid | Cross-fitted outcome + propensity models, influence-function SE. sklearn for models. | Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters." Econometrics Journal, 21(1), C1-C68. |
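
The CUPED formula in the table, Y_adj = Y - theta*(X - mean(X)) with theta = Cov(Y,X)/Var(X), can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not splita's implementation: the function name `cuped_adjust` is hypothetical, and it assumes theta is estimated from whatever sample is passed in (for example, both arms pooled).

```python
from statistics import fmean


def cuped_adjust(y, x):
    """Return (adjusted outcomes, theta) for outcome y and pre-period covariate x."""
    if len(y) != len(x) or len(y) < 2:
        raise ValueError("y and x must have the same length (at least 2)")
    n = len(y)
    x_bar, y_bar = fmean(x), fmean(y)
    # Sample covariance and variance (the n - 1 factors cancel in theta,
    # but are kept so each quantity matches its usual definition).
    cov_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    if var_x == 0:
        raise ValueError("covariate has zero variance")
    theta = cov_yx / var_x
    # Centering x keeps the mean of the adjusted outcome equal to mean(y).
    y_adj = [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]
    return y_adj, theta
```

Because the covariate is centered, the adjustment leaves the sample mean of the outcome unchanged and only removes the part of its variance that the covariate explains.
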
Sequential Testing (splita.sequential)

| Class | Type | What it does | Reference |
|---|---|---|---|
| mSPRT | Original | Mixture likelihood ratio, always-valid p-values, streaming API | Johari, R., Pekelis, L., & Walsh, D.J. (2017/2022). "Always Valid Inference: Continuous Monitoring of A/B Tests." Operations Research, 70(3), 1806-1821. (arXiv:1512.04922) |
| GroupSequential | Original | Conditional error spending boundaries (Lan-DeMets approach) | O'Brien, P.C. & Fleming, T.R. (1979). "A multiple testing procedure for clinical trials." Biometrics, 35(3), 549-556. Lan, K.K.G. & DeMets, D.L. (1983). "Discrete sequential boundaries for clinical trials." Biometrika, 70(3), 659-663. |
| EValue | Original | E-value = mixture likelihood ratio, always-valid testing | Vovk, V. & Wang, R. (2021). "E-values: Calibration, combination, and applications." Annals of Statistics, 49(3), 1736-1754. Grunwald, P., de Heide, R., & Koolen, W. (2020). "Safe Testing." arXiv:1906.07801. |
| ConfidenceSequence | Original | Time-uniform confidence sequences, tighter than mSPRT CIs | Howard, S.R. et al. (2021). "Time-uniform, nonparametric, nonasymptotic confidence sequences." Annals of Statistics, 49(2), 1055-1080. |
| EProcess | Original | Multiplicative e-value accumulation, GRAPA and universal methods | Grunwald, P., de Heide, R., & Koolen, W. (2020). "Safe Testing." arXiv:1906.07801. Ramdas, A. et al. (2023). "Game-theoretic Statistics and Safe Anytime-valid Inference." Statistical Science. |
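
As a rough illustration of the Lan-DeMets idea behind the GroupSequential row, the sketch below evaluates the O'Brien-Fleming-type spending function, alpha(t) = 2 * (1 - Phi(z_{1-alpha/2} / sqrt(t))), at a set of information fractions. Which spending function splita uses is an assumption here, the function names are hypothetical, and the sketch stops at the cumulative and per-look alpha; turning these into critical boundaries requires a recursive numerical integration that is not shown.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t in (0, 1]."""
    if not 0 < t <= 1:
        raise ValueError("information fraction must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / sqrt(t)))


def incremental_alpha(info_fractions, alpha=0.05):
    """Alpha newly spent at each look, given increasing information fractions."""
    spent_so_far = 0.0
    increments = []
    for t in info_fractions:
        spent = obf_alpha_spent(t, alpha)
        increments.append(spent - spent_so_far)
        spent_so_far = spent
    return increments
```

With this function almost no alpha is spent at early looks, and the full alpha is reached only at t = 1, which is why O'Brien-Fleming-type designs are conservative early on.
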
Bandits (splita.bandits)

| Class | Type | What it does | Reference |
|---|---|---|---|
| ThompsonSampler | Original | Beta-Binomial, Normal-Inverse-Gamma, Gamma-Poisson conjugate posteriors | Russo, D. et al. (2018). "A Tutorial on Thompson Sampling." Foundations and Trends in Machine Learning, 11(1), 1-96. Thompson, W.R. (1933). "On the likelihood that one unknown probability exceeds another." Biometrika, 25(3-4), 285-294. |
| LinTS | Original | Bayesian linear regression posterior with Cholesky sampling | Agrawal, S. & Goyal, N. (2013). "Thompson Sampling for Contextual Bandits with Linear Payoffs." ICML '13. |
| LinUCB | Original | Upper confidence bound on linear reward model | Li, L. et al. (2010). "A Contextual-Bandit Approach to Personalized News Article Recommendation." WWW '10. |
| BayesianStopping | Original | Standalone stopping rule evaluator (expected loss, prob best, precision) | None (standard Bayesian decision theory) |
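
For readers unfamiliar with the Beta-Binomial case listed for ThompsonSampler, a minimal sketch of Thompson sampling with binary rewards follows. The class `BetaBernoulliThompson` and its methods are hypothetical names for illustration and are not splita's API; a uniform Beta(1, 1) prior on each arm is assumed.

```python
import random


class BetaBernoulliThompson:
    """Thompson sampling for binary rewards with a Beta(1, 1) prior per arm."""

    def __init__(self, n_arms, seed=None):
        self.alpha = [1.0] * n_arms  # 1 + observed successes
        self.beta = [1.0] * n_arms   # 1 + observed failures
        self._rng = random.Random(seed)

    def select_arm(self):
        # Draw one sample from each arm's posterior and play the largest draw.
        draws = [self._rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        # Conjugate update: a success increments alpha, a failure increments beta.
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

A typical loop calls `select_arm()`, observes a 0/1 reward for that arm, and passes it to `update()`; over time the sampler concentrates traffic on the arm with the higher posterior success rate.
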
Causal Inference (splita.causal)

| Class | Type | What it does | Reference |
|---|---|---|---|
| DifferenceInDifferences | Hybrid | Classic two-period DiD with delta-method SE + parallel trends check | Card, D. & Krueger, A.B. (1994). "Minimum Wages and Employment." American Economic Review, 84(4), 772-793. Angrist, J.D. & Pischke, J.S. (2009). Mostly Harmless Econometrics. Princeton University Press. |
| SyntheticControl | Hybrid | Constrained optimization (SLSQP) for donor weights, pre/post comparison | Abadie, A., Diamond, A., & Hainmueller, J. (2010). "Synthetic Control Methods for Comparative Case Studies." JASA, 105(490), 493-505. |
| ClusterExperiment | Hybrid | Cluster-robust inference via cluster-mean collapse + Welch t-test + ICC | Cameron, A.C. & Miller, D.L. (2015). "A practitioner's guide to cluster-robust inference." Journal of Human Resources, 50(2), 317-372. |
| SwitchbackExperiment | Hybrid | Period-level averaging + t-test on period means | Bojinov, I. & Shephard, N. (2019). "Time series experiments and causal estimands." JASA, 114(528), 1477-1491. |
| SurrogateEstimator | Wrapper | sklearn model mapping short-term → long-term outcome | Athey, S. et al. (2019). "The Surrogate Index." NBER Working Paper 26463. |
| SurrogateIndex | Hybrid | Cross-fitted multi-surrogate index with delta-method CI | Athey, S., Chetty, R., Imbens, G.W., & Kang, H. (2019). "The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects." NBER Working Paper 26463. |
| InterferenceExperiment | Original | Horvitz-Thompson at cluster level with ICC-based design effect | Basse, G.W. & Feller, A. (2018). "Analyzing Two-Stage Experiments in the Presence of Interference." JASA, 113(521), 41-55. Hudgens, M.G. & Halloran, M.E. (2008). "Toward Causal Inference With Interference." JASA, 103(482), 832-842. |
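
The point estimate behind a classic two-period difference-in-differences is the treated group's pre-to-post change minus the control group's change. The sketch below shows only that estimate; the function name `did_estimate` is hypothetical, and the standard error and parallel-trends check mentioned in the table are deliberately left out.

```python
from statistics import fmean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-period DiD: (treated post - pre) minus (control post - pre)."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change
```

For example, if the treated mean rises by 5 while the control mean rises by 2 over the same window, the estimate attributes the remaining 3 to the treatment, provided both groups would otherwise have followed parallel trends.
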
Diagnostics (splita.diagnostics)

| Class | Type | What it does | Reference |
|---|---|---|---|
| NoveltyCurve | Original | Rolling-window effect analysis with trend detection | None (standard diagnostic methodology, used at Booking.com, Microsoft) |
| AATest | Wrapper | Random-split simulations using Experiment to validate FP rate | None (standard pre-experiment validation, described in Kohavi et al. 2020) |
| EffectTimeSeries | Hybrid | Cumulative experiment at each timestamp | None (standard diagnostic, described in Kohavi et al. 2020) |
| MetricSensitivity | Hybrid | Monte Carlo power estimation from historical variance | None (standard pre-experiment planning) |
| VarianceEstimator | Original | Distributional analysis with skewness/kurtosis diagnostics | None (standard descriptive statistics with A/B-specific recommendations) |
| NonStationaryDetector | Original | CUSUM-like change-point detection on effect time series | Page, E.S. (1954). "Continuous inspection schemes." Biometrika, 41(1-2), 100-115. (CUSUM) |
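
The table describes NonStationaryDetector only as "CUSUM-like", so the following is a sketch of the textbook two-sided tabular CUSUM rather than splita's detector. The function name `cusum_first_alarm` and its parameters (`target`, `slack`, `threshold`) are hypothetical and would need tuning for a real effect time series.

```python
def cusum_first_alarm(values, target, slack, threshold):
    """Return the index of the first CUSUM alarm, or None if none is raised.

    target:    expected level of the series when nothing has changed
    slack:     deviation that is tolerated without accumulating evidence
    threshold: cumulative evidence required to raise an alarm
    """
    upper = 0.0  # evidence for an upward shift
    lower = 0.0  # evidence for a downward shift
    for i, value in enumerate(values):
        # Each sum accumulates deviations beyond the slack and resets at zero.
        upper = max(0.0, upper + (value - target - slack))
        lower = max(0.0, lower + (target - value - slack))
        if upper > threshold or lower > threshold:
            return i
    return None
```

Applied to a series of daily effect estimates, an alarm index marks the first day on which the accumulated drift away from `target` exceeds `threshold`.
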
Experiment Design (splita.design)

| Class | Type | What it does | Reference |
|---|---|---|---|
| PairwiseDesign | Original | Mahalanobis distance greedy matching for balanced assignment | Greevy, R. et al. (2004). "Optimal multivariate matching before randomization." Biostatistics, 5(2), 263-275. Mahalanobis, P.C. (1936). "On the generalised distance in statistics." Proceedings of the National Institute of Sciences of India, 2, 49-55. |
Experiment Governance (splita.governance)

| Class | Type | What it does | Reference |
|---|---|---|---|
| ExperimentRegistry | Original | In-memory experiment tracking with date filtering | None (operational tooling, no academic reference) |
| ConflictDetector | Original | Overlapping experiment detection (traffic, metric, segment conflicts) | None (operational tooling, described in Kohavi, Tang & Xu 2020) |

General References
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Lawrence Erlbaum.
- Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall.
- Angrist, J.D. & Pischke, J.S. (2009). Mostly Harmless Econometrics. Princeton University Press.