Benchmarks: splita vs scipy.stats vs statsmodels

A comparison of splita against the two most common Python statistics libraries for A/B testing workflows.

Lines of Code for Common Tasks

Task 1: Run a two-sample z-test with confidence interval

splita (3 lines)

from splita import Experiment
result = Experiment(ctrl, trt).run()
print(result.pvalue, result.ci_lower, result.ci_upper, result.significant)

scipy.stats (12 lines)

import numpy as np
from scipy.stats import norm

ctrl_mean, trt_mean = np.mean(ctrl), np.mean(trt)
ctrl_se = np.std(ctrl, ddof=1) / np.sqrt(len(ctrl))
trt_se = np.std(trt, ddof=1) / np.sqrt(len(trt))
se_diff = np.sqrt(ctrl_se**2 + trt_se**2)
z = (trt_mean - ctrl_mean) / se_diff
pvalue = 2 * norm.sf(abs(z))  # sf(x) = 1 - cdf(x), without cancellation error for large |z|
lift = trt_mean - ctrl_mean
ci_lower = lift - 1.96 * se_diff
ci_upper = lift + 1.96 * se_diff
significant = pvalue < 0.05

statsmodels (8 lines)

from statsmodels.stats.weightstats import ztest, CompareMeans, DescrStatsW
z_stat, pvalue = ztest(trt, ctrl)
d1, d2 = DescrStatsW(ctrl), DescrStatsW(trt)
cm = CompareMeans(d2, d1)
ci = cm.zconfint_diff()
lift = d2.mean - d1.mean
ci_lower, ci_upper = ci
significant = pvalue < 0.05
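
The manual steps above can also be wrapped in one reusable function. The following is a minimal, dependency-free sketch using only the standard library; the function name `two_sample_ztest` and the returned dictionary keys are illustrative choices, not part of splita, scipy, or statsmodels. It assumes an unpooled (unequal-variance) standard error, as in the scipy.stats snippet, and derives the critical value from `alpha` instead of hard-coding 1.96.

```python
import math
from statistics import NormalDist, fmean, variance


def two_sample_ztest(ctrl, trt, alpha=0.05):
    """Unpooled two-sample z-test on the difference in means (trt - ctrl)."""
    # Standard error of the difference, using sample variances (ddof=1)
    se_diff = math.sqrt(variance(ctrl) / len(ctrl) + variance(trt) / len(trt))
    lift = fmean(trt) - fmean(ctrl)
    z = lift / se_diff
    std_normal = NormalDist()
    # Two-sided p-value, computed from the lower tail to avoid cancellation
    pvalue = 2 * std_normal.cdf(-abs(z))
    # Critical value for the requested level (about 1.96 when alpha=0.05)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return {
        "lift": lift,
        "z": z,
        "pvalue": pvalue,
        "ci_lower": lift - z_crit * se_diff,
        "ci_upper": lift + z_crit * se_diff,
        "significant": pvalue < alpha,
    }


res = two_sample_ztest([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
print(res)
```

With these toy inputs both groups have the same variance and the standard error of the difference is exactly 1, so the z statistic equals the lift.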

Task 2: Power analysis / sample size calculation

splita (2 lines)

from splita import SampleSize
plan = SampleSize.for_proportion(baseline=0.10, mde=0.02, power=0.80)

scipy.stats (8 lines)

from scipy.stats import norm
import math

p1, p2 = 0.10, 0.12
z_alpha = norm.ppf(1 - 0.05 / 2)
z_beta = norm.ppf(0.80)
es = 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))
n = math.ceil(2 * (z_alpha + z_beta) ** 2 / es ** 2)  # per group; factor 2 for two independent samples

statsmodels (4 lines)

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower
es = proportion_effectsize(0.10, 0.12)
n = NormalIndPower().solve_power(es, power=0.80, alpha=0.05, ratio=1)
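
As a sanity check on the sample-size arithmetic, the sketch below computes the per-group n for a two-sided test with equal allocation, then plugs it back into the power formula. It uses only the standard library; `n_per_group` and `achieved_power` are hypothetical helper names. It assumes the normal approximation with Cohen's h as the effect size, which should land close to the statsmodels result above (statsmodels returns an unrounded float).

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Effect size for two proportions via the arcsine transform."""
    return 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample test, equal allocation."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    # Factor of 2: the difference of two independent groups has twice the variance
    return math.ceil(2 * ((z_alpha + z_beta) / cohens_h(p1, p2)) ** 2)


def achieved_power(p1, p2, n, alpha=0.05):
    """Approximate power at n per group (ignores the negligible opposite tail)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(abs(cohens_h(p1, p2)) * math.sqrt(n / 2) - z_alpha)


n = n_per_group(0.10, 0.12)
print(n, round(achieved_power(0.10, 0.12, n), 4))
```

Omitting the factor of 2 yields the one-sample size, roughly half of what a two-group experiment needs.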

Task 3: CUPED variance reduction

splita (3 lines)

from splita.variance import CUPED
cuped = CUPED()
ctrl_adj, trt_adj = cuped.fit_transform(ctrl, trt, pre_ctrl, pre_trt)

scipy.stats (10 lines)

import numpy as np

# Manual CUPED implementation
pooled = np.concatenate([ctrl, trt])
pooled_pre = np.concatenate([pre_ctrl, pre_trt])
theta = np.cov(pooled, pooled_pre)[0, 1] / np.var(pooled_pre, ddof=1)
pre_mean = np.mean(pooled_pre)
ctrl_adj = ctrl - theta * (pre_ctrl - pre_mean)
trt_adj = trt - theta * (pre_trt - pre_mean)

statsmodels: No built-in CUPED. Must implement manually (same as scipy).
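
To make the effect of the manual adjustment concrete, here is a small standard-library sketch that applies the same pooled-theta formula to synthetic data and compares variances before and after. The data-generating process (post-period metric = pre-period metric + noise, plus a 0.3 shift for treatment) and the helper name `cuped_adjust` are assumptions made purely for illustration.

```python
import random
from statistics import fmean, variance


def cuped_adjust(ctrl, trt, pre_ctrl, pre_trt):
    """CUPED with a single theta estimated on the pooled sample."""
    pooled = list(ctrl) + list(trt)
    pooled_pre = list(pre_ctrl) + list(pre_trt)
    mean_y, mean_x = fmean(pooled), fmean(pooled_pre)
    # Sample covariance (ddof=1) between the metric and its pre-period covariate
    cov_xy = sum(
        (y - mean_y) * (x - mean_x) for y, x in zip(pooled, pooled_pre)
    ) / (len(pooled) - 1)
    theta = cov_xy / variance(pooled_pre)
    ctrl_adj = [y - theta * (x - mean_x) for y, x in zip(ctrl, pre_ctrl)]
    trt_adj = [y - theta * (x - mean_x) for y, x in zip(trt, pre_trt)]
    return ctrl_adj, trt_adj, theta


# Synthetic data: the pre-period value explains most of the post-period variance
rng = random.Random(0)
pre_ctrl = [rng.gauss(10, 2) for _ in range(2000)]
pre_trt = [rng.gauss(10, 2) for _ in range(2000)]
ctrl = [x + rng.gauss(0, 1) for x in pre_ctrl]
trt = [x + 0.3 + rng.gauss(0, 1) for x in pre_trt]

ctrl_adj, trt_adj, theta = cuped_adjust(ctrl, trt, pre_ctrl, pre_trt)
print(f"theta={theta:.3f}")
print(f"variance before={variance(ctrl):.2f}, after={variance(ctrl_adj):.2f}")
```

The variance reduction depends on how strongly the covariate correlates with the metric; with an uncorrelated covariate, theta is near zero and the adjustment does little.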

Feature Comparison

| Feature | splita | scipy.stats | statsmodels |
| --- | --- | --- | --- |
| Z-test / t-test | Yes | Yes | Yes |
| Auto metric detection | Yes | No | No |
| Welch's t-test (default) | Yes | Yes | Yes |
| Mann-Whitney U | Yes | Yes | Yes |
| Bootstrap CI | Yes | No | Yes |
| Sample size calculator | Yes | Manual | Yes |
| SRM check | Yes | Manual | No |
| Multiple testing correction | Yes | No | Yes |
| CUPED | Yes | Manual | No |
| CUPAC (ML variance reduction) | Yes | No | No |
| Bayesian A/B testing | Yes | No | No |
| Sequential testing (mSPRT) | Yes | No | No |
| Group sequential (OBF/Pocock) | Yes | No | No |
| E-values | Yes | No | No |
| Thompson Sampling bandits | Yes | No | No |
| Contextual bandits (LinUCB) | Yes | No | No |
| Difference-in-Differences | Yes | No | Yes |
| Synthetic Control | Yes | No | No |
| TMLE | Yes | No | No |
| Doubly Robust estimator | Yes | No | No |
| Causal Forest | Yes | No | No |
| Experiment registry | Yes | No | No |
| Guardrail monitoring | Yes | No | No |
| HTML reports | Yes | No | No |
| explain() (plain English) | Yes | No | No |
| Multilingual output | Yes | No | No |
| LaTeX export | Yes | No | Yes |
| Frozen dataclass results | Yes | No | No |
| .to_dict() / .to_json() | Yes | No | No |
| Zero DataFrame dependency | Yes | Yes | No |
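
To illustrate what a "Manual" entry involves, here is a rough sketch of a sample ratio mismatch (SRM) check for two variants, written as a chi-square goodness-of-fit test with only the standard library. The function name `srm_check` and the 0.001 alert threshold are assumptions for illustration; with scipy, `scipy.stats.chisquare` computes the same statistic and p-value.

```python
import math


def srm_check(observed, expected_ratio=(0.5, 0.5), threshold=0.001):
    """Chi-square goodness-of-fit test for a two-variant sample ratio mismatch."""
    total = sum(observed)
    expected = [total * r for r in expected_ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two variants give 1 degree of freedom, where the chi-square survival
    # function reduces to erfc(sqrt(chi2 / 2)).
    pvalue = math.erfc(math.sqrt(chi2 / 2))
    return chi2, pvalue, pvalue < threshold


chi2, pvalue, mismatch = srm_check([5200, 4800])
print(f"chi2={chi2:.1f}, p={pvalue:.2e}, mismatch={mismatch}")
```

The closed-form p-value only holds for two variants; with more arms, a general chi-square survival function is needed.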

Performance

All benchmarks were run on an Apple M-series machine with Python 3.12, numpy 1.26, and scipy 1.12.

| Operation | splita | scipy | statsmodels |
| --- | --- | --- | --- |
| Z-test (n=10,000/group) | ~0.3ms | ~0.2ms | ~0.5ms |
| T-test (n=10,000/group) | ~0.3ms | ~0.2ms | ~0.5ms |
| Power analysis (proportions) | ~0.1ms | ~0.1ms | ~0.2ms |
| CUPED (n=10,000/group) | ~0.5ms | ~0.4ms (manual) | N/A |
| Bootstrap CI (1000 resamples) | ~50ms | N/A | ~60ms |
| SRM check (2 variants) | ~0.1ms | ~0.1ms (manual) | N/A |
| Multiple correction (100 tests) | ~0.1ms | N/A | ~0.1ms |

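Key takeaway: for basic tests, splita adds roughly 0.1ms of overhead per call over raw scipy (for example, ~0.3ms versus ~0.2ms for a z-test at n=10,000 per group), which is negligible in absolute terms. In exchange it provides A/B-testing defaults, structured results, and 80+ additional features that would otherwise require substantial custom code on top of scipy or statsmodels.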
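
The harness used to produce these timings is not shown. One way per-call latencies of this kind could be measured is with the standard library's `timeit`, taking the best of several repeats. The sketch below is illustrative only: `manual_ztest` and `time_ms` are hypothetical helpers, and because the test is written in pure Python its timings will not match the numpy-based figures in the table.

```python
import math
import random
import timeit
from statistics import NormalDist, fmean

# Synthetic inputs sized like the benchmark rows (n=10,000 per group)
rng = random.Random(0)
ctrl = [rng.gauss(10.0, 2.0) for _ in range(10_000)]
trt = [rng.gauss(10.1, 2.0) for _ in range(10_000)]


def manual_ztest(a, b):
    """Two-sided p-value of an unpooled two-sample z-test (pure Python)."""
    mean_a, mean_b = fmean(a), fmean(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_b - mean_a) / math.sqrt(var_a / len(a) + var_b / len(b))
    return 2 * NormalDist().cdf(-abs(z))


def time_ms(func, *args, repeat=5, number=20):
    """Best-of-`repeat` average time per call, in milliseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1000


elapsed = time_ms(manual_ztest, ctrl, trt)
print(f"manual z-test: {elapsed:.3f} ms per call")
```

Taking the minimum over repeats reduces noise from other processes; the same `time_ms` helper could wrap a splita, scipy, or statsmodels call for a like-for-like comparison.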

Summary

| Metric | splita | scipy.stats | statsmodels |
| --- | --- | --- | --- |
| Lines for basic A/B test | 3 | 12 | 8 |
| Lines for power analysis | 2 | 8 | 4 |
| Lines for CUPED | 3 | 10 | 10+ |
| Total features | 88 classes | ~15 relevant | ~25 relevant |
| Result format | Frozen dataclass | Tuples/floats | Custom objects |
| Dependencies | numpy, scipy | numpy | numpy, scipy, pandas |
| Python version | 3.10+ | 3.9+ | 3.9+ |
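
For readers unfamiliar with the "Frozen dataclass" result format, the sketch below shows the general pattern: an immutable record with `to_dict()` and `to_json()` helpers. The class `ABResult` is hypothetical and is not splita's actual result class; only the field names are borrowed from the Task 1 snippet.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class ABResult:
    """Hypothetical immutable result record (not splita's real class)."""

    pvalue: float
    ci_lower: float
    ci_upper: float
    significant: bool

    def to_dict(self):
        return asdict(self)

    def to_json(self):
        return json.dumps(self.to_dict())


result = ABResult(pvalue=0.03, ci_lower=0.1, ci_upper=1.9, significant=True)
print(result.to_json())

# Frozen dataclasses reject attribute assignment after construction
try:
    result.pvalue = 0.5
except FrozenInstanceError:
    print("result is immutable")
```

Compared with a bare tuple, named and immutable fields make results harder to misread or overwrite by accident, and the dictionary form serializes directly for logging.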