Running an A/B Test¶
The Experiment class is the central entry point for frequentist A/B test analysis. It accepts raw data, infers the metric type, selects the appropriate statistical test, and returns a frozen ExperimentResult dataclass.
Basic usage¶
from splita import Experiment
import numpy as np
rng = np.random.default_rng(42)
ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)
result = Experiment(ctrl, trt).run()
The 6 test methods¶
splita supports 6 statistical tests. By default (method='auto'), it selects the best one based on your data.
1. Z-test (proportions)¶
Used automatically for binary (0/1) data. Tests the difference between two proportions.
ctrl = rng.binomial(1, 0.10, size=5000)
trt = rng.binomial(1, 0.115, size=5000)
result = Experiment(ctrl, trt, method='ztest').run()
print(result.method) # 'ztest'
print(result.effect_size) # Cohen's h
Note
Auto-detection selects the z-test when all values are 0 or 1.
2. Welch's t-test (continuous)¶
The default for continuous data. Uses Welch's correction for unequal variances (not Student's t-test).
ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)
result = Experiment(ctrl, trt, method='ttest').run()
print(result.method) # 'ttest'
print(result.effect_size) # Cohen's d
3. Mann-Whitney U (non-parametric)¶
Distribution-free test for when normality assumptions are violated. Tests whether one distribution stochastically dominates the other.
# Highly skewed data
ctrl = rng.exponential(10, size=500)
trt = rng.exponential(12, size=500)
result = Experiment(ctrl, trt, method='mannwhitney').run()
print(result.method) # 'mannwhitney'
4. Chi-square test (categorical)¶
Tests association between treatment assignment and a categorical outcome.
ctrl = rng.binomial(1, 0.10, size=5000)
trt = rng.binomial(1, 0.115, size=5000)
result = Experiment(ctrl, trt, method='chisquare').run()
print(result.method) # 'chisquare'
5. Delta method (ratio metrics)¶
For metrics defined as a ratio (e.g., revenue per session). Requires denominator arrays.
ctrl_num = rng.normal(50, 10, size=1000)
ctrl_den = rng.poisson(5, size=1000).astype(float) + 1
trt_num = rng.normal(55, 10, size=1000)
trt_den = rng.poisson(5, size=1000).astype(float) + 1
result = Experiment(
ctrl_num, trt_num,
metric='ratio',
method='delta',
control_denominator=ctrl_den,
treatment_denominator=trt_den,
).run()
print(result.method) # 'delta'
6. Bootstrap¶
Non-parametric resampling-based inference. Works for any metric type and makes no distributional assumptions.
result = Experiment(ctrl, trt, method='bootstrap', n_bootstrap=5000, random_state=42).run()
print(result.method) # 'bootstrap'
Tip
Bootstrap is slower but makes no assumptions about the data distribution. Use it when sample sizes are small or distributions are unusual.
When to use which¶
| Scenario | Recommended method |
|---|---|
| Binary conversion (0/1) | ztest (auto-selected) |
| Continuous metric, large sample | ttest (auto-selected) |
| Highly skewed or non-normal data | mannwhitney or bootstrap |
| Ratio metric (revenue/session) | delta |
| Small sample, any distribution | bootstrap |
| Categorical outcome | chisquare |
Configuration options¶
All configuration is keyword-only:
result = Experiment(
ctrl, trt,
metric='continuous', # 'auto', 'conversion', 'continuous', 'ratio'
method='ttest', # 'auto', 'ttest', 'ztest', 'mannwhitney', 'chisquare', 'delta', 'bootstrap'
alpha=0.05, # significance level
alternative='two-sided', # 'two-sided', 'greater', 'less'
).run()
Understanding the result¶
result.significant # bool -- is p < alpha?
result.pvalue # float -- the p-value
result.lift # float -- absolute difference (treatment - control)
result.relative_lift # str -- percentage lift ("6.00%")
result.ci # tuple -- confidence interval for the difference
result.effect_size # float -- standardized effect size
result.power # float -- post-hoc power estimate
result.control_mean # float
result.treatment_mean # float
result.metric # str -- detected metric type
result.method # str -- test used
result.to_dict() # dict -- JSON-serializable
One-sided tests¶
Test whether treatment is strictly better (or worse):
# Test: is treatment mean GREATER than control?
result = Experiment(ctrl, trt, alternative='greater').run()
# Test: is treatment mean LESS than control?
result = Experiment(ctrl, trt, alternative='less').run()
Multiple metrics¶
When testing multiple metrics in the same experiment, correct for multiple comparisons:
from splita import MultipleCorrection
results = [
Experiment(ctrl_conv, trt_conv).run(),
Experiment(ctrl_rev, trt_rev).run(),
Experiment(ctrl_engage, trt_engage).run(),
]
corrected = MultipleCorrection(
[r.pvalue for r in results],
labels=["conversion", "revenue", "engagement"],
).run()
print(corrected.rejected) # [True, False, False]
print(corrected.adjusted_pvalues) # Benjamini-Hochberg adjusted