Running an A/B Test

The Experiment class is the central entry point for frequentist A/B test analysis. It accepts raw data, infers the metric type, selects the appropriate statistical test, and returns a frozen ExperimentResult dataclass.

Basic usage

from splita import Experiment
import numpy as np

rng = np.random.default_rng(42)
ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)

result = Experiment(ctrl, trt).run()

The six test methods

splita supports six statistical tests. By default (method='auto'), it selects an appropriate test based on the shape of your data.

1. Z-test (proportions)

Used automatically for binary (0/1) data. Tests the difference between two proportions.

ctrl = rng.binomial(1, 0.10, size=5000)
trt = rng.binomial(1, 0.115, size=5000)

result = Experiment(ctrl, trt, method='ztest').run()
print(result.method)       # 'ztest'
print(result.effect_size)  # Cohen's h

Note

Auto-detection selects the z-test when all values are 0 or 1.
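For intuition, the underlying two-proportion z-test and Cohen's h can be computed by hand with NumPy and SciPy. This is an illustrative sketch, not splita's actual implementation; the helper name two_proportion_ztest is hypothetical.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(ctrl, trt):
    """Pooled two-proportion z-test plus Cohen's h (illustrative sketch)."""
    p1, p2 = np.mean(ctrl), np.mean(trt)
    n1, n2 = len(ctrl), len(trt)
    # Pooled proportion under the null hypothesis p1 == p2
    p_pool = (np.sum(ctrl) + np.sum(trt)) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    pvalue = 2 * stats.norm.sf(abs(z))  # two-sided
    # Cohen's h: difference of arcsine-transformed proportions
    cohens_h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))
    return z, pvalue, cohens_h

rng = np.random.default_rng(42)
ctrl = rng.binomial(1, 0.10, size=5000)
trt = rng.binomial(1, 0.115, size=5000)
z, p, h = two_proportion_ztest(ctrl, trt)
```

The arcsine transform in Cohen's h stabilizes the variance of proportions, which makes effect sizes comparable across different baseline rates.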

2. Welch's t-test (continuous)

The default for continuous data. Uses Welch's correction for unequal variances (not Student's t-test).

ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)

result = Experiment(ctrl, trt, method='ttest').run()
print(result.method)       # 'ttest'
print(result.effect_size)  # Cohen's d
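The same statistic can be reproduced directly with SciPy, where equal_var=False applies Welch's correction. A sketch under the assumption that splita computes Cohen's d with a pooled standard deviation (not confirmed by the source):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)

# equal_var=False selects Welch's t-test (Welch-Satterthwaite df)
t_stat, pvalue = stats.ttest_ind(trt, ctrl, equal_var=False)

# Cohen's d with pooled standard deviation (illustrative)
pooled_sd = np.sqrt((ctrl.var(ddof=1) + trt.var(ddof=1)) / 2)
cohens_d = (trt.mean() - ctrl.mean()) / pooled_sd
```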

3. Mann-Whitney U (non-parametric)

Distribution-free test for when normality assumptions are violated. Tests whether one distribution stochastically dominates the other.

# Highly skewed data
ctrl = rng.exponential(10, size=500)
trt = rng.exponential(12, size=500)

result = Experiment(ctrl, trt, method='mannwhitney').run()
print(result.method)  # 'mannwhitney'
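The equivalent SciPy call is shown below, along with the rank-biserial correlation, a common effect size for this test. This is a sketch for intuition, not splita's internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ctrl = rng.exponential(10, size=500)
trt = rng.exponential(12, size=500)

u_stat, pvalue = stats.mannwhitneyu(trt, ctrl, alternative='two-sided')

# Rank-biserial correlation in [-1, 1]: 0 means no stochastic dominance
rank_biserial = 2 * u_stat / (len(trt) * len(ctrl)) - 1
```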

4. Chi-square test (categorical)

Tests association between treatment assignment and a categorical outcome.

ctrl = rng.binomial(1, 0.10, size=5000)
trt = rng.binomial(1, 0.115, size=5000)

result = Experiment(ctrl, trt, method='chisquare').run()
print(result.method)  # 'chisquare'
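Under the hood, a chi-square test on binary outcomes amounts to building a 2x2 contingency table. A sketch of the equivalent SciPy computation (not splita's code):

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
ctrl = rng.binomial(1, 0.10, size=5000)
trt = rng.binomial(1, 0.115, size=5000)

# 2x2 contingency table: rows = group, columns = outcome (0 / 1)
table = np.array([
    [np.sum(ctrl == 0), np.sum(ctrl == 1)],
    [np.sum(trt == 0), np.sum(trt == 1)],
])
chi2, pvalue, dof, expected = chi2_contingency(table)
```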

5. Delta method (ratio metrics)

For metrics defined as a ratio (e.g., revenue per session). Requires denominator arrays.

ctrl_num = rng.normal(50, 10, size=1000)
ctrl_den = rng.poisson(5, size=1000).astype(float) + 1
trt_num = rng.normal(55, 10, size=1000)
trt_den = rng.poisson(5, size=1000).astype(float) + 1

result = Experiment(
    ctrl_num, trt_num,
    metric='ratio',
    method='delta',
    control_denominator=ctrl_den,
    treatment_denominator=trt_den,
).run()
print(result.method)  # 'delta'
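The delta method approximates the variance of a ratio of means via a first-order Taylor expansion. A hand-rolled sketch for a single group (the helper name is hypothetical; splita's implementation may differ):

```python
import numpy as np

def ratio_mean_and_variance(num, den):
    """Delta-method variance of mean(num) / mean(den) (illustrative).

    Var(x_bar / y_bar) ~= (var_x - 2*R*cov_xy + R^2 * var_y) / (n * y_bar^2),
    where R = x_bar / y_bar.
    """
    n = len(num)
    x_bar, y_bar = num.mean(), den.mean()
    r = x_bar / y_bar
    cov = np.cov(num, den, ddof=1)  # 2x2 sample covariance matrix
    var = (cov[0, 0] - 2 * r * cov[0, 1] + r**2 * cov[1, 1]) / (n * y_bar**2)
    return r, var

rng = np.random.default_rng(42)
num = rng.normal(50, 10, size=1000)
den = rng.poisson(5, size=1000).astype(float) + 1
ratio, var = ratio_mean_and_variance(num, den)
se = np.sqrt(var)
```

The covariance term matters: numerator and denominator come from the same users, so treating them as independent would misstate the variance.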

6. Bootstrap

Non-parametric resampling-based inference. Works for any metric type and makes no distributional assumptions.

result = Experiment(ctrl, trt, method='bootstrap', n_bootstrap=5000, random_state=42).run()
print(result.method)  # 'bootstrap'

Tip

Bootstrap is slower but makes no assumptions about the data distribution. Use it when sample sizes are small or distributions are unusual.
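A percentile bootstrap for the difference in means can be sketched in a few lines of NumPy. This illustrates the resampling idea only; the helper name bootstrap_diff_ci is hypothetical and splita's implementation may differ (e.g., BCa intervals):

```python
import numpy as np

def bootstrap_diff_ci(ctrl, trt, n_bootstrap=5000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the difference in means (illustrative)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Resample each group with replacement at its original size
        c = rng.choice(ctrl, size=len(ctrl), replace=True)
        t = rng.choice(trt, size=len(trt), replace=True)
        diffs[i] = t.mean() - c.mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(0)
ctrl = rng.normal(25.0, 8.0, size=1000)
trt = rng.normal(26.5, 8.0, size=1000)
lo, hi = bootstrap_diff_ci(ctrl, trt)
```

If the interval excludes zero at level alpha, the difference is significant at that level.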

When to use which

Scenario                            Recommended method
Binary conversion (0/1)             ztest (auto-selected)
Continuous metric, large sample     ttest (auto-selected)
Highly skewed or non-normal data    mannwhitney or bootstrap
Ratio metric (revenue/session)      delta
Small sample, any distribution      bootstrap
Categorical outcome                 chisquare

Configuration options

All configuration is keyword-only:

result = Experiment(
    ctrl, trt,
    metric='continuous',       # 'auto', 'conversion', 'continuous', 'ratio'
    method='ttest',            # 'auto', 'ttest', 'ztest', 'mannwhitney', 'chisquare', 'delta', 'bootstrap'
    alpha=0.05,                # significance level
    alternative='two-sided',   # 'two-sided', 'greater', 'less'
).run()

Understanding the result

result.significant      # bool -- is p < alpha?
result.pvalue           # float -- the p-value
result.lift             # float -- absolute difference (treatment - control)
result.relative_lift    # str -- percentage lift ("6.00%")
result.ci               # tuple -- confidence interval for the difference
result.effect_size      # float -- standardized effect size
result.power            # float -- post-hoc power estimate
result.control_mean     # float
result.treatment_mean   # float
result.metric           # str -- detected metric type
result.method           # str -- test used
result.to_dict()        # dict -- JSON-serializable

One-sided tests

Test whether treatment is strictly better (or worse):

# Test: is treatment mean GREATER than control?
result = Experiment(ctrl, trt, alternative='greater').run()

# Test: is treatment mean LESS than control?
result = Experiment(ctrl, trt, alternative='less').run()

Multiple metrics

When testing multiple metrics in the same experiment, correct for multiple comparisons:

from splita import MultipleCorrection

results = [
    Experiment(ctrl_conv, trt_conv).run(),
    Experiment(ctrl_rev, trt_rev).run(),
    Experiment(ctrl_engage, trt_engage).run(),
]

corrected = MultipleCorrection(
    [r.pvalue for r in results],
    labels=["conversion", "revenue", "engagement"],
).run()
print(corrected.rejected)          # [True, False, False]
print(corrected.adjusted_pvalues)  # Benjamini-Hochberg adjusted
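For intuition, the Benjamini-Hochberg step-up adjustment can be sketched in plain NumPy. This is an illustrative reimplementation, not splita's code; statsmodels' multipletests(method='fdr_bh') computes the same adjustment:

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg FDR adjustment (step-up procedure, illustrative)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward, cap at 1
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.minimum(adjusted, 1.0)
    out = np.empty(m)
    out[order] = adjusted  # restore the original ordering
    return out, out <= alpha

adj, rejected = benjamini_hochberg([0.003, 0.04, 0.2])
```

Unlike Bonferroni, which controls the family-wise error rate, Benjamini-Hochberg controls the false discovery rate and so retains more power as the number of metrics grows.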