Core Concepts¶
What is A/B testing?¶
A/B testing (also called split testing or randomized controlled experiments) is the practice of randomly assigning users to two or more groups and measuring the difference in a metric of interest. One group sees the current experience (control, or A), and the other sees a variation (treatment, or B).
The goal is to determine whether the observed difference is real (a true treatment effect) or just noise (random variation).
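As a toy illustration (plain Python with simulated data, not splita's API), the following sketch randomly assigns users to two groups and asks whether the observed difference in conversion rate is distinguishable from noise using a two-proportion z-test:

```python
import math
import random

random.seed(0)

# Simulate an experiment with hypothetical true rates: control converts
# at 10%, treatment at 12%.
control = [1 if random.random() < 0.10 else 0 for _ in range(5000)]
treatment = [1 if random.random() < 0.12 else 0 for _ in range(5000)]

p_a = sum(control) / len(control)
p_b = sum(treatment) / len(treatment)

# Two-proportion z-test: pool the rate under the null of no difference.
p_pool = (sum(control) + sum(treatment)) / (len(control) + len(treatment))
se = math.sqrt(p_pool * (1 - p_pool) * (1 / len(control) + 1 / len(treatment)))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal CDF.
pvalue = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"lift={p_b - p_a:+.4f}, z={z:.2f}, p={pvalue:.4f}")
```

A small p-value suggests the difference is unlikely to be pure noise; a large one means the data are consistent with no real effect.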
Why splita?¶
Most A/B testing tools are SaaS platforms that require sending your data to a third party. splita is a Python library that runs entirely on your infrastructure:
- Correct by default. Auto-detection picks the right test for your data. Welch's t-test (not Student's), unpooled standard errors, Cohen's h for proportions.
- Informative errors. Every `ValueError` tells you what went wrong, what the bad value was, and how to fix it.
- Composable. Pipe `OutlierHandler` into `CUPED` into `Experiment` -- each step is a clean function of its inputs.
- No opinions on your data stack. splita takes arrays and returns dataclasses. NumPy in, dataclass out.
Key terminology¶
Metric types¶
| Term | Description | Example | splita class |
|---|---|---|---|
| Conversion | Binary outcome (0 or 1) | Did the user purchase? | Experiment(metric='conversion') |
| Continuous | Real-valued outcome | Revenue per user, session duration | Experiment(metric='continuous') |
| Ratio | Numerator / denominator metric | Revenue per session | Experiment(metric='ratio') |
Statistical concepts¶
| Term | Description |
|---|---|
| p-value | Probability of seeing a result this extreme if there were no real effect. Lower = more evidence against the null. |
| Alpha | Your threshold for declaring significance. Typically 0.05 (5% false positive rate). |
| Power | Probability of detecting a real effect when it exists. Typically 0.80 (80%). |
| MDE | Minimum Detectable Effect -- the smallest effect size you want to be able to detect. |
| Confidence interval | Range of plausible values for the true effect. A 95% CI means: if you repeated the experiment many times, 95% of the intervals would contain the true effect. |
| Effect size | Standardized measure of the difference (Cohen's d for means, Cohen's h for proportions). |
| SRM | Sample Ratio Mismatch -- when the actual split ratio differs from expected, indicating a data quality problem. |
| CUPED | Controlled-experiment Using Pre-Experiment Data -- a variance reduction technique. |
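Two of these quantities are easy to compute by hand. Here is a minimal plain-Python sketch (splita computes these internally; the function names below are illustrative, not splita's API):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: standardized effect size for two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def srm_chi2(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square statistic for a sample ratio mismatch check against
    an expected control share (default 50/50)."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total * (1 - expected_share_a)
    return (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b

print(cohens_h(0.12, 0.10))   # small positive effect
print(srm_chi2(5000, 5000))   # 0.0: perfectly balanced split
```

A large chi-square statistic (equivalently, a tiny SRM p-value) means the observed split is implausible under the intended ratio, so the data should be investigated before any analysis.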
Frequentist vs Bayesian¶
splita supports both paradigms:
| | Frequentist | Bayesian |
|---|---|---|
| Question answered | "Is the effect statistically significant?" | "What is the probability that B is better than A?" |
| Key output | p-value, confidence interval | P(B > A), expected loss, credible interval |
| splita class | `Experiment` | `BayesianExperiment` |
| When to use | Standard hypothesis testing, regulatory contexts | Decision-making under uncertainty, business contexts |
```python
# Frequentist
from splita import Experiment

result = Experiment(ctrl, trt).run()
print(result.significant)  # True/False

# Bayesian
from splita import BayesianExperiment

result = BayesianExperiment(ctrl, trt).run()
print(result.prob_treatment_better)  # 0.97
print(result.expected_loss)  # 0.001
```
Common result fields¶
splita result dataclasses share common field names, but some types use domain-specific naming to reflect their statistical meaning.
Standard fields (most result types)¶
| Field | Type | Meaning |
|---|---|---|
| `pvalue` | `float` | Fixed-horizon p-value from a frequentist test. |
| `significant` | `bool` | Whether `pvalue < alpha`. |
| `ci_lower` / `ci_upper` | `float` | Lower and upper bounds of the confidence interval. |
| `lift` | `float` | Absolute difference (treatment - control). |
| `alpha` | `float` | Significance level used. |
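The shape of these fields can be pictured as a dataclass. This is an illustrative sketch only, not splita's actual class definition:

```python
from dataclasses import dataclass

@dataclass
class FrequentistResultSketch:
    """Illustrative only -- mirrors the standard fields above,
    not splita's real result type."""
    pvalue: float
    significant: bool
    ci_lower: float
    ci_upper: float
    lift: float
    alpha: float

r = FrequentistResultSketch(
    pvalue=0.012, significant=True,
    ci_lower=0.004, ci_upper=0.031, lift=0.0175, alpha=0.05,
)
# The fields are linked by a simple invariant:
assert r.significant == (r.pvalue < r.alpha)
```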
Sequential testing fields (mSPRTState, mSPRTResult)¶
| Field | Standard equivalent | Why different |
|---|---|---|
| `always_valid_pvalue` | `pvalue` | An always-valid p-value, valid at any stopping time. A regular p-value assumes a fixed sample size, so reusing the name would be misleading. |
| `always_valid_ci_lower` / `always_valid_ci_upper` | `ci_lower` / `ci_upper` | Always-valid confidence intervals that remain valid under continuous monitoring. |
| `should_stop` | `significant` | In sequential testing, "should stop" is the actionable decision, not just "significant". |
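Why the separate name matters: if you compute a regular fixed-horizon p-value at every peek and stop at the first "significant" look, the false positive rate inflates well above alpha. A self-contained simulation of A/A experiments (plain Python, not splita's mSPRT implementation) shows the effect:

```python
import math
import random

random.seed(42)

def two_prop_pvalue(a, b):
    """Fixed-horizon two-proportion z-test p-value."""
    n_a, n_b = len(a), len(b)
    p_pool = (sum(a) + sum(b)) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) or 1e-12
    z = (sum(b) / n_b - sum(a) / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

ALPHA = 0.05
LOOKS = [500, 1000, 1500, 2000]  # peek after every 500 users per arm
n_sims = 300
fixed_hits = peek_hits = 0

for _ in range(n_sims):
    # A/A experiment: both arms share the same true 20% rate (no real effect),
    # so every "significant" result is a false positive.
    a = [1 if random.random() < 0.2 else 0 for _ in range(2000)]
    b = [1 if random.random() < 0.2 else 0 for _ in range(2000)]
    looks = [two_prop_pvalue(a[:n], b[:n]) for n in LOOKS]
    fixed_hits += looks[-1] < ALPHA             # decide only at the end
    peek_hits += any(p < ALPHA for p in looks)  # stop at first significant look

print(f"fixed-horizon FPR: {fixed_hits / n_sims:.3f}")
print(f"peeking FPR:       {peek_hits / n_sims:.3f}")
```

Always-valid p-values are constructed so that the peeking strategy still controls the false positive rate at alpha.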
Bayesian fields (BayesianResult)¶
| Field | Standard equivalent | Why different |
|---|---|---|
| `prob_b_beats_a` | (no direct equivalent) | Bayesian posterior probability, not a p-value. Values near 1.0 mean strong evidence for treatment. |
| `expected_loss_a` / `expected_loss_b` | (no direct equivalent) | Expected loss from choosing each variant. A decision-theoretic quantity with no frequentist analog. |
| `ci_lower` / `ci_upper` | same names | Credible interval bounds (Bayesian), not confidence intervals, despite sharing the field names. |
Multiple testing correction fields (CorrectionResult)¶
| Field | Standard equivalent | Why different |
|---|---|---|
| `rejected` | `significant` | After correction, "rejected" is the standard term for null hypotheses that were rejected. A metric can be "significant" in isolation but not "rejected" after correction. |
| `adjusted_pvalues` | `pvalue` | Corrected p-values, not raw p-values. |
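For intuition, here is what a step-down correction computes, sketched in plain Python using Holm's method (an illustrative implementation; splita's `CorrectionResult` may be produced by other methods as well):

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values (plain-Python sketch)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Scale by the number of hypotheses still in play, and enforce
        # monotonicity so a later p-value can't dip below an earlier one.
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

raw = [0.001, 0.02, 0.04, 0.30]
adjusted = holm_adjust(raw)
rejected = [p < 0.05 for p in adjusted]
print(adjusted)   # approximately [0.004, 0.06, 0.08, 0.30]
print(rejected)   # [True, False, False, False]
```

Note how the second metric is "significant" on its raw p-value (0.02 < 0.05) but not "rejected" after correction, which is exactly why the field names differ.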
Mapping guide¶
If you are writing generic code that processes different result types, here is a quick reference for finding the "p-value-like" and "significant-like" fields:
```python
# Example: get the p-value from any result type
def get_pvalue(result):
    if hasattr(result, 'pvalue'):
        return result.pvalue
    if hasattr(result, 'always_valid_pvalue'):
        return result.always_valid_pvalue
    if hasattr(result, 'logrank_pvalue'):
        return result.logrank_pvalue  # SurvivalResult
    if hasattr(result, 'interaction_pvalue'):
        return result.interaction_pvalue  # InteractionResult
    return None
```
The experimentation lifecycle¶
1. Plan -- Use `SampleSize` to determine how many users you need.
2. Check -- Use `SRMCheck` to verify data quality before analysis.
3. Reduce variance -- Use `CUPED` or `OutlierHandler` to improve sensitivity.
4. Analyze -- Use `Experiment` or `BayesianExperiment` to measure the effect.
5. Monitor -- Use `mSPRT` or `GroupSequential` for real-time decisions.
6. Explain -- Use `explain()` and `report()` to communicate results.
splita provides tools for every step.
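For the planning step, the standard normal-approximation formula for a two-proportion test gives a useful ballpark. This is a plain-Python sketch with an illustrative function name, not splita's `SampleSize` (which handles the details and other metric types):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.80):
    """Approximate users per group to detect an absolute lift of `mde`
    over a baseline conversion rate `p_base` (unpooled normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)           # desired power
    p_treat = p_base + mde
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Detecting a 2-point absolute lift on a 10% baseline takes roughly
# 4,000 users per arm at alpha = 0.05 and 80% power.
n = sample_size_per_group(p_base=0.10, mde=0.02)
print(n)
```

Note the quadratic dependence on the MDE: halving the effect you want to detect roughly quadruples the required sample size.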