Bandits

When the goal is to minimize regret (the reward lost by sending traffic to inferior variants) rather than just measure a difference, use bandit algorithms. They shift traffic toward the better-performing variants as data arrives, which reduces the cost of experimentation.

ThompsonSampler

Multi-armed Thompson Sampling for Bernoulli, Gaussian, or Poisson rewards.

from splita import ThompsonSampler
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.05, 0.07, 0.06]  # arm index 1 (rate 0.07) is best

ts = ThompsonSampler(n_arms=3, random_state=42)
for _ in range(1000):
    arm = ts.recommend()
    reward = rng.binomial(1, true_rates[arm])
    ts.update(arm, reward)

result = ts.result()
print(result.current_best_arm)  # expected: 1
print(result.prob_best)         # per-arm probability of being best, e.g. [~0.01, ~0.95, ~0.04]
print(result.should_stop)       # True once expected loss is below the threshold
print(result.total_reward)      # cumulative reward
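
To make the mechanics concrete, here is a minimal NumPy sketch of Beta-Bernoulli Thompson Sampling. It is not splita's implementation; the function name `thompson_bernoulli` and the uniform Beta(1, 1) prior are assumptions made for illustration. Each round draws one plausible rate per arm from its posterior and plays the arm with the largest draw.

```python
import numpy as np

def thompson_bernoulli(true_rates, n_rounds, seed=0):
    """Illustrative Beta-Bernoulli Thompson Sampling (assumed Beta(1, 1) priors)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_rates)
    successes = np.ones(n_arms)  # Beta alpha parameters, starting at the prior
    failures = np.ones(n_arms)   # Beta beta parameters, starting at the prior
    pulls = np.zeros(n_arms, dtype=int)

    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its posterior.
        sampled_rates = rng.beta(successes, failures)
        arm = int(np.argmax(sampled_rates))
        # Simulate the reward and update only the arm that was played.
        reward = rng.binomial(1, true_rates[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return successes, failures, pulls

successes, failures, pulls = thompson_bernoulli([0.05, 0.07, 0.06], n_rounds=1000)
print(pulls)                               # traffic allocated to each arm
print(successes / (successes + failures))  # posterior mean rate per arm
```

Arms whose posterior still overlaps the leader keep winning some draws, which is how exploration tapers off on its own as evidence accumulates.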

Gaussian rewards

ts = ThompsonSampler(n_arms=2, reward_type='gaussian', random_state=42)
for _ in range(500):
    arm = ts.recommend()
    reward = rng.normal([10, 12][arm], 2)
    ts.update(arm, reward)

result = ts.result()
print(result.current_best_arm)
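
For Gaussian rewards the same idea applies with a Normal posterior over each arm's mean. The sketch below assumes a known noise standard deviation and a zero-mean Normal prior with variance `prior_var`; both are illustrative choices, and `thompson_gaussian` is a hypothetical helper, not part of splita.

```python
import numpy as np

def thompson_gaussian(true_means, noise_sd, n_rounds, prior_var=100.0, seed=0):
    """Illustrative Gaussian Thompson Sampling with known noise and a N(0, prior_var) prior."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    reward_sums = np.zeros(n_arms)
    pulls = np.zeros(n_arms, dtype=int)

    def posterior():
        # Conjugate Normal update for a mean with known observation variance.
        precision = 1.0 / prior_var + pulls / noise_sd**2
        mean = (reward_sums / noise_sd**2) / precision
        return mean, 1.0 / precision

    for _ in range(n_rounds):
        mean, var = posterior()
        sampled_means = rng.normal(mean, np.sqrt(var))
        arm = int(np.argmax(sampled_means))
        reward_sums[arm] += rng.normal(true_means[arm], noise_sd)
        pulls[arm] += 1

    return posterior()[0], pulls

post_mean, pulls = thompson_gaussian([10, 12], noise_sd=2, n_rounds=500)
print(pulls)      # traffic allocated to each arm
print(post_mean)  # posterior mean reward per arm
```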

LinTS (Linear Thompson Sampling)

Contextual bandit that uses context features to personalize arm selection.

from splita import LinTS
import numpy as np

rng = np.random.default_rng(42)
n_arms = 3
d = 5  # context dimension

# True weight vectors per arm
true_weights = rng.normal(0, 1, (n_arms, d))

lints = LinTS(n_arms=n_arms, n_features=d, random_state=42)

for _ in range(2000):
    context = rng.normal(0, 1, d)
    arm = lints.recommend(context)
    reward = context @ true_weights[arm] + rng.normal(0, 0.1)
    lints.update(arm, context, reward)

result = lints.result()
print(result.current_best_arm)
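
The following NumPy sketch shows one common way to build linear Thompson Sampling: keep a ridge-regression posterior per arm, sample a weight vector from it, and play the arm with the highest sampled score. The class name `LinTSSketch`, the identity prior, and the `noise_scale` parameter are assumptions for illustration and may differ from what splita does internally.

```python
import numpy as np

class LinTSSketch:
    """Illustrative linear Thompson Sampling with one ridge posterior per arm."""

    def __init__(self, n_arms, n_features, noise_scale=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_scale = noise_scale
        # A is the regularized Gram matrix, b the reward-weighted context sum.
        self.A = np.stack([np.eye(n_features) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, n_features))

    def recommend(self, context):
        scores = []
        for A, b in zip(self.A, self.b):
            mean = np.linalg.solve(A, b)
            # Sample theta ~ N(mean, noise_scale^2 * A^-1) via the Cholesky factor of A.
            L = np.linalg.cholesky(A)
            z = self.rng.standard_normal(len(b))
            theta = mean + self.noise_scale * np.linalg.solve(L.T, z)
            scores.append(context @ theta)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context

rng = np.random.default_rng(42)
true_weights = rng.normal(0, 1, (3, 5))
model = LinTSSketch(n_arms=3, n_features=5, noise_scale=0.1)
for _ in range(500):
    context = rng.normal(0, 1, 5)
    arm = model.recommend(context)
    reward = context @ true_weights[arm] + rng.normal(0, 0.1)
    model.update(arm, context, reward)

# Posterior mean weights per arm, for comparison with true_weights.
estimated = np.stack([np.linalg.solve(A, b) for A, b in zip(model.A, model.b)])
print(np.abs(estimated - true_weights).max())
```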

LinUCB

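Upper Confidence Bound contextual bandit. Given the same history and context it always picks the same arm, so it tends to be more exploitative than LinTS, which samples from a posterior. In the standard LinUCB formulation, alpha scales the exploration bonus.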

from splita import LinUCB
import numpy as np

rng = np.random.default_rng(42)
n_arms = 3
d = 5

true_weights = rng.normal(0, 1, (n_arms, d))

linucb = LinUCB(n_arms=n_arms, n_features=d, alpha=1.0)

for _ in range(2000):
    context = rng.normal(0, 1, d)
    arm = linucb.recommend(context)
    reward = context @ true_weights[arm] + rng.normal(0, 0.1)
    linucb.update(arm, context, reward)

result = linucb.result()
print(result.current_best_arm)
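
As a sketch of the standard LinUCB scoring rule (not necessarily the exact form splita uses), each arm's score is its ridge-regression prediction plus `alpha` times a confidence width. The helper `linucb_scores` and the identity-matrix initialization are assumptions for illustration.

```python
import numpy as np

def linucb_scores(A, b, context, alpha=1.0):
    """Score each arm as predicted reward plus alpha times the confidence width."""
    scores = np.empty(len(A))
    for k, (A_k, b_k) in enumerate(zip(A, b)):
        theta_hat = np.linalg.solve(A_k, b_k)                    # ridge estimate
        width = np.sqrt(context @ np.linalg.solve(A_k, context)) # sqrt(x^T A^-1 x)
        scores[k] = context @ theta_hat + alpha * width
    return scores

n_arms, d = 3, 5
A = np.stack([np.eye(d) for _ in range(n_arms)])
b = np.zeros((n_arms, d))
context = np.ones(d)

scores_before = linucb_scores(A, b, context)

# Observe a zero reward for arm 0 in this context: its bonus shrinks.
A[0] += np.outer(context, context)
b[0] += 0.0 * context
scores_after = linucb_scores(A, b, context)

print(scores_before)
print(scores_after)
```

Arms that have been observed in similar contexts get a smaller bonus, so unexplored arms are favored until their estimates firm up.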

BayesianStopping

Evaluate stopping rules for bandit experiments:

from splita import BayesianStopping, ThompsonSampler
import numpy as np

rng = np.random.default_rng(42)

ts = ThompsonSampler(n_arms=2, random_state=42)
for _ in range(500):
    arm = ts.recommend()
    reward = rng.binomial(1, [0.10, 0.12][arm])
    ts.update(arm, reward)

stopping = BayesianStopping()
result = stopping.evaluate(ts)
print(result.should_stop)
print(result.prob_best)
print(result.expected_remaining_loss)
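
The quantities reported above can be approximated by Monte Carlo from Beta posteriors. This is a sketch under assumptions: the function `posterior_summary`, the Beta(1, 1) prior, the example counts, and the `0.001` threshold are all hypothetical and do not describe the rule BayesianStopping actually applies.

```python
import numpy as np

def posterior_summary(successes, failures, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(arm is best) and expected loss per arm."""
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes)
    failures = np.asarray(failures)
    n_arms = len(successes)
    # One column of posterior draws per arm, assuming Beta(1, 1) priors.
    draws = rng.beta(1 + successes, 1 + failures, size=(n_draws, n_arms))
    prob_best = np.bincount(draws.argmax(axis=1), minlength=n_arms) / n_draws
    # Expected loss of committing to arm k: E[max_j theta_j - theta_k].
    expected_loss = (draws.max(axis=1, keepdims=True) - draws).mean(axis=0)
    return prob_best, expected_loss

# Hypothetical counts: arm 0 converted 20/250, arm 1 converted 45/250.
prob_best, expected_loss = posterior_summary([20, 45], [230, 205])
best = int(np.argmax(prob_best))
should_stop = bool(expected_loss[best] < 0.001)  # hypothetical threshold
print(prob_best, expected_loss, should_stop)
```

Stopping on expected loss rather than on probability of being best alone avoids running long tests between arms that are practically equivalent.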

OfflineEvaluator

Evaluate a new policy using historical logged data (Inverse Propensity Scoring and Doubly Robust estimation).

from splita import OfflineEvaluator
import numpy as np

rng = np.random.default_rng(42)
n = 5000
n_arms = 3

# Historical data
contexts = rng.normal(0, 1, (n, 5))
actions = rng.integers(0, n_arms, n)
rewards = rng.binomial(1, 0.1, n).astype(float)
propensities = np.full(n, 1.0 / n_arms)

evaluator = OfflineEvaluator()
result = evaluator.evaluate(
    contexts=contexts,
    actions=actions,
    rewards=rewards,
    propensities=propensities,
    new_policy=lambda ctx: 1,  # always pick arm 1
)
print(result.ips_estimate)
print(result.dr_estimate)
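
For intuition, the Inverse Propensity Scoring estimate can be written in a few lines of NumPy. This sketch covers IPS only (not the Doubly Robust estimator), and `ips_estimate` is a hypothetical helper rather than part of splita. It reweights each logged reward by how much more or less likely the new policy was to take the logged action.

```python
import numpy as np

def ips_estimate(actions, rewards, propensities, target_actions):
    """IPS value estimate for a deterministic target policy."""
    # Weight is 1/propensity where the target policy agrees with the log, else 0.
    match = (actions == target_actions).astype(float)
    weights = match / propensities
    return float(np.mean(weights * rewards))

rng = np.random.default_rng(42)
n, n_arms = 5000, 3
actions = rng.integers(0, n_arms, n)             # uniformly random logging policy
rewards = rng.binomial(1, 0.1, n).astype(float)  # reward independent of the arm
propensities = np.full(n, 1.0 / n_arms)

target_actions = np.full(n, 1)  # new policy: always pick arm 1
estimate = ips_estimate(actions, rewards, propensities, target_actions)
print(estimate)  # should land near the true value of 0.1
```

Because only about a third of the logged rows match the target policy here, the estimate is unbiased but noisier than a plain average over all rows.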

A/B test vs bandit: when to use which

Scenario                                         Recommendation
Need a clean measurement of the effect           A/B test (Experiment)
Want to minimize regret during the test          Bandit (ThompsonSampler)
Regulatory or scientific context                 A/B test
Personalization (different best arm per user)    Contextual bandit (LinTS, LinUCB)
Short-lived promotions or campaigns              Bandit
Need to evaluate a policy offline                OfflineEvaluator