Bayesian A/B Testing Without a Data Team
Beta-Binomial explained for builders: how to read results, choose confidence thresholds, and make decisions with small samples when the effect size is large enough. No statistics degree required.
TL;DR
- Bayesian A/B testing gives you a probability statement: "there is a 99.8% chance that treatment beats control." It is more intuitive than p-values and actionable at smaller sample sizes.
- The Beta-Binomial model is the right tool for binary conversion metrics (clicked / did not click). It requires only four numbers: n and k for each variant.
- A large effect size can make a small sample decisive. FeatBit's hero experiment had 60 per variant — below the 200 minimum — but the 3× conversion difference was robust enough to act on.
- Set your confidence threshold before the experiment. P ≥ 95% for standard decisions. P ≥ 99% for high-stakes or irreversible changes.
Bayesian vs Frequentist
Traditional A/B testing uses frequentist statistics: you collect data, compute a p-value, and reject the null hypothesis if p < 0.05. The p-value answers: "If there were no difference, how likely would we be to see data this extreme?" That framing is opaque to most builders.
Bayesian testing answers a more useful question: "Given the data we have, what is the probability that treatment is better than control?" That number — 72%, 95%, 99.8% — is directly interpretable. A product manager, an engineer, or a founder can act on it.
| | Frequentist | Bayesian |
|---|---|---|
| Output | p-value (probability of data given null) | P(treatment > control) |
| Interpretability | Requires statistical training | Directly interpretable |
| Early stopping | Inflates false positive rate | Naturally handles peeking |
| Small samples | Requires minimum sample for validity | Works with any n, with appropriate uncertainty |
| Decision framing | Reject or fail to reject null | CONTINUE / PAUSE / ROLLBACK / INCONCLUSIVE |
The Beta-Binomial Model
For binary conversion metrics — clicked or didn't click, signed up or didn't — the Beta-Binomial is the standard Bayesian model. Here is how it works in plain terms:
1. Collect four numbers: for each variant, you need n (total sessions exposed) and k (sessions that converted). For FeatBit's hero experiment: control n=62, k=6; treatment n=60, k=18.
2. Model each variant as a Beta distribution: Beta(k+1, n-k+1) gives a probability distribution over the true conversion rate. The peak is near the observed rate, and the width reflects uncertainty (narrower with more data).
3. Run Monte Carlo sampling: draw 100,000 samples from each distribution. Count the fraction of draws where treatment > control. That fraction is P(treatment wins).
4. Interpret: P(treatment wins) = 99.8% means that across all plausible true conversion rates consistent with the data, treatment beats control 998 times out of 1,000.
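The steps above fit in a few lines of Python. This is a minimal sketch using numpy (the function name is illustrative, not from a specific library):

```python
import numpy as np

def prob_treatment_wins(n_c, k_c, n_t, k_t, draws=100_000, seed=0):
    """Estimate P(treatment > control) for a binary conversion metric.

    Each variant's true rate gets a Beta(k+1, n-k+1) posterior;
    we sample both and count how often treatment comes out on top.
    """
    rng = np.random.default_rng(seed)
    control = rng.beta(k_c + 1, n_c - k_c + 1, draws)
    treatment = rng.beta(k_t + 1, n_t - k_t + 1, draws)
    return (treatment > control).mean()

# FeatBit's hero experiment: control n=62, k=6; treatment n=60, k=18
p = prob_treatment_wins(62, 6, 60, 18)
print(f"P(treatment wins) = {p:.1%}")
```

With these inputs the result lands around 99.8%, matching the headline number: the posteriors barely overlap despite only 60 sessions per variant.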
How to Read Results
A Bayesian analysis produces three numbers worth reading:
- P(treatment wins): probability that treatment has a higher true conversion rate than control. Your primary decision signal.
- Relative lift: how much higher the treatment rate is, relative to control. Indicative of magnitude, not precise at small samples.
- Conversion rates: the observed conversion rates for each variant. The baseline rate matters for setting future experiment targets.
Sample Size and Effect Size
Minimum sample size rules exist to ensure that a result is precise, not just directionally correct. But effect size and sample size trade off: a 3× conversion difference (like FeatBit's hero experiment) is far more robust at n=60 than a 3% improvement at the same n=60.
The practical rule: when P(treatment wins) is above 99% and the effect size is large (relative lift > 50%), the direction is reliable even below the minimum sample. Treat the exact magnitude as indicative. Do not use a small-sample result to justify a +210% revenue projection, but do use it to make a CONTINUE decision with confidence.
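The trade-off is easy to see by running the same Monte Carlo calculation on a large and a small effect at identical sample sizes. The numbers below are hypothetical, chosen only to illustrate the contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
DRAWS = 100_000

def p_win(n_c, k_c, n_t, k_t):
    """P(treatment > control) from Beta(k+1, n-k+1) posteriors."""
    c = rng.beta(k_c + 1, n_c - k_c + 1, DRAWS)
    t = rng.beta(k_t + 1, n_t - k_t + 1, DRAWS)
    return (t > c).mean()

# Large effect at small n: roughly 10% vs 30% conversion, 60 per arm
big = p_win(60, 6, 60, 18)    # decisive, well above 0.99

# Small effect at the same n: roughly 50% vs 53%
small = p_win(60, 30, 60, 32) # not decisive; more data needed

print(f"3x lift at n=60:  {big:.3f}")
print(f"+3pt lift at n=60: {small:.3f}")
```

The same n=60 yields a near-certain answer for the large effect and an INCONCLUSIVE one for the small effect, which is exactly why effect size, not sample size alone, determines whether a result is actionable.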
Confidence Thresholds
The threshold for action must be set before the experiment starts — not after you see the data. Two common thresholds:
- P ≥ 95%: UI changes, content experiments, feature rollouts with easy rollback. Acceptable for most product experiments where the downside is recoverable.
- P ≥ 99%: pricing changes, checkout flow changes, infrastructure modifications with costly rollback. Higher threshold for changes where being wrong is expensive.
FAQ
Can I use Bayesian testing for continuous metrics (like revenue per user)?
Yes, but the model changes. For continuous metrics, use a Normal-Normal or Gamma model instead of Beta-Binomial. The interpretation is the same — P(treatment > control) — but the calculation is different. For most product experiments, conversion metrics are simpler and more reliable than revenue metrics at small sample sizes.
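For intuition, here is a rough sketch of the continuous-metric case. It approximates each variant's posterior over the mean as Normal(sample mean, standard error) — a simplification of a full Normal-Normal model — and the revenue figures are hypothetical:

```python
import numpy as np

def prob_mean_higher(mean_c, se_c, mean_t, se_t, draws=100_000, seed=0):
    """Approximate P(treatment mean > control mean) for a
    continuous metric, treating each posterior as Normal
    centered on the sample mean with its standard error."""
    rng = np.random.default_rng(seed)
    c = rng.normal(mean_c, se_c, draws)
    t = rng.normal(mean_t, se_t, draws)
    return (t > c).mean()

# Hypothetical revenue-per-user: control $4.10 +/- 0.30,
# treatment $4.90 +/- 0.35 (mean +/- standard error)
p = prob_mean_higher(4.10, 0.30, 4.90, 0.35)
print(f"P(treatment wins) = {p:.1%}")
```

The interpretation is identical to the Beta-Binomial case; only the posterior family changes. Note that revenue distributions are often heavy-tailed, which is one reason they need much more data than conversion metrics.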
What does it mean if P(treatment wins) is 72%?
It means the data is consistent with treatment winning, but not decisively. 72% is not sufficient to act — it would be wrong 28% of the time. The right frame is INCONCLUSIVE: wait for more data, or consider whether the hypothesis is testing the right thing.
Why not just use a p-value like everyone else?
You can. But p-values are frequently misinterpreted. A p-value of 0.04 does not mean 'there is a 96% chance treatment is better' — it means 'if there were no effect, we would see data this extreme 4% of the time.' Bayesian P(treatment > control) is what most people think p-values mean. Use the interpretation that your team will actually act on correctly.