Bayesian A/B Testing Without a Data Team
Beta-Binomial explained for builders: how to read results, choose confidence thresholds, and make decisions with small samples when the effect size is large enough. No statistics degree required.
TL;DR
- Bayesian A/B testing gives you a probability statement: "there is a 99.8% chance that treatment beats control." It is more intuitive than p-values and actionable at smaller sample sizes.
- The Beta-Binomial model is the right tool for binary conversion metrics (clicked / did not click). It requires only four numbers: n and k for each variant.
- A large effect size can make a small sample decisive. FeatBit's hero experiment had 60 per variant — below the 200 minimum — but the 3× conversion difference was robust enough to act on.
- Set your confidence threshold before the experiment. P ≥ 95% for standard decisions. P ≥ 99% for high-stakes or irreversible changes.
Bayesian vs Frequentist
Traditional A/B testing uses frequentist statistics: you collect data, compute a p-value, and reject the null hypothesis if p < 0.05. The p-value answers: "If there were no difference, how likely would we be to see data this extreme?" That framing is opaque to most builders.
Bayesian testing answers a more useful question: "Given the data we have, what is the probability that treatment is better than control?" That number — 72%, 95%, 99.8% — is directly interpretable. A product manager, an engineer, or a founder can act on it.
| | Frequentist | Bayesian |
|---|---|---|
| Output | p-value (probability of data given null) | P(treatment > control) |
| Interpretability | Requires statistical training | Directly interpretable |
| Early stopping | Inflates false positive rate | Naturally handles peeking |
| Small samples | Requires minimum sample for validity | Works with any n, with appropriate uncertainty |
| Decision framing | Reject or fail to reject null | CONTINUE / PAUSE / ROLLBACK / INCONCLUSIVE |
The Beta-Binomial Model
For binary conversion metrics — clicked or didn't click, signed up or didn't — the Beta-Binomial is the standard Bayesian model. Here is how it works in plain terms:
1. Collect four numbers: for each variant, you need n (total sessions exposed) and k (sessions that converted). For FeatBit's hero experiment: control n=62, k=6; treatment n=60, k=18.
2. Model each variant as a Beta distribution: Beta(k+1, n-k+1) gives a probability distribution over the true conversion rate. The peak is near the observed rate, and the width reflects uncertainty (narrower with more data).
3. Run Monte Carlo sampling: draw 100,000 samples from each distribution. Count the fraction of draws where treatment > control. That fraction is P(treatment wins).
4. Interpret: P(treatment wins) = 99.8% means that across all plausible true conversion rates consistent with the data, treatment beats control 998 times out of 1,000.
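The steps above fit in a few lines of Python. This is a minimal sketch using numpy (the function name is illustrative, not from a specific library):

```python
import numpy as np

def prob_treatment_wins(n_c, k_c, n_t, k_t, draws=100_000, seed=0):
    """Estimate P(treatment > control) for a binary conversion metric.

    Each variant's true rate gets a Beta(k+1, n-k+1) posterior;
    we sample both and count how often treatment comes out on top.
    """
    rng = np.random.default_rng(seed)
    control = rng.beta(k_c + 1, n_c - k_c + 1, draws)
    treatment = rng.beta(k_t + 1, n_t - k_t + 1, draws)
    return (treatment > control).mean()

# FeatBit's hero experiment: control n=62, k=6; treatment n=60, k=18
p = prob_treatment_wins(62, 6, 60, 18)
print(f"P(treatment wins) = {p:.1%}")
```

With these inputs the result lands around 99.8%, matching the headline number: the posteriors barely overlap despite only 60 sessions per variant.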
How to Read Results
A Bayesian analysis produces three numbers worth reading:
- P(treatment wins): probability that treatment has a higher true conversion rate than control. Your primary decision signal.
- Relative lift: how much higher the treatment rate is, relative to control. Indicative of magnitude, not precise at small samples.
- Conversion rates: the observed conversion rates for each variant. The baseline rate matters for setting future experiment targets.
Sample Size and Effect Size
Minimum sample size rules exist to ensure that a result is precise, not just directionally correct. But effect size and sample size trade off: a 3× conversion difference (like FeatBit's hero experiment) is far more robust at n=60 than a 3% improvement at the same n=60.
The practical rule: when P(treatment wins) is above 99% and the effect size is large (relative lift > 50%), the direction is reliable even below the minimum sample. Treat the exact magnitude as indicative. Do not use a small-sample result to justify a +210% revenue projection, but do use it to make a CONTINUE decision with confidence.
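The trade-off is easy to see by running the same Monte Carlo calculation on a large and a small effect at identical sample sizes. The numbers below are hypothetical, chosen only to illustrate the contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
DRAWS = 100_000

def p_win(n_c, k_c, n_t, k_t):
    """P(treatment > control) from Beta(k+1, n-k+1) posteriors."""
    c = rng.beta(k_c + 1, n_c - k_c + 1, DRAWS)
    t = rng.beta(k_t + 1, n_t - k_t + 1, DRAWS)
    return (t > c).mean()

# Large effect at small n: roughly 10% vs 30% conversion, 60 per arm
big = p_win(60, 6, 60, 18)    # decisive, well above 0.99

# Small effect at the same n: roughly 50% vs 53%
small = p_win(60, 30, 60, 32) # not decisive; more data needed

print(f"3x lift at n=60:  {big:.3f}")
print(f"+3pt lift at n=60: {small:.3f}")
```

The same n=60 yields a near-certain answer for the large effect and an INCONCLUSIVE one for the small effect, which is exactly why effect size, not sample size alone, determines whether a result is actionable.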
Confidence Thresholds
The threshold for action must be set before the experiment starts — not after you see the data. Two common thresholds:
- P ≥ 95%: UI changes, content experiments, feature rollouts with easy rollback. Acceptable for most product experiments where the downside is recoverable.
- P ≥ 99%: pricing changes, checkout flow changes, infrastructure modifications with costly rollback. Higher threshold for changes where being wrong is expensive.
FAQ
Can I use Bayesian testing for continuous metrics (like revenue per user)?
Yes, but the model changes. For continuous metrics, use a Normal-Normal or Gamma model instead of Beta-Binomial. The interpretation is the same — P(treatment > control) — but the calculation is different. For most product experiments, conversion metrics are simpler and more reliable than revenue metrics at small sample sizes.
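For intuition, here is a rough sketch of the continuous-metric case. It approximates each variant's posterior over the mean as Normal(sample mean, standard error) — a simplification of a full Normal-Normal model — and the revenue figures are hypothetical:

```python
import numpy as np

def prob_mean_higher(mean_c, se_c, mean_t, se_t, draws=100_000, seed=0):
    """Approximate P(treatment mean > control mean) for a
    continuous metric, treating each posterior as Normal
    centered on the sample mean with its standard error."""
    rng = np.random.default_rng(seed)
    c = rng.normal(mean_c, se_c, draws)
    t = rng.normal(mean_t, se_t, draws)
    return (t > c).mean()

# Hypothetical revenue-per-user: control $4.10 +/- 0.30,
# treatment $4.90 +/- 0.35 (mean +/- standard error)
p = prob_mean_higher(4.10, 0.30, 4.90, 0.35)
print(f"P(treatment wins) = {p:.1%}")
```

The interpretation is identical to the Beta-Binomial case; only the posterior family changes. Note that revenue distributions are often heavy-tailed, which is one reason they need much more data than conversion metrics.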
What does it mean if P(treatment wins) is 72%?
It means the data is consistent with treatment winning, but not decisively. 72% is not sufficient to act — it would be wrong 28% of the time. The right frame is INCONCLUSIVE: wait for more data, or consider whether the hypothesis is testing the right thing.
Why not just use a p-value like everyone else?
You can. But p-values are frequently misinterpreted. A p-value of 0.04 does not mean 'there is a 96% chance treatment is better' — it means 'if there were no effect, we would see data this extreme 4% of the time.' Bayesian P(treatment > control) is what most people think p-values mean. Use the interpretation that your team will actually act on correctly.