
Write the Hypothesis Before You Build

The single discipline that separates evidence-backed releases from ship-and-pray. A reusable template, worked examples, and the failure modes you will avoid by writing the hypothesis first.

7 min read · Updated March 2026

TL;DR

  • A hypothesis is a falsifiable causal claim: "We believe X will cause Y among Z because W." It has a success criterion written before the experiment starts.
  • Without a pre-written hypothesis, you interpret results post-hoc — seeing patterns that confirm what you already believed.
  • The template has five parts: change, metric, audience, causal reason, and success criteria. All five are required.
  • A refuted hypothesis is not a failure. It is a learning that changes what you build next.

Why Hypothesis First

When you ship without a pre-written hypothesis, you evaluate results with hindsight. If the metric goes up, the change was a success. If it goes down, external factors were to blame. If it stays flat, you "need more data." None of these conclusions are reliable — they are post-hoc rationalizations.

Writing the hypothesis before building forces two valuable constraints: (1) you must specify what success looks like before you see the data, and (2) you must articulate why you believe the change will work. The "why" is where the real learning lives — because when the hypothesis is refuted, you know which causal assumption was wrong.

In FeatBit's own hero experiment, writing the hypothesis forced the realization that the north-star metric could not be selector clicks (curiosity) — it had to be enterprise contact clicks (intent). That realization changed the measurement design before any code was written.

The Template

// Hypothesis template
We believe [change] will increase/decrease [metric] among [audience], because [causal reason]. We will know this worked if [success criteria].

Change

The specific UI, logic, or behavior change being tested. Not 'improve onboarding' — 'add a deployment-method selector below the hero CTAs.'

Metric

The one measurable outcome that determines success. Trackable, relevant to the business intent, and defined before the experiment starts.

Audience

The population being exposed to the change. Be specific: not 'users' but 'homepage visitors with self-hosting intent.'

Causal reason

Why you believe the change will move the metric. This is your testable assumption. When the hypothesis is refuted, this is what was wrong.

Success criteria

The quantitative threshold that triggers CONTINUE. Usually P(treatment wins) ≥ 95% together with a positive direction on the primary metric.
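The "all five parts are required" rule can be enforced mechanically: capture the template in a small data structure that refuses to construct an incomplete hypothesis. The Python sketch below is illustrative only — the class and method names are hypothetical, not part of any FeatBit API.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Hypothesis:
    """Five-part hypothesis; every field is required and must be non-empty."""
    change: str
    metric: str
    audience: str
    causal_reason: str
    success_criteria: str

    def __post_init__(self):
        # Refuse to build a hypothesis with a blank field.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"hypothesis field '{f.name}' must not be empty")

    def render(self) -> str:
        # Produce the sentence form of the template.
        return (
            f"We believe {self.change} will change {self.metric} "
            f"among {self.audience}, because {self.causal_reason}. "
            f"We will know this worked if {self.success_criteria}."
        )

# Example instance, using the hero-selector experiment from this article:
h = Hypothesis(
    change="adding a deployment-method selector below the hero CTAs",
    metric="enterprise_contact_click rate",
    audience="homepage visitors",
    causal_reason="the current hero does not signal production K8s/AWS readiness",
    success_criteria="P(treatment wins) >= 95%",
)
print(h.render())
```

Making the hypothesis a value object has a side benefit: it can be committed to the repo alongside the feature flag, so the success criterion is versioned before the experiment starts.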

Bad vs Good Hypotheses

Bad

"We think adding deploy buttons will help users understand our self-hosting options."

Why it fails: no metric, no audience, no causal reason, no success criteria.

Good

"We believe adding a deployment-method selector (Kubernetes / AWS / Docker Compose) below the hero CTAs will increase enterprise_contact_click rate among homepage visitors, because the current hero presents FeatBit as a cost-reduction tool without signaling production K8s and AWS readiness. We will know this worked if the treatment group shows P(treatment wins) ≥ 95%."

Bad

"We want to improve the onboarding experience."

Why it fails: not falsifiable, no change specified, no measurement.

Good

"We believe showing an interactive checklist on the first login session will increase day-7 retention among new signups from organic search, because first-session disorientation is the primary drop-off cause in our funnel analysis. We will know this worked if day-7 retention improves by ≥ 5 percentage points with P ≥ 95%."
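Both good examples pre-commit to P(treatment wins) ≥ 95%. For a conversion-style metric, that probability can be estimated with Beta posteriors and Monte Carlo sampling. This is a minimal sketch under an assumed uniform Beta(1, 1) prior, not FeatBit's actual analysis code:

```python
import random

def p_treatment_wins(conv_t, n_t, conv_c, n_c, draws=100_000, seed=0):
    """Estimate P(treatment conversion rate > control conversion rate).

    Assumes a Beta(1, 1) prior on each rate, so the posterior after
    observing `conv` conversions in `n` trials is Beta(1+conv, 1+n-conv).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        theta_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        theta_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        wins += theta_t > theta_c
    return wins / draws

# 60/1000 treatment conversions vs 40/1000 control:
p = p_treatment_wins(60, 1000, 40, 1000)
print(f"P(treatment wins) = {p:.3f}")
```

The point of computing a single probability is that it maps directly onto the pre-written success criterion: either `p` clears the threshold you committed to, or it does not.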

Audience Definition

The audience in a hypothesis is not always the same as the exposure population. "All homepage visitors" can be the exposure — but the hypothesis may be specifically about visitors with self-hosting intent. You cannot always segment by intent in real time, but specifying the intended audience clarifies what the experiment is really testing and what a positive result means.

A result that says "treatment beat control for all homepage visitors" is less useful than one that says "treatment beat control specifically among visitors who proceeded to pricing — consistent with the hypothesis that the change serves self-hosting-intent visitors."

Success Criteria

Success criteria prevent the most common failure mode: changing the threshold after seeing data. If you didn't write "P ≥ 95%" before the experiment, you will be tempted to declare success at P = 87% when the number looks good. Pre-committing to the threshold removes that temptation.

Typical success criteria

  • P(treatment wins) ≥ 95% on the primary metric
  • No statistically significant degradation in guardrail metrics
  • Minimum sample size reached (e.g. 200 per variant) before concluding
  • Observation window elapsed (prevents stopping too early)
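The four checks above can be combined into one pre-committed decision gate. In this sketch the 95% threshold and 200-per-variant minimum come from the criteria listed above; the 14-day observation window is an assumed example value, and the function name is hypothetical:

```python
def release_decision(p_win, n_t, n_c, days_elapsed, guardrails_ok,
                     p_threshold=0.95, min_per_variant=200, min_days=14):
    """Apply pre-committed success criteria, in order of precedence.

    p_win         -- estimated P(treatment wins) on the primary metric
    n_t, n_c      -- sample sizes for treatment and control
    days_elapsed  -- days the experiment has been running
    guardrails_ok -- True if no guardrail metric shows significant degradation
    """
    # Sample size and observation window must be satisfied before concluding.
    if min(n_t, n_c) < min_per_variant or days_elapsed < min_days:
        return "KEEP RUNNING"
    # A guardrail degradation overrides a winning primary metric.
    if not guardrails_ok:
        return "STOP"
    return "CONTINUE" if p_win >= p_threshold else "STOP"

print(release_decision(0.97, 500, 500, days_elapsed=21, guardrails_ok=True))
```

Because every threshold is a function argument with a default, the gate itself documents the pre-commitment: changing a threshold after seeing data would be visible in the diff.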

FAQ

What if we don't know the causal reason?

If you genuinely don't know why the change would work, that's a signal that the hypothesis needs more thought — not that you should skip the causal reason field. Write your best guess. It will be the most valuable part of the learning when the hypothesis is refuted.

Can a refuted hypothesis still be useful?

A refuted hypothesis is often more valuable than a confirmed one. It eliminates a causal assumption, preventing you from chasing the wrong strategy for months. The learning from a refuted hypothesis ('our audience doesn't care about this signal') is directly actionable for the next iteration.

What if the metric is hard to measure?

That is a measurement design problem, not a hypothesis problem. The hypothesis should still specify the ideal metric. If that metric is impossible to track, the measurement design step will either find a proxy metric or identify the instrumentation gap to fix before running the experiment.