Release Decision Engine/Measurement Design

What to Measure When You Can't Measure Everything

One north-star metric per experiment, defined before coding begins. How to navigate attribution gaps, design trackable events, and choose proxy metrics that connect to the business outcome.

8 min read · Updated March 2026

TL;DR

  • One north-star metric per experiment. Defined before the flag is enabled. Multiple primary metrics inflate false positive rates and lead to cherry-picking.
  • The attribution gap is the most common measurement failure: the variant is assigned on page A, but the conversion fires on page B. Solve it with sessionStorage bridging or a shared user identifier.
  • Guardrail metrics are not success criteria — they are harm detectors. Define them before the experiment to catch regressions you weren't looking for.
  • Closer to revenue beats further from revenue. Selector clicks are curiosity. Contact Sales clicks are intent. Choose the metric closest to the business outcome that you can reliably track.

One North-Star Metric

The most common measurement mistake is tracking everything and deciding based on whichever metric looks best. This is not measurement — it is post-hoc rationalization with extra steps.

One primary metric forces the team to agree, before the experiment, on what success looks like. That agreement is valuable beyond the statistics: it forces alignment on what the experiment is actually testing, and prevents debates after the data comes in.
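The inflation from multiple primary metrics is easy to quantify: with m independent metrics each tested at significance level alpha, the chance of at least one spurious "win" is 1 − (1 − alpha)^m. A minimal sketch (the helper name is illustrative, not part of any analytics library):

```javascript
// Familywise error rate: probability of at least one false positive
// when m independent metrics are each tested at level alpha.
function familywiseErrorRate(alpha, m) {
  return 1 - Math.pow(1 - alpha, m);
}

// At alpha = 0.05, one metric keeps the false positive rate at 5%,
// but five "primary" metrics push it to roughly 22.6% -- more than
// one in five experiments would show a spurious winner somewhere.
familywiseErrorRate(0.05, 1); // 0.05
familywiseErrorRate(0.05, 5); // ~0.226
```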

Criteria for a good north-star metric

  • Directly connected to the business outcome from CF-01 (Intent)
  • Trackable end-to-end — the event fires reliably and the variant is attributable
  • Sensitive enough to move within the observation window given expected traffic
  • Unambiguous — there is one definition of success and everyone agrees on it
  • Not gameable — it cannot be inflated by changes unrelated to the hypothesis

Guardrail Metrics

Guardrail metrics are not primary outcomes — they are harm detectors. They answer: "while we were optimizing for X, did we accidentally hurt Y?" Common guardrails include page load time, bounce rate, open-source signup rate, and customer support ticket volume.

Define guardrails before the experiment starts. If you discover them after seeing the data, you will be tempted to dismiss degradations as noise. Pre-committed guardrails have teeth.

Guardrail triggers PAUSE, not automatic ROLLBACK

A guardrail degradation means: stop expanding, investigate. It does not automatically mean rollback. The degradation may be noise at small sample sizes, or it may reveal a real trade-off worth understanding before deciding. PAUSE + investigate is the right default.
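The PAUSE-not-ROLLBACK default can be expressed as a simple check. This is a sketch with hypothetical names (`evaluateGuardrail`, the threshold parameter), not a FeatBit API, and it assumes a "lower is better" guardrail such as bounce rate:

```javascript
// Returns 'PAUSE' when the variant degrades a lower-is-better guardrail
// (e.g. bounce rate) beyond the pre-committed relative threshold.
// PAUSE means: stop expanding the rollout and investigate. The decision
// to roll back stays with a human, since the degradation may be noise
// at small samples or a trade-off worth understanding first.
function evaluateGuardrail(baselineRate, variantRate, maxRelativeDegradation) {
  const relativeChange = (variantRate - baselineRate) / baselineRate;
  return relativeChange > maxRelativeDegradation ? 'PAUSE' : 'CONTINUE';
}

// Bounce rate 40% -> 50% with a 10% pre-committed threshold: pause.
evaluateGuardrail(0.40, 0.50, 0.10); // 'PAUSE'
// Bounce rate 40% -> 41%: within tolerance, keep expanding.
evaluateGuardrail(0.40, 0.41, 0.10); // 'CONTINUE'
```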

The Attribution Gap

The attribution gap occurs when the variant assignment happens on one page and the conversion event fires on a different page. Without bridging the variant across the navigation, the analytics pipeline sees the conversion but cannot attribute it to a variant.

FeatBit's hero experiment had this exact problem: the deployment selector is on the homepage, but enterprise_contact_click fires on the pricing page. The solution was sessionStorage bridging.

sessionStorage bridging pattern

// On homepage load (variant assignment); flagValue comes from the flag evaluation
sessionStorage.setItem('hero_variant', String(flagValue))
// On pricing page, in the Contact Sales click handler
const variant = sessionStorage.getItem('hero_variant') // null if the visitor never saw the homepage
trackEvent('enterprise_contact_click', { variant })

sessionStorage persists across page navigations within the same tab, and clears when the tab is closed. This is the right scope for a single-session conversion funnel.
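In practice the pattern needs two guards: sessionStorage can throw (private browsing, storage quota), and a visitor can land on the pricing page directly without ever receiving an assignment. A hardened sketch, where `storeVariant` and `readVariant` are hypothetical helpers and `storage` is any object with `getItem`/`setItem` (pass `window.sessionStorage` in the browser):

```javascript
// Write the assigned variant; fail silently if storage is unavailable
// so the page itself keeps working.
function storeVariant(storage, flagValue) {
  try {
    storage.setItem('hero_variant', String(flagValue));
  } catch (e) {
    // e.g. quota exceeded in private browsing -- the conversion event
    // will simply carry 'unknown' instead of a variant.
  }
}

// Read the variant on the conversion page. Returns 'unknown' when the
// visitor landed here directly and was never assigned, so these
// conversions can be excluded from the per-variant analysis.
function readVariant(storage) {
  try {
    return storage.getItem('hero_variant') ?? 'unknown';
  } catch (e) {
    return 'unknown';
  }
}
```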

Event Design

Well-designed events are the foundation of reliable measurement. Four principles:

Name events by action, not by element

enterprise_contact_click is better than pricing_page_button_3_click. The name should survive a redesign that moves the button.

Include the variant as an event property

Every event tied to an experiment should carry properties: { variant: 'true' | 'false' }. This enables per-variant filtering in analytics without joining on a separate experiment assignment table.

Fire once per distinct user per session

Counting repeat clicks from the same user inflates k (conversion count) and makes the denominator hard to interpret. Track distinct users, not total clicks.
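The once-per-session rule can be enforced at the tracking call itself. A sketch under assumed names (`trackOncePerSession`, `sendFn`); the per-session scope comes from sessionStorage, or any `getItem`/`setItem` object in tests:

```javascript
// Fires the event at most once per session per event name, so repeat
// clicks from the same visitor do not inflate k (the conversion count).
// Returns true if the event was sent, false if it was deduplicated.
function trackOncePerSession(storage, eventName, props, sendFn) {
  const key = `tracked_${eventName}`;
  if (storage.getItem(key)) return false; // already counted this session
  storage.setItem(key, '1');
  sendFn(eventName, props); // your analytics client's track call
  return true;
}
```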

Verify instrumentation before the experiment starts

Run a 100% / 0% split for 24 hours and confirm events fire for both variants as expected. Finding an instrumentation bug after 14 days of data collection is very expensive.

Proxy Metrics

The ideal metric is often not directly trackable. Revenue per user requires a long observation window. Customer lifetime value requires months. In these cases, use a proxy metric — one that is reliably correlated with the ideal metric and measurable within the experiment window.

| Ideal metric | Good proxy | Why it works |
| --- | --- | --- |
| Paid self-hosting customers | enterprise_contact_click | Contact Sales is the funnel entry for enterprise evaluation |
| 30-day retention | Day-7 retention | Day-7 and Day-30 are highly correlated for SaaS products |
| Revenue per user | Plan upgrade click | Upgrade intent is a leading indicator of paid conversion |
| Feature adoption | Feature first-use within 7 days | First-use predicts ongoing usage better than activation alone |

FAQ

Can I add metrics after the experiment starts?

No — adding metrics mid-experiment inflates false positive rates. If you think of a new metric during the experiment, write it down as a candidate for the next iteration. Do not add it to the current analysis.

What if my north-star metric doesn't move?

A flat result on the primary metric is valid data. It means either (a) the hypothesis was wrong — the change did not affect the outcome, or (b) the metric was not sensitive enough to detect the effect at the current sample size. The learning should distinguish between these two cases.

Should I track the deployment selector click events separately?

Yes, as engagement signals — not as primary metrics. The events self_host_click_kubernetes, self_host_click_aws, and self_host_click_docker_compose are useful for understanding which deployment path visitors explore. But they measure curiosity, not enterprise evaluation intent. Keep them as secondary signals.