Release Decision Harness: Democratizing Experimentation for the AI Era
AI coding agents have made writing features 10x faster.[7] What they have not changed is whether those features have market value. A company shipping twice as fast is not twice as successful — it is twice as exposed to the cost of building things that don't move revenue, adoption, or retention. Release decision is the discipline of answering that question before you continue investing. The harness makes that discipline simple enough for any development team to run — without a data scientist, a dedicated experimentation platform, or a context switch out of the IDE.
TL;DR
- Release decision is experimentation for the AI era — evaluating whether AI-accelerated code delivers real business value before you double down. If the feature doesn't move revenue, adoption, or retention, there's no reason to keep investing. The harness runs the full loop from measurable intent to confirmed (or refuted) hypothesis, inside the IDE, without a data team.
- featbit-release-decision is the hub skill — it reads context, identifies which control stage applies, and routes to the right satellite skill (intent-shaping, hypothesis-design, reversible-exposure-control, measurement-design, experiment-workspace, evidence-analysis, learning-capture).
- The shared state file (.featbit-release-decision/intent.md) persists the decision context across all eight stages — so the agent never loses the thread between sessions or tool calls.
- The harness is model-agnostic and IDE-agnostic: it works in VS Code Copilot, Claude Code, Cursor, or any agent that can load SKILL.md context files.
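As a purely illustrative sketch (the field names here are invented, not the harness's actual schema), the shared state file might look something like this:

```markdown
# .featbit-release-decision/intent.md  (illustrative sketch)

## Goal
Increase the rate at which homepage visitors start a free trial.

## Hypothesis
We believe a shorter signup form will increase the trial-start rate
for homepage visitors because fewer fields lower first-touch friction.

## Stage
CF-05 (measurement-design)

## Primary metric
trial_start_rate

## Guardrails
signup_error_rate, p95_page_load_ms

## Last learning
(none yet)
```

Because every stage writes its artifact into a named section, any agent session (or any human reviewer) can reconstruct where the loop stands just by reading the file.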
What Is an LLM Agent Harness?
The term harness comes from test engineering — a test harness is the infrastructure that wraps a component under test: drivers, stubs, fixtures. Alone the component cannot run. The harness provides the scaffolding that makes it runnable, observable, and controllable.
Coding agents face exactly this problem when asked to validate business value. An LLM is capable — it can reason about experiment design, draft rollout plans, and interpret statistical output. But capability without structure means the agent might skip the hypothesis stage when a user seems impatient, pick the metric that looks good rather than the one defined before the experiment, or let urgency substitute for evidence. Knowing whether a feature is worth continuing requires more than capability. It requires discipline, enforced at every stage.
The release decision harness provides that enforcement. It is not a UI, not a SaaS platform, and not a fixed workflow script. It is a set of SKILL.md files that encode the control framework — loaded as context by the agent, activated by user intent, and coordinated through shared state. Traditional experimentation platforms require a dedicated team, a product UI, and statistical expertise most developers don't have. The harness brings rigorous experimentation into the coding agent workflow — accessible to any team, without a data scientist or a context switch out of the IDE.[8]
Architecture: Hub + Satellites
The harness has one hub skill and seven satellite skills. The hub (featbit-release-decision) is always active — it reads the current decision state and determines which control lens applies. Satellite skills are activated by the hub when the user's message or workspace state matches a trigger.
| Skill | CF | Responsibility |
|---|---|---|
| featbit-release-decision | hub | Control routing, session state, philosophy enforcement |
| intent-shaping | CF-01 | Extract measurable goal; block tactic-first starts |
| hypothesis-design | CF-02 | Produce falsifiable 5-component hypothesis |
| reversible-exposure-control | CF-03/04 | Flag spec, targeting, rollout logic, handoff |
| measurement-design | CF-05 | One primary metric, 2–3 guardrails, event schema |
| experiment-workspace | CF-05+ | File-based experiment tracking and Bayesian analysis |
| evidence-analysis | CF-06/07 | Sufficiency check + CONTINUE/PAUSE/ROLLBACK framing |
| learning-capture | CF-08 | 5-component learning + next hypothesis seed |
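In spirit, the hub's routing amounts to finding the first missing artifact in the shared state. The toy model below is an assumption-laden sketch: the real hub is an LLM reading SKILL.md context, and the field names here are invented for illustration.

```python
# Toy model of hub routing: pair each state-file field with the satellite
# skill that produces it, and route to the first skill whose artifact is
# missing. (Field names are hypothetical; the real hub reasons in language.)
STAGE_ORDER = [
    ("goal", "intent-shaping"),
    ("hypothesis", "hypothesis-design"),
    ("flag_contract", "reversible-exposure-control"),
    ("primary_metric", "measurement-design"),
    ("experiment_data", "experiment-workspace"),
    ("decision", "evidence-analysis"),
    ("learning", "learning-capture"),
]

def route(state: dict) -> str:
    """Return the first satellite skill whose required artifact is missing."""
    for field, skill in STAGE_ORDER:
        if not state.get(field):
            return skill
    return "learning-capture"  # loop complete; seed the next hypothesis

print(route({"goal": "raise trial starts"}))  # → hypothesis-design
```

The lookup table captures the ordering guarantee; what it cannot capture is the ambiguity resolution the article returns to later, where a vague user message has to be mapped onto the right stage.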
Walking the Full Loop
Every experiment that moves through the harness touches all seven satellite skills in sequence. Each skill produces a concrete artifact written to the shared state. Here is what the harness does at each stage.
The loop enters at intent-shaping. Before any code is written or any flag created, the skill extracts the real desired outcome. 'We should add a better CTA' becomes 'increase the rate at which homepage visitors start a free trial by identifying and removing friction in the first touchpoint.' The skill refuses to proceed until the outcome is measurable and specific.
hypothesis-design forces a falsifiable statement before any implementation begins: 'We believe [change X] will [move metric Y] for [audience A] because [causal reason R].' All five components are required. A hypothesis missing the causal reason is just a guess — and a guess cannot tell you why the next iteration should be different.
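The completeness check this implies can be sketched mechanically. Note an assumption: the template above shows four bracketed slots, so the fifth component modeled here (the expected direction of the metric) is my guess at how the skill reaches five.

```python
# Hedged sketch of the completeness gate on a hypothesis.
# Component names are partly assumed (see lead-in); "direction" is a guess.
REQUIRED = ["change", "metric", "direction", "audience", "causal_reason"]

def missing_components(hypothesis: dict) -> list[str]:
    """List the required components that are absent or empty."""
    return [c for c in REQUIRED if not hypothesis.get(c)]

h = {
    "change": "shorter signup form",
    "metric": "trial-start rate",
    "direction": "increase",
    "audience": "homepage visitors",
    "causal_reason": "",  # without the mechanism, this is just a guess
}
print(missing_components(h))  # → ['causal_reason']
```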
reversible-exposure-control handles two control principles together: every change must be made reversible before it becomes visible, and exposure is a deliberate decision — not a deployment side effect. The skill produces a concrete flag contract (key, variants, targeting rules, rollout percentage, rollback triggers) that can be handed off to the team that owns flag operations.
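A handoff spec shaped after the fields the text lists might look like the following. Every concrete value here is invented for illustration; only the field categories (key, variants, targeting, rollout percentage, rollback triggers) come from the article.

```python
# Hypothetical flag contract, as the handoff artifact might be structured.
# Values are invented; the shape mirrors the fields named in the text.
flag_contract = {
    "key": "homepage-cta-trial",
    "variants": ["control", "short-form"],
    "targeting": {"rule": "all homepage visitors", "sticky": True},
    "rollout_percent": 10,  # exposure is a decision, not a deploy side effect
    "rollback_triggers": [
        "guardrail: signup error rate > 1%",
        "guardrail: p95 page load > 2s",
    ],
}

# Reversibility check: a contract with no rollback triggers is incomplete.
assert flag_contract["rollback_triggers"], "change must be reversible first"
```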
measurement-design enforces one north-star metric per experiment, defined before the flag is ever enabled. It also defines 2–3 guardrail metrics and the event schema required to collect them. If instrumentation does not exist for the desired metric, this skill halts the loop until it is built — preventing the most common form of experiment failure: shipping before you can measure.
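The "halt until instrumentation exists" rule can be made concrete with a small diff between the events a plan needs and the events the codebase already emits. Metric and event names below are invented examples, not the skill's actual output.

```python
# Illustrative measurement plan: one primary metric, guardrails, and the
# event schema each requires. (All names are hypothetical.)
measurement_plan = {
    "primary_metric": {
        "name": "trial_start_rate",
        "event": "trial_started",
        "denominator_event": "homepage_viewed",
    },
    "guardrails": [
        {"name": "signup_error_rate", "event": "signup_error"},
        {"name": "p95_page_load_ms", "event": "page_load"},
    ],
}

def instrumentation_gap(plan: dict, existing_events: list[str]) -> list[str]:
    """Events the plan needs that the codebase does not yet emit."""
    needed = {plan["primary_metric"]["event"],
              plan["primary_metric"]["denominator_event"]}
    needed |= {g["event"] for g in plan["guardrails"]}
    return sorted(needed - set(existing_events))

print(instrumentation_gap(measurement_plan, ["homepage_viewed", "page_load"]))
# → ['signup_error', 'trial_started']
```

A non-empty gap is exactly the condition under which the skill halts the loop: you cannot enable the flag before these events exist.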
experiment-workspace replaces what an online experiment dashboard does, using flat files that any team member can read, commit to git, and reason about. A Python script collects data from FeatBit's insight API and another runs Bayesian analysis. The experiment lives in .featbit-release-decision/experiments/<slug>/ — visible, auditable, and offline-first.
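For a conversion metric, the kind of Bayesian comparison such a script could run is a Beta-Bernoulli posterior per variant with a Monte Carlo estimate of P(win). This is a minimal sketch with invented numbers, not the harness's actual analysis script, which the article says reads data pulled from FeatBit's insight API.

```python
# Minimal Bayesian A/B sketch: Beta(1 + conversions, 1 + non-conversions)
# posteriors per variant, Monte Carlo estimate of P(treatment > control).
import random

def p_win(control, treatment, draws=20000, seed=42):
    """P(treatment conversion rate beats control) under Beta posteriors.

    control/treatment are (conversions, total_visitors) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    c_conv, c_n = control
    t_conv, t_n = treatment
    wins = 0
    for _ in range(draws):
        c = rng.betavariate(1 + c_conv, 1 + c_n - c_conv)
        t = rng.betavariate(1 + t_conv, 1 + t_n - t_conv)
        wins += t > c
    return wins / draws

# Invented data: control 120/2400 (5.0%), treatment 150/2380 (~6.3%).
print(p_win((120, 2400), (150, 2380)))
```

Because it is a flat script over flat files, the analysis is re-runnable by anyone who checks out the repo, which is the auditability property the workspace is designed around.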
evidence-analysis handles two control principles together: first check whether the data is sufficient to decide at all (simultaneous windows, adequate sample, clean measurement), then frame the outcome into exactly one of four categories — CONTINUE, PAUSE, ROLLBACK CANDIDATE, or INCONCLUSIVE. Urgency is never allowed to substitute for evidence.
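The two-step structure (sufficiency first, then framing) can be sketched as follows. The thresholds are invented policy knobs for illustration; the harness's actual cutoffs, if it prescribes any, are not stated in this article.

```python
# Hedged sketch of the four-way decision framing. Thresholds (0.95 / 0.05)
# are assumptions, not values the harness prescribes.
def frame_decision(p_win: float, guardrails_ok: bool,
                   sample_sufficient: bool) -> str:
    if not sample_sufficient:
        return "INCONCLUSIVE"        # not enough evidence to decide at all
    if not guardrails_ok:
        return "ROLLBACK CANDIDATE"  # a guardrail breach trumps the win metric
    if p_win >= 0.95:
        return "CONTINUE"
    if p_win <= 0.05:
        return "ROLLBACK CANDIDATE"
    return "PAUSE"                   # keep collecting; urgency is not evidence

print(frame_decision(0.97, True, True))  # → CONTINUE
print(frame_decision(0.60, True, True))  # → PAUSE
```

The point of forcing exactly one of four labels is that "it's probably fine, let's ship" is not a reachable output.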
learning-capture closes the loop. A cycle is not finished until five things are written: what changed, what happened (with numbers), confirmed or refuted, why it likely happened, and the next hypothesis. The learning is committed to intent.md so the next iteration starts from evidence — not from memory drift or gut feeling.
The control framework above is grounded in hypothesis-driven development [2], rigorous controlled experimentation methodology [1][4], and Bayesian inference applied to product decisions [5]. The business case for institutionalizing this discipline at scale is made in [3].
Harness vs. Agent: Why the Distinction Matters
“Agent” describes capability — a system that can perceive, plan, and act. “Harness” describes structure — the scaffold that keeps a capable system on track. The two are not mutually exclusive, but conflating them leads to the most common failure mode in AI tooling: shipping a capable agent with no structural constraints, then wondering why it skips the hard parts.
| Dimension | Plain agent | Agent + harness |
|---|---|---|
| State across turns | In-context only — lost on context reset | Persisted to intent.md — survives sessions |
| Stage enforcement | LLM may skip or collapse stages under pressure | Control gates block progression without artifacts |
| Skill scope | Single prompt handles everything | Each satellite owns one responsibility only |
| Auditability | Hard to reconstruct why a decision was made | Artifacts at each stage form a decision trail |
| Replaceability | Monolithic — swap one thing, break everything | Skills are independent — swap or upgrade individually |
The release decision harness is intentionally narrow. It does not try to automate the whole engineering workflow. It owns exactly one problem: keeping the release decision loop intellectually honest, from intent to learning, with a persistent evidence trail. That is the value of a harness — not capability breadth, but disciplined, auditable depth in a specific domain.
Why This Needed LLMs to Exist
The release decision loop is not a new idea. Product teams have understood hypothesis-driven development, progressive rollouts, and evidence-based decisions for years. What prevented a harness from existing was the last-mile UX problem: the work of running hypothesis design, writing measurement plans, and interpreting Bayesian output is inherently linguistic and contextual — it cannot be reduced to form fields.
A form asking 'what is your hypothesis?' produces a filled field. An LLM-powered skill asks why the mechanism is causal, flags missing components, and refuses to proceed until the claim is falsifiable. That difference is not a UX improvement — it is a qualitatively different capability.
The hub skill reads natural language context — a sentence like 'I think we should start rolling this out' — and correctly identifies that CF-04 applies, not CF-07. Rules-based routing cannot handle that ambiguity. LLM reasoning can.
Evidence analysis produces a P(win) number and a risk value. Converting that into a structured business decision — CONTINUE, PAUSE, ROLLBACK CANDIDATE — requires understanding the hypothesis, the primary metric, and the guardrails in context. That translation is what the LLM does inside the harness.
intent.md is a natural language document with structured fields. The LLM reads it, updates the right fields, and maintains coherence across an experiment that might span two weeks of intermittent sessions. A traditional workflow tool would require a database and a UI. The harness needs a file and an LLM.
Traditional experimentation platforms — Optimizely, Statsig, Amplitude — are data platforms with a web UI.[3] They require a PM to own the workflow, a data engineer to connect the warehouse, and an analyst to interpret results. The harness moves that entire loop into the coding agent, where the developer already is, without requiring a second product or a dedicated team.[4][7]
FAQ
Is the harness specific to FeatBit feature flags?
The control framework (CF-01 through CF-08) is not FeatBit-specific. The reversible-exposure-control and experiment-workspace skills have FeatBit adapters — CLI commands, REST API calls, SDK examples — but the harness will produce a valid handoff spec regardless of which flag system you use. FeatBit is the recommended control plane, not a hard dependency.
How does the harness handle a user who skips a stage?
The hub skill detects the missing artifact in intent.md and routes back. If a user says 'let's start rolling this out' but the hypothesis field is empty, the hub triggers hypothesis-design before allowing reversible-exposure-control to proceed. This is the core function of the harness — it enforces the loop structure without requiring the user to manually track where they are.
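The gate described above can be reduced to a prerequisite check. As elsewhere in this article, the skill and field names below are illustrative assumptions about how the state file is keyed, not the harness's literal schema.

```python
# Sketch of the stage gate: a requested skill is blocked until every
# prerequisite artifact exists in the shared state. (Names are assumed.)
PREREQS = {
    "reversible-exposure-control": ["goal", "hypothesis"],
    "evidence-analysis": ["goal", "hypothesis", "primary_metric"],
}

def gate(requested_skill: str, state: dict) -> str:
    missing = [f for f in PREREQS.get(requested_skill, []) if not state.get(f)]
    if missing:
        return f"blocked: complete {missing[0]} first"
    return f"proceed: {requested_skill}"

# User asks to roll out, but only the goal has been captured so far:
print(gate("reversible-exposure-control", {"goal": "raise trial starts"}))
# → blocked: complete hypothesis first
```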
What happens to context between coding sessions?
intent.md holds the current decision state: goal, hypothesis, change, stage, primary metric, guardrails, and the last learning. When the agent is reloaded in a new session, reading this file restores the full context. No data is lost between sessions — the harness does not depend on in-context memory.
Which LLMs and coding agents are supported?
The harness is SKILL.md-based and model-agnostic. It works in VS Code GitHub Copilot, Claude Code, Cursor, and any coding agent that supports loading context files as skills. The quality of routing improves with stronger reasoning models, but the structure works across all major frontier models.
Why use a file-based experiment workspace instead of an online dashboard?
Online dashboards require accounts, browser access, and a platform that owns your data. The file-based workspace keeps experiments in the repository — reviewable in a PR, auditable in git history, and accessible offline. The tradeoff is that you run Python scripts instead of clicking a UI. For developer teams already living in the terminal, that is a feature.