Release Decision Harness: Democratizing Experimentation for the AI Era

AI coding agents have made writing features 10x faster.[7] What they have not changed is whether those features have market value. A company shipping twice as fast is not twice as successful — it is twice as exposed to the cost of building things that don't move revenue, adoption, or retention. Release decision is the discipline of answering that question before you continue investing. The harness makes that discipline simple enough for any development team to run — without a data scientist, a dedicated experimentation platform, or a context switch out of the IDE.

9 min read · Updated March 2026

TL;DR

  • Release decision is experimentation for the AI era — evaluating whether AI-accelerated code delivers real business value before you double down. If the feature doesn't move revenue, adoption, or retention, there's no reason to keep investing. The harness runs the full loop from measurable intent to confirmed (or refuted) hypothesis, inside the IDE, without a data team.
  • featbit-release-decision is the hub skill — it reads context, identifies which control stage applies, and routes to the right satellite skill (intent-shaping, hypothesis-design, reversible-exposure-control, measurement-design, experiment-workspace, evidence-analysis, learning-capture).
  • The shared state file (.featbit-release-decision/intent.md) persists the decision context across all eight stages — so the agent never loses the thread between sessions or tool calls.
  • The harness is model-agnostic and IDE-agnostic: it works in VS Code Copilot, Claude Code, Cursor, or any agent that can load SKILL.md context files.

What Is an LLM Agent Harness?

The term harness comes from test engineering — a test harness is the infrastructure that wraps a component under test: drivers, stubs, fixtures. On its own, the component cannot run. The harness provides the scaffolding that makes it runnable, observable, and controllable.

Coding agents face exactly this problem when asked to validate business value. An LLM is capable — it can reason about experiment design, draft rollout plans, and interpret statistical output. But capability without structure means the agent might skip the hypothesis stage when a user seems impatient, pick the metric that looks good rather than the one defined before the experiment, or let urgency substitute for evidence. Knowing whether a feature is worth continuing requires more than capability. It requires discipline, enforced at every stage.

The release decision harness provides that enforcement through four structural properties:

  • Persistent state: a structured file that survives across turns, sessions, and context resets.
  • Trigger routing: logic that identifies what stage the user is in and activates the right skill.
  • Control gates: rules that prevent skipping stages — no implementation before a hypothesis, no decision before sufficient data.
  • Skill decomposition: domain-specific skills that own a single responsibility and can be loaded, swapped, or audited independently.

The release decision harness is not a UI, not a SaaS platform, and not a fixed workflow script. It is a set of SKILL.md files that encode the control framework — loaded as context by the agent, activated by user intent, and coordinated through shared state. Traditional experimentation platforms require a dedicated team, a product UI, and statistical expertise most developers don't have. The harness brings rigorous experimentation into the coding agent workflow — accessible to any team, without a data scientist or a context switch out of the IDE.[8]

Architecture: Hub + Satellites

The harness has one hub skill and seven satellite skills. The hub (featbit-release-decision) is always active — it reads the current decision state and determines which control lens applies. Satellite skills are activated by the hub when the user's message or workspace state matches a trigger.

// hub: control routing + CF-01 through CF-08
featbit-release-decision
├── intent-shaping (CF-01)
├── hypothesis-design (CF-02)
├── reversible-exposure-control (CF-03 / CF-04)
├── measurement-design (CF-05)
├── experiment-workspace (CF-05 after)
├── evidence-analysis (CF-06 / CF-07)
└── learning-capture (CF-08)
// shared state — persists across all skills and sessions
.featbit-release-decision/intent.md
| Skill | CF | Responsibility |
| --- | --- | --- |
| featbit-release-decision | hub | Control routing, session state, philosophy enforcement |
| intent-shaping | CF-01 | Extract measurable goal; block tactic-first starts |
| hypothesis-design | CF-02 | Produce falsifiable 5-component hypothesis |
| reversible-exposure-control | CF-03/04 | Flag spec, targeting, rollout logic, handoff |
| measurement-design | CF-05 | One primary metric, 2–3 guardrails, event schema |
| experiment-workspace | CF-05+ | File-based experiment tracking and Bayesian analysis |
| evidence-analysis | CF-06/07 | Sufficiency check + CONTINUE/PAUSE/ROLLBACK framing |
| learning-capture | CF-08 | 5-component learning + next hypothesis seed |
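
For illustration, here is a hedged sketch of what intent.md might contain mid-loop. The field names and values are illustrative; the hub skill defines the canonical schema.

```markdown
# Release Decision State

- Stage: CF-05 (measurement-design)
- Goal: increase the rate at which homepage visitors start a free trial
- Hypothesis: we believe a single above-the-fold CTA will raise trial starts
  for first-time visitors because competing links dilute the first touchpoint
- Flag: homepage-single-cta (10% rollout; rollback on guardrail regression)
- Primary metric: trial_start_rate
- Guardrails: bounce_rate, page_load_p95, signup_error_rate
- Last learning: (pending; experiment not yet concluded)
```

Because this is a plain file in the repository, any skill — or any human — can read and update it without a database or a UI.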

Walking the Full Loop

Every experiment that moves through the harness touches all seven satellite skills in sequence. Each skill produces a concrete artifact written to the shared state. Here is what the harness does at each stage.

CF-01 · intent-shaping
What business outcome are we actually trying to change?

The harness enters here. Before any code is written or flag created, intent-shaping extracts the real desired outcome. 'We should add a better CTA' becomes 'increase the rate at which homepage visitors start a free trial by identifying and removing friction in the first touchpoint.' The skill refuses to proceed until the outcome is measurable and specific.

Artifact: Measurable goal written to intent.md
CF-02 · hypothesis-design
What change will move that outcome, and why?

hypothesis-design forces a falsifiable statement before any implementation begins: 'We believe [change X] will [move metric Y] for [audience A] because [causal reason R].' All five components are required. A hypothesis missing the causal reason is just a guess — and a guess cannot tell you why the next iteration should be different.

Artifact: Falsifiable causal claim
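
The five-component requirement can be checked mechanically. A minimal Python sketch, assuming the hypothesis has already been parsed into named fields — the field names, including the fifth component `direction`, are illustrative; the real skill reasons over prose:

```python
# Check that a hypothesis carries all five required components.
# Field names are illustrative assumptions, not the skill's actual schema.
REQUIRED = ("change", "metric", "direction", "audience", "causal_reason")

def validate_hypothesis(hypothesis: dict) -> list[str]:
    """Return the missing components; an empty list means falsifiable in form."""
    return [c for c in REQUIRED if not hypothesis.get(c)]

guess = {
    "change": "single above-the-fold CTA",
    "metric": "trial_start_rate",
    "direction": "increase",
    "audience": "first-time homepage visitors",
    # no causal_reason: this is a guess, not a hypothesis
}
missing = validate_hypothesis(guess)  # ["causal_reason"] -> refuse to proceed
```

A missing causal reason is exactly the case the skill blocks on: without it, a refuted experiment teaches you nothing about the next iteration.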
CF-03 / CF-04 · reversible-exposure-control
How do we make this change reversible and who sees it first?

reversible-exposure-control handles two control principles together: every change must be made reversible before it becomes visible, and exposure is a deliberate decision — not a deployment side effect. The skill produces a concrete flag contract (key, variants, targeting rules, rollout percentage, rollback triggers) that can be handed off to the team that owns flag operations.

Artifact: Feature flag spec + rollout config
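
As an illustration of what such a handoff spec might contain, a minimal Python sketch — the field names are assumptions for this example, not FeatBit's actual configuration schema:

```python
# Illustrative flag contract for handoff to the team that owns flag operations.
# Keys and values are assumptions, not FeatBit's real API schema.
flag_contract = {
    "key": "homepage-single-cta",
    "variants": ["control", "single_cta"],
    "targeting": {"segment": "first_time_visitors"},
    "rollout_percent": 10,  # start small; expand exposure deliberately
    "rollback_triggers": [
        "signup_error_rate > 2%",
        "page_load_p95 > 1500ms",
    ],
}

def is_reversible(contract: dict) -> bool:
    """A change is reversible only if rollback triggers exist before exposure."""
    return bool(contract.get("rollback_triggers"))
```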
CF-05 · measurement-design
What is the one metric that decides success?

measurement-design enforces one north-star metric per experiment, defined before the flag is ever enabled. It also defines 2–3 guardrail metrics and the event schema required to collect them. If instrumentation does not exist for the desired metric, this skill halts the loop until it is built — preventing the most common form of experiment failure: shipping before you can measure.

Artifact: Primary metric + guardrails + event schema
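
A measurement plan in the same spirit might look like this — a hedged sketch with illustrative metric and event names:

```python
# Illustrative measurement plan: one primary metric, 2-3 guardrails, and the
# event schema needed to compute them. Names are assumptions for this example.
measurement_plan = {
    "primary_metric": "trial_start_rate",
    "guardrails": ["bounce_rate", "page_load_p95", "signup_error_rate"],
    "events": {
        "homepage_view": ["user_id", "timestamp", "variant"],
        "trial_start": ["user_id", "timestamp", "variant"],
    },
}

def can_measure(plan: dict) -> bool:
    """Halt the loop unless a primary metric and its events are both defined."""
    return bool(plan.get("primary_metric")) and bool(plan.get("events"))
```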
CF-05 (after) · experiment-workspace
How do we track the experiment as a shared, auditable artifact?

experiment-workspace replaces the online experiment dashboard with flat files that any team member can read, commit to git, and reason about. A Python script collects data from FeatBit's insight API and another runs Bayesian analysis. The experiment lives in .featbit-release-decision/experiments/<slug>/ — visible, auditable, and offline-first.

Artifact: Local experiment folder with definition, data, and analysis files
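
The Bayesian analysis itself can be small. A minimal sketch of a Beta-Binomial comparison that estimates P(treatment beats control) by Monte Carlo, using only the standard library — the uniform priors, counts, and function name are illustrative, not the harness's actual script:

```python
import random

def p_win(control_conv, control_n, treat_conv, treat_n, draws=20000, seed=42):
    """P(treatment rate > control rate) under Beta(1, 1) priors, by Monte Carlo."""
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate with a uniform prior: Beta(1+k, 1+n-k)
        c = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        t = rng.betavariate(1 + treat_conv, 1 + treat_n - treat_conv)
        wins += t > c
    return wins / draws

# 48/1000 conversions in control vs 63/1000 in treatment
print(round(p_win(48, 1000, 63, 1000), 2))
```

A script like this, committed next to the experiment's data files, is what makes the workspace auditable: anyone can rerun the analysis from the repository.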
CF-06 / CF-07 · evidence-analysis
Is the data sufficient? CONTINUE, PAUSE, or ROLLBACK?

evidence-analysis handles two control principles together: first check whether the data is sufficient to decide at all (simultaneous windows, adequate sample, clean measurement), then frame the outcome into exactly one of four categories — CONTINUE, PAUSE, ROLLBACK CANDIDATE, or INCONCLUSIVE. Urgency is never allowed to substitute for evidence.

Artifact: Structured decision record
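
The framing step can be sketched as a pure function over the analysis output. The thresholds below are illustrative assumptions, not values the harness prescribes:

```python
def frame_decision(p_win: float, guardrails_ok: bool, sufficient: bool) -> str:
    """Map evidence to exactly one of the four decision categories.

    Thresholds are illustrative; the real skill sets them per experiment.
    """
    if not sufficient:
        return "INCONCLUSIVE"        # CF-06: never decide on insufficient data
    if not guardrails_ok:
        return "ROLLBACK CANDIDATE"  # a guardrail regression trumps the primary metric
    if p_win >= 0.95:
        return "CONTINUE"
    if p_win <= 0.05:
        return "ROLLBACK CANDIDATE"
    return "PAUSE"                   # promising or ambiguous: gather more evidence
```

Making the function total — every input lands in exactly one category — is what keeps urgency from substituting for evidence.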
CF-08 · learning-capture
What did we learn, and what should we test next?

learning-capture closes the loop. A cycle is not finished until five things are written: what changed, what happened (with numbers), confirmed or refuted, why it likely happened, and the next hypothesis. The learning is committed to intent.md so the next iteration starts from evidence — not from memory drift or gut feeling.

Artifact: Learning artifact + next hypothesis seed

The control framework above is grounded in hypothesis-driven development [2], rigorous controlled experimentation methodology [1][4], and Bayesian inference applied to product decisions [5]. The business case for institutionalizing this discipline at scale is made in [3].

Harness vs. Agent: Why the Distinction Matters

“Agent” describes capability — a system that can perceive, plan, and act. “Harness” describes structure — the scaffold that keeps a capable system on track. The two are not mutually exclusive, but conflating them leads to the most common failure mode in AI tooling: shipping a capable agent with no structural constraints, then wondering why it skips the hard parts.

| Dimension | Plain agent | Agent + harness |
| --- | --- | --- |
| State across turns | In-context only — lost on context reset | Persisted to intent.md — survives sessions |
| Stage enforcement | LLM may skip or collapse stages under pressure | Control gates block progression without artifacts |
| Skill scope | Single prompt handles everything | Each satellite owns one responsibility only |
| Auditability | Hard to reconstruct why a decision was made | Artifacts at each stage form a decision trail |
| Replaceability | Monolithic — swap one thing, break everything | Skills are independent — swap or upgrade individually |

The release decision harness is intentionally narrow. It does not try to automate the whole engineering workflow. It owns exactly one problem: keeping the release decision loop intellectually honest, from intent to learning, with a persistent evidence trail. That is the value of a harness — not capability breadth, but disciplined, auditable depth in a specific domain.

Why This Needed LLMs to Exist

The release decision loop is not a new idea. Product teams have understood hypothesis-driven development, progressive rollouts, and evidence-based decisions for years. What prevented a harness from existing was the last-mile UX problem: the work of running hypothesis design, writing measurement plans, and interpreting Bayesian output is inherently linguistic and contextual — it cannot be reduced to form fields.

Reasoning over ambiguous context

A form asking 'what is your hypothesis?' produces a filled field. An LLM-powered skill asks why the mechanism is causal, flags missing components, and refuses to proceed until the claim is falsifiable. That difference is not a UX improvement — it is a qualitatively different capability.

Routing across heterogeneous skills

The hub skill reads natural language context — a sentence like 'I think we should start rolling this out' — and correctly identifies that CF-04 applies, not CF-07. Rules-based routing cannot handle that ambiguity. LLM reasoning can.

Translating evidence into language

Evidence analysis produces a P(win) number and a risk value. Converting that into a structured business decision — CONTINUE, PAUSE, ROLLBACK CANDIDATE — requires understanding the hypothesis, the primary metric, and the guardrails in context. That translation is what the LLM does inside the harness.

Persistent, structured memory

intent.md is a natural language document with structured fields. The LLM reads it, updates the right fields, and maintains coherence across an experiment that might span two weeks of intermittent sessions. A traditional workflow tool would require a database and a UI. The harness needs a file and an LLM.

Traditional experimentation platforms — Optimizely, Statsig, Amplitude — are data platforms with a web UI.[3] They require a PM to own the workflow, a data engineer to connect the warehouse, and an analyst to interpret results. The harness moves that entire loop into the coding agent, where the developer already is, without requiring a second product or a dedicated team.[4][7]

FAQ

Is the harness specific to FeatBit feature flags?

The control framework (CF-01 through CF-08) is not FeatBit-specific. The reversible-exposure-control and experiment-workspace skills have FeatBit adapters — CLI commands, REST API calls, SDK examples — but the harness will produce a valid handoff spec regardless of which flag system you use. FeatBit is the recommended control plane, not a hard dependency.

How does the harness handle a user who skips a stage?

The hub skill detects the missing artifact in intent.md and routes back. If a user says 'let's start rolling this out' but the hypothesis field is empty, the hub triggers hypothesis-design before allowing reversible-exposure-control to proceed. This is the core function of the harness — it enforces the loop structure without requiring the user to manually track where they are.
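
A minimal Python sketch of that gate logic — the stage order and artifact field names are illustrative, not the hub's actual schema:

```python
# Stage order and the intent.md artifact each stage requires.
# Names are illustrative assumptions for this sketch.
PIPELINE = [
    ("intent-shaping", "goal"),
    ("hypothesis-design", "hypothesis"),
    ("reversible-exposure-control", "flag_spec"),
    ("measurement-design", "primary_metric"),
]

def route(state: dict, requested: str) -> str:
    """Grant the requested skill only if every upstream artifact exists."""
    for skill, artifact in PIPELINE:
        if skill == requested:
            return requested
        if not state.get(artifact):
            return skill  # missing upstream artifact: route back to that stage
    return requested

# User asks to roll out, but no hypothesis is recorded yet:
route({"goal": "raise trial starts"}, "reversible-exposure-control")
# -> "hypothesis-design"
```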

What happens to context between coding sessions?

intent.md holds the current decision state: goal, hypothesis, change, stage, primary metric, guardrails, and the last learning. When the agent is reloaded in a new session, reading this file restores the full context. No data is lost between sessions — the harness does not depend on in-context memory.

Which LLMs and coding agents are supported?

The harness is SKILL.md-based and model-agnostic. It works in VS Code GitHub Copilot, Claude Code, Cursor, and any coding agent that supports loading context files as skills. The quality of routing improves with stronger reasoning models, but the structure works across all major frontier models.

Why use a file-based experiment workspace instead of an online dashboard?

Online dashboards require accounts, browser access, and a platform that owns your data. The file-based workspace keeps experiments in the repository — reviewable in a PR, auditable in git history, and accessible offline. The tradeoff is that you run Python scripts instead of clicking a UI. For developer teams already living in the terminal, that is a feature.

References

[1]Kohavi, R., Tang, D., Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
[2]Ries, E. (2011). The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Business.
[3]Thomke, S. H. (2020). Experimentation Works: The Surprising Power of Business Experiments. Harvard Business Review Press.
[4]Sweet, D. (2023). Experimentation for Engineers: From A/B Testing to Bayesian Methods. Manning Publications.
[5]Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
[6]Forsgren, N., Humble, J., Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press.
[7]McCain, M., Millar, T., Huang, S. et al. (2026). Measuring AI agent autonomy in practice. Anthropic. Feb 18, 2026.
[8]Rajasekaran, P. (2026). Harness design for long-running application development. Anthropic. Mar 24, 2026.