How to Set Up Online Evals for Live AI Outputs

June 7, 2026

Online evals are the production evaluation loop that watches AI outputs after real inputs reach the system. The practical job is not only to calculate a score. It is to turn live outputs into repeatable release evidence: sample the right traffic, score the output, join every score to the runtime variation that produced it, and define the action before a guardrail fails.

This tutorial is for AI product and platform teams that already have an AI behavior in production or near production. If you need the term definition first, start with what an online eval flag is. This article is the implementation companion: how to set up a continuous loop for live AI outputs.

Continuous online eval loop for live AI outputs showing flag evaluation, AI response, sampling, scoring, metric events, guardrail review, and release action

What The Online Eval Loop Must Prove

A useful online eval should prove five things:

Proof	Why it matters
The output came from a known runtime variation	Without attribution, a score cannot support a release decision.
The sampled traffic matches the decision scope	A support-chat eval should not silently mix internal tests, beta traffic, and high-risk accounts.
The scoring method fits the task	A deterministic assertion, model judge, human review, and product metric answer different questions.
The eval result joins to product and operational outcomes	Quality can improve while latency, cost, fallback rate, or escalation gets worse.
The team knows which action each result can trigger	Continue, pause, rollback candidate, expand, or clean up should be written before the dashboard is read.

That is the difference between monitoring and online evaluation. Monitoring tells you what happened. Online evaluation connects what happened to a controlled AI behavior and a release decision.

Step 1: Name The Live Behavior As A Variation

Start by naming the smallest runtime behavior that can be evaluated cleanly. For AI systems, this may be a prompt version, model route, retrieval profile, agent tool policy, guardrail setting, or a combined route.

Use a multivariate or structured flag when the route includes several fields:

{
  "flagKey": "support_assistant_answer_route",
  "variations": {
    "control": {
      "prompt": "support_prompt_v3",
      "modelRoute": "stable_model",
      "retrievalProfile": "baseline_search"
    },
    "candidate": {
      "prompt": "support_prompt_v4",
      "modelRoute": "candidate_model",
      "retrievalProfile": "reranker_v2"
    }
  }
}

The point is honesty. If the prompt and retrieval profile change together, call the variation a route. Do not later claim the online eval proved only the prompt won.

OpenFeature's evaluation context specification gives vendor-neutral language for the context used during flag evaluation, including targeting keys and custom fields. In practice, the same identity fields should appear in flag evaluation, exposure events, eval scores, product outcomes, and rollback reports.

Step 2: Choose The Sample Boundary

Online evals should not sample everything by default. Sampling too broadly can create noisy evidence, unnecessary storage, and avoidable privacy or governance review. Sampling too narrowly can hide the cases that matter.

Define the sample boundary before traffic starts:

Boundary	Example decision
Surface	English support chat, onboarding assistant, agent refund workflow
Assignment unit	account ID, user ID, conversation ID, workflow ID
Eligible traffic	production, paid accounts, beta segment, internal users
Exclusions	active incidents, regulated workflows, high-risk accounts, opt-out users
Sampling rate	100 percent during internal test, 10 percent during canary, lower rate after full rollout
Retention	store only fields needed for scoring, debugging, and release evidence

The sample boundary is a product and governance decision, not only a data pipeline setting. If the output may include sensitive customer text, define what is stored, masked, reviewed, or excluded before the eval runs.

Step 3: Emit Exposure When The AI Output Actually Runs

The exposure event should fire when the AI behavior actually produces an output, not when a page loads or a user becomes eligible. Eligibility is not exposure.

{
  "event": "ai_output_exposed",
  "flagKey": "support_assistant_answer_route",
  "unitId": "account_1842",
  "conversationId": "conv_772",
  "variation": "candidate",
  "surface": "support_chat",
  "sampledForOnlineEval": true,
  "timestamp": "2026-06-07T10:15:30Z"
}

Add the fields that will later explain the result, but avoid storing secrets, credentials, or more customer content than the eval requires. The minimum useful fields are the flag key, variation, assignment unit, output ID, surface, and timestamp.

FeatBit implementation primitives for this part include targeting rules, percentage rollouts, and the Track Insights API. The flag controls the route. The event keeps the route visible to the evidence loop.

Step 4: Score Outputs With The Right Evaluator

An online eval can use several scoring methods. The best choice depends on the task and risk.

Evaluator	Good fit	Watch out for
Deterministic checks	JSON validity, required fields, policy labels, citation presence	They catch format failures, not usefulness.
Model judge	Relevance, completeness, tone, groundedness, pairwise comparison	Calibrate against human review and known examples.
Human review	High-risk answers, domain judgment, policy-sensitive workflows	Slower and more expensive, so use sampling and triage.
Product outcome	task completion, escalation, conversion, accepted answer	Needs stable assignment and enough time to observe.
Operational metric	latency, cost, fallback rate, provider errors	A healthy system metric does not prove user value.

OpenAI's evaluation best practices are useful context for matching evaluation methods to the task, including model-grader and LLM-as-judge patterns. LaunchDarkly's AgentControl documentation describes online evaluations and judges as automated quality checks for production AI outputs. Statsig's AI Evals documentation similarly distinguishes online evals as production grading on real-world use cases.

Use those category signals carefully. A model judge can scale review, but it is still a measurement instrument. For important workflows, compare judge scores with human labels, keep examples of disagreement, and avoid letting the judge score become the only release decision.

Step 5: Join Eval Scores To Outcomes

The eval score should use the same join keys as the exposure event. Otherwise the team may know that a bad answer happened, but not which runtime variation produced it.

Online eval event contract showing shared keys across exposure, judge score, human review, product outcome, guardrail event, and release action

{
  "event": "ai_output_eval_score",
  "flagKey": "support_assistant_answer_route",
  "unitId": "account_1842",
  "conversationId": "conv_772",
  "outputId": "out_9231",
  "variation": "candidate",
  "judge": "support_answer_quality_v2",
  "scores": {
    "groundedness": 0.91,
    "completeness": 0.84,
    "tone": 0.95
  },
  "label": "pass",
  "timestamp": "2026-06-07T10:16:05Z"
}

Then join product outcomes and guardrails to the same variation:

{
  "event": "support_case_outcome",
  "flagKey": "support_assistant_answer_route",
  "unitId": "account_1842",
  "conversationId": "conv_772",
  "variation": "candidate",
  "resolvedWithoutEscalation": true,
  "humanCorrection": false,
  "latencyMs": 1820,
  "estimatedCostUsd": 0.011
}

Optimizely's Feature Experimentation documentation is useful category context here because it separates primary metrics from secondary and monitoring metrics. Use the same discipline for online evals: one metric or score should answer the release question, while guardrails decide whether expansion must stop.

Step 6: Convert Scores Into Release Actions

An online eval loop is incomplete until scores map to actions. Write the action rule before the rollout expands.

online_eval_decision_rule:
  release_question: should_support_assistant_candidate_expand
  primary_eval:
    metric: grounded_answer_pass_rate
    action_if_healthy: continue_or_expand
  guardrails:
    - metric: severe_human_review_failure
      action: rollback_candidate
    - metric: p95_latency
      action: pause_expansion
    - metric: fallback_rate
      action: investigate_before_expansion
    - metric: cost_per_resolved_case
      action: hold_for_business_review
  cleanup:
    winner: promote_route_and_remove_losing_branch
    loser: remove_candidate_route_after_postmortem

Avoid universal thresholds copied from another company. Use your baseline, risk tolerance, traffic volume, and customer impact. The reusable part is the decision shape: every score should have an owner, window, guardrail, and action.

FeatBit's measurement design guidance expands this release discipline. FeatBit's feature flag lifecycle management guidance helps keep temporary online eval controls from becoming stale branches after the decision.

Example: Support Assistant Online Eval

Imagine a team releasing a new support assistant route. The offline eval passed, but the team still needs to know whether live answers reduce escalation without increasing cost or complaints.

The tutorial setup would look like this:

Create a flag named support_assistant_answer_route with control and candidate variations.
Evaluate the flag by account ID so each account receives stable behavior during the decision window.
Start with internal support agents, then a small eligible customer segment.
Emit exposure only when the assistant returns an answer.
Sample candidate and control outputs for judge scoring and human review.
Track groundedness, completeness, human correction, escalation, latency, cost, fallback rate, and complaint signals.
Compare the primary outcome and guardrails before expanding.
Roll back to control if severe quality failure, missing telemetry, or guardrail breach appears.
Promote the winner and remove the losing route after the decision record is written.

Decision runbook for continuous online evals showing continue, pause, rollback candidate, expand, and cleanup states

This is a tutorial pattern, not a claim that every team should evaluate support assistants the same way. A coding agent, medical triage assistant, finance workflow, or internal search assistant needs different sampling rules, human review thresholds, and governance. The operating model stays the same: controlled variation, joinable evidence, guardrails, rollback, and cleanup.

Common Mistakes

Scoring outputs without variation attribution. If the eval result cannot identify the flag key and variation, it is weak release evidence.

Sampling only candidate traffic. Control outputs are often needed to know whether the candidate is truly better or whether the task became easier.

Letting the judge decide alone. Model judges are useful, but important release decisions should also consider human review, product outcomes, cost, latency, fallback, and segment risk.

Tracking exposure too early. Eligibility, page load, and AI output are different events. Track exposure when the AI behavior actually runs.

Changing the scorecard after seeing results. Decide the primary eval signal, guardrails, and action rules before rollout.

Keeping the eval flag forever. After the release decision, remove the losing branch or intentionally convert the flag into a long-lived operational control with an owner.

Setup Checklist

Before calling an online eval ready, confirm:

The flag variation names the live AI behavior being evaluated.
The assignment unit matches the user journey.
The sample boundary and exclusions are written down.
Exposure fires only when the AI output is produced.
Eval scores include flag key, variation, output ID, unit ID, and timestamp.
Human review or model-judge calibration exists for important quality labels.
Product outcomes and guardrails can join to exposure.
The primary eval signal and guardrails are defined before expansion.
Rollback returns traffic to the baseline without redeploying.
Cleanup rules exist for the winning and losing routes.

Bottom Line

Online evals are most valuable when they become a continuous release evidence loop, not a detached scoring job.

Use feature flags to control which live AI behavior runs. Emit exposure when that behavior produces an output. Score sampled outputs with the right evaluator. Join scores to product outcomes and guardrails. Then use the evidence to continue, pause, roll back, expand, or clean up.

For FeatBit teams, this is release-decision infrastructure applied to live AI outputs: targetable behavior, measurable evidence, reversible exposure, and a clear decision record.

Source Notes

Statsig's AI Evals overview is used as category context for online evals that grade model output in production on real-world use cases.
LaunchDarkly's online evaluations and judges documentation is used as category context for automated quality checks and judge-based scoring.
OpenAI's evaluation best practices are used for evaluation-method context, including model-grader and LLM-as-judge considerations.
OpenFeature's evaluation context specification is used for vendor-neutral language around targeting keys and context passed into flag evaluation.
Optimizely's Feature Experimentation metrics documentation is used for category context on primary, secondary, and monitoring metrics.
GrowthBook's feature flag product documentation is used as category context for feature flags, gradual rollout, and experiments. It is not used as a vendor ranking.
FeatBit implementation context: AI experimentation, measurement design, feature flag lifecycle management, targeting rules, percentage rollouts, Track Insights API, and flag insights support the release-control workflow described here.

Image And Open Graph Notes

Use cover.png as the Open Graph image because it summarizes online evals as a live-output evidence loop.
Use continuous-online-eval-loop.png near the opening because it shows the full workflow while keeping the guidance in crawlable Markdown.
Use online-eval-event-contract.png in the event schema section because it reinforces the join keys that make eval evidence trustworthy.
Use online-eval-decision-runbook.png in the example section because it connects eval results to release actions.

Keep reading on this topic

Experimentation

What Is an Online Eval Flag? A Practical Definition for AI Releases

A practical explainer for AI teams that need to evaluate live prompt, model, retrieval, or agent changes against quality and business outcomes.

Read article

Experimentation

How to Use an Online Eval Flag as a Release Switch

A practical tutorial for AI product and platform teams that need an online eval flag to control exposure, collect evidence, and roll back weak...

Read article

Experimentation

AI Evals: A Practical Guide to Evaluation Tools and Release Decisions

A product-team guide to understanding AI evals, comparing vendor terminology, and connecting offline scores to rollout, experiments, and release...

Read article