Walkthrough · Experimentation
Run an AB test
You want to test whether a UI change improves conversion before rolling it out to everyone.
- PM
- Growth
- Engineering
Skill cluster
The skills this walkthrough orchestrates.
Each skill in the catalog is a methodology unto itself. Walkthroughs show how multiple skills compose for a specific use case. Click a card to read the skill in detail.
Skill
experiment-design
Defines hypothesis, MDE, sample size, success criteria, decision rule.
Skill
feature-flagging
Configures variants, ramp plan, targeting rules, exclusion lists.
Skill
experimentation-platform-orchestrator
Picks the right platform for this test type given stack and maturity.
Skill
experimentation-analytics
Interprets results, validates assumptions, makes the ship/kill call.
Skill
product-analytics-setup
Provides the underlying event data the test depends on; validates instrumentation before launch.
Orchestration sequence
How the skills fire across 4 phases.
Each phase produces an artifact the next phase depends on. The sequence is what turns a high-level prompt into a shipped outcome.
- Phase 1
Setup
Translate the high-level prompt into a testable hypothesis. Verify that the analytics infrastructure can measure what the test claims to measure.
experiment-design
Drafts the experiment spec: hypothesis, MDE, sample size, guardrails, decision rule.
product-analytics-setup
Validates that the events the experiment depends on are correctly instrumented and firing.
- Phase 2
Configure
Pick the right experimentation platform for the test type. Configure variants, targeting, ramp plan, and exclusion rules.
experimentation-platform-orchestrator
Recommends the platform that fits the team's stack and the test's complexity.
feature-flagging
Implements the variant assignment, ramp plan, and rollback path.
- Phase 3
Run
The test runs while the team watches for guardrail breaches and sample-ratio mismatch (a minimal SRM check is sketched after this sequence). No analysis decisions until the planned end (or sequential-testing trigger).
experiment-design
Discipline holds: do not peek, do not cherry-pick segments, do not extend without explicit reason.
- Phase 4
Analyze
Apply the decision rule from the spec. Produce the decision artifact: ship, kill, or extend with reasoning.
experimentation-analytics
Validates statistical significance, segment consistency, guardrails. Produces ship/kill recommendation with reasoning and risks.
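Phase 3's sample-ratio-mismatch watch is mechanical enough to sketch. A minimal version, assuming a two-variant 50/50 split; real platforms run this continuously with their own thresholds:

```python
# Minimal sample-ratio-mismatch (SRM) check, assuming a two-variant 50/50
# split. This sketch just shows the shape of the test, not any vendor's
# implementation.
import math

def srm_detected(control_n: int, treatment_n: int,
                 control_share: float = 0.5, alpha: float = 0.001) -> bool:
    """True if observed assignment counts are implausible under the split."""
    total = control_n + treatment_n
    expected = total * control_share
    se = math.sqrt(total * control_share * (1 - control_share))
    z = (control_n - expected) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p < alpha  # SRM checks use a strict alpha by convention

# Counts from the dashboard mockup further down this page:
print(srm_detected(14238, 14201))  # False: no mismatch
```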
Artifacts at each stage
What the workflow produces, illustrated.
Each stage of the orchestration produces an artifact. The four shown below are illustrative versions of what an agent would hand off between stages. Real artifacts vary by team, platform, and test scope; these mockups capture the shape.
Phase 1 output
Hypothesis spec
The experiment-design skill produces a structured spec. The spec is the contract: it commits the hypothesis, the thresholds, and the decision rule before the test runs.
Experiment spec · v1
Checkout flow simplification
Drafted by experiment-design skill. Reviewed before launch.
Hypothesis
Reducing the checkout form from 3 steps to 1 step will increase conversion rate by at least 5% relative, with no significant regression in revenue per session.
Sample size calculation
- Baseline conversion: 4.2%
- Minimum detectable effect: 5% relative
- Significance level (alpha): 0.05
- Power: 0.80
- Required N per variant: 14,238
Success criteria
- Primary metric: conversion rate. Ship if treatment beats control by 5%+ at p<0.05.
- Guardrail: revenue per session. Kill if treatment underperforms control by 3%+.
- Guardrail: session duration. Kill if treatment cuts duration by 10%+.
- Decision rule: if inconclusive at end of planned run, extend by 7 days; if still inconclusive, do not ship.
Notes
Excludes mobile web due to a separate experiment running concurrently. Excludes B2B accounts (different funnel). Sequential testing is applied to allow early stopping on clear winners.
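The required-N line in the spec above comes from the standard two-proportion sample-size formula. A minimal sketch of that arithmetic; the exact figure moves with the formula variant and the sequential correction the notes call for, so the spec's numbers stay illustrative:

```python
# Two-proportion sample size via the normal approximation. A sketch of the
# textbook formula; platforms apply their own variants and corrections.
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(baseline: float, rel_mde: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    p1 = baseline                   # control conversion rate
    p2 = baseline * (1 + rel_mde)   # treatment rate at the MDE
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_b = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * pooled * (1 - pooled))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical inputs: 10% baseline, 10% relative MDE.
print(n_per_variant(0.10, 0.10))
```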
Phase 2 output
Variant configuration
The feature-flagging skill produces the runtime config: variants, targeting rules, ramp plan. The platform layer is abstracted; the discipline is the same regardless of vendor.
Feature flag · checkout-flow-experiment
Status: ACTIVE
Variants
- control · 50% · 3-step checkout: shipping → payment → review.
- treatment · 50% · Single-page checkout with progressive disclosure.
Targeting rules
- include: user.status == "logged_in"
- include: user.country in ["US", "CA", "UK"]
- exclude: user.is_bot == true
- exclude: user.account_type == "b2b"
Ramp plan
- T+0: Start at 5% of eligible traffic for a 24h smoke check.
- T+1d: Ramp to a 25/25 split if no critical issues surfaced.
- T+3d: Full 50/50 split. Begin sample-size accumulation.
- T+21d: Planned end. Run analysis; apply decision rule.
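Under the hood, variant assignment is typically a deterministic hash of flag key and user ID, so the same user always lands in the same variant. A minimal sketch, not any vendor's actual algorithm; the eligibility fields mirror the targeting rules above and the field names are assumptions:

```python
# Deterministic bucketing plus targeting: a generic sketch, not Statsig's,
# PostHog's, or any other vendor's actual assignment algorithm.
import hashlib

VARIANTS = [("control", 0.50), ("treatment", 0.50)]

def eligible(user: dict) -> bool:
    # Mirrors the targeting rules above; field names are hypothetical.
    return (user.get("status") == "logged_in"
            and user.get("country") in {"US", "CA", "UK"}
            and not user.get("is_bot", False)
            and user.get("account_type") != "b2b")

def assign(user_id: str, flag_key: str = "checkout-flow-experiment") -> str:
    # Hash flag key + user ID: stable across sessions, independent across flags.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    cumulative = 0.0
    for name, weight in VARIANTS:
        cumulative += weight
        if bucket <= cumulative:
            return name
    return VARIANTS[-1][0]

user = {"status": "logged_in", "country": "US", "account_type": "consumer"}
if eligible(user):
    print(assign("user-123"))  # same user, same variant, every time
```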
Phase 3-4 output
Live results dashboard
As the test runs, the results dashboard shows the trajectory. The experimentation-analytics skill watches for significance, segment consistency, and guardrail breaches.
Live results · day 21 of 21
Status: SIGNIFICANT
Control
4.2%
±0.18% (95% CI)
14,238 sessions, 598 conversions
Treatment
4.7%
±0.19% (95% CI)
14,201 sessions, 668 conversions
Relative lift
+12.0%
p-value
0.024
Segment breakdown
- Mobile · +15.4% · p=0.012
- Desktop · +9.8% · p=0.041
- Tablet · +11.2% · p=0.18
- Organic · +13.1% · p=0.019
- Paid · +10.4% · p=0.07
- Direct · +12.8% · p=0.05
Guardrails
- Revenue per session: +1.8% (within tolerance)
- Session duration: -2.1% (within tolerance)
Status: Statistically significant; meets MDE; no guardrail violations; segment effects consistent. Ready for decision.
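The core read behind a dashboard like this is a two-proportion z-test. A minimal pooled version below; the mockup's p-value reflects its own (sequential) method, so this plain test will not reproduce its figures exactly:

```python
# Pooled two-proportion z-test: the simplest version of the dashboard read.
# Sequential designs and variance-reduction layers change the p-value.
import math

def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return (p_b - p_a) / p_a, p_value           # relative lift, p

lift, p = two_proportion_test(598, 14238, 668, 14201)  # mockup figures
print(f"relative lift {lift:+.1%}, p = {p:.3f}")
```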
Phase 4 output
Decision artifact
The decision artifact is the workflow's output. Ship, kill, or extend, with reasoning, risks, and next steps. The decision is defensible because the spec committed the thresholds before the test.
Decision artifact · produced by experimentation-analytics
SHIP
Recommend rolling treatment to 100% over the next 7 days, then monitor for 14 days post-launch.
Reasoning
- Hit the pre-registered MDE of 5% relative; observed 12.0% lift at p=0.024.
- No guardrail violations: revenue per session and session duration both within tolerance.
- Segment effects consistent across device and traffic source; no major sub-segment underperforms.
- Sequential testing applied; result was stable across the final week of the run.
Risks identified
- Effect size is moderate; the absolute conversion delta is 0.5 percentage points. Monitor for novelty effect over the first 14 days post-launch.
- Mobile lift outsized vs desktop. May reflect mobile-specific friction in the 3-step control. Validate on the long tail.
- B2B accounts excluded from this test. Decision applies to consumer funnel only; B2B funnel needs its own experiment.
Next steps
1. Ramp treatment to 100% over 7 days. Engineering owns the ramp; PM monitors guardrails daily.
2. Post-launch monitoring for 14 days. Watch for novelty decay; rollback ready if guardrails breach.
3. Document the result in the experiment log. Share the retro with growth and engineering teams.
4. Spin up a B2B experiment as a follow-up; the consumer result does not generalize automatically.
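Because the spec committed thresholds before launch, the decision rule itself can be applied mechanically. A minimal sketch using the spec's thresholds; the guardrail metric names are hypothetical labels, not a platform API:

```python
# Applying the pre-registered decision rule. Thresholds mirror the spec;
# the guardrail keys are hypothetical labels, not a real platform's API.

def decide(rel_lift: float, p_value: float, guardrails: dict,
           mde: float = 0.05, alpha: float = 0.05) -> str:
    if guardrails.get("revenue_per_session", 0.0) <= -0.03:
        return "kill: revenue-per-session guardrail breached"
    if guardrails.get("session_duration", 0.0) <= -0.10:
        return "kill: session-duration guardrail breached"
    if p_value < alpha and rel_lift >= mde:
        return "ship"
    return "extend 7 days; if still inconclusive, do not ship"

print(decide(0.120, 0.024,
             {"revenue_per_session": 0.018, "session_duration": -0.021}))
# -> ship
```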
How a Claude agent runs this
The agent transcript, illustrative.
A Claude agent loaded with these skills can chain through the orchestration. The transcript below illustrates the flow.
User
Help me test whether reducing the checkout from 3 steps to 1 step improves conversion. We get about 30K eligible sessions per week.
Agent
Drafting an experiment spec. The traffic supports a 21-day run at the 5% relative MDE; I'll use experiment-design to surface the full sample-size math and guardrails.
Skill
experiment-design
Output
Hypothesis spec: 14,238 N per variant, 5% relative MDE, alpha 0.05, power 0.80. Primary metric conversion rate; guardrails on revenue-per-session and session duration. Decision rule documented.
User
Looks good. Set up the test.
Agent
Picking the platform first; your stack already has Statsig and the experiment fits its pattern. Then configuring the flag.
Skill
experimentation-platform-orchestrator
Output
Recommends Statsig given existing instrumentation and team familiarity. Lists tradeoffs vs PostHog and GrowthBook for context.
Agent
Configuring the feature flag with the ramp plan from the spec.
Skill
feature-flagging
Output
Flag checkout-flow-experiment created. 50/50 split; ramp 5% (24h) → 25/25 (day 1) → 50/50 (day 3). Targeting rules: logged-in users, US/CA/UK, exclude bots and B2B accounts.
User
Test wraps up tomorrow. What did we learn?
Agent
Pulling the results and applying the decision rule from the spec.
Skill
experimentation-analytics
Output
Ship recommendation. Treatment +12.0% relative lift at p=0.024; no guardrail violations; segment consistency holds. Risks: novelty effect possible; mobile lift outsized; B2B not covered. Next steps: ramp to 100%, monitor 14 days, retro.
Variations
Three tiers of the same workflow at different scales.
The full skill cluster fits a flagship version of the workflow. Most teams need lighter cuts more often. The three tiers below describe when each cut fits and which skills carry the work.
Tier 1
Multi-variant test
High-stakes test of a substantial change with 3+ variants and advanced segmentation analysis. Often a launch decision rests on the result.
Time / cost
3-6 weeks; full team + analytics support; full QA cycle
Skills involved
- experiment-design
- feature-flagging
- experimentation-platform-orchestrator
- experimentation-analytics
- product-analytics-setup
Output shape
Multi-variant analysis, segment cuts, decision artifact with risks, post-launch monitoring plan, retro doc.
Tier 2
Full launch experiment
Standard pre-launch validation for medium-stakes feature work. Ships with confidence rather than gut feel.
Time / cost
2-4 weeks; PM-led with engineering support
Skills involved
- experiment-design
- feature-flagging
- experimentation-analytics
- product-analytics-setup
Output shape
Hypothesis spec, variant config, results dashboard, ship/kill decision with reasoning.
Tier 3
Quick UI test
Low-stakes UI change validation: copy tweaks, button placement, simple layout shifts. Bounded blast radius.
Time / cost
1-2 weeks; lightweight setup
Skills involved
- experiment-design
- feature-flagging
- experimentation-analytics
Output shape
Short hypothesis statement, simple flag config, lift number with confidence interval, ship/kill call.
Frequently asked
Questions this walkthrough surfaces.
- How do I pick which experimentation platform to use?
- The experimentation-platform-orchestrator skill covers this in detail. Briefly: pick based on your stack (warehouse-native vs SaaS), your team's analytics maturity, and the test types you run most. Statsig and PostHog suit product-led teams with strong engineering; Optimizely and VWO suit marketing-led teams with simpler statistical needs; GrowthBook suits warehouse-native organizations. The orchestrator skill walks the decision rather than picking for you.
- What if my sample size is too small?
- Three options. First, expand the test scope (more pages, more user segments) if the change applies broadly enough. Second, run longer and accept slower learning velocity; some tests legitimately need 6-8 weeks rather than 2-3. Third, accept a larger MDE or lower statistical power, meaning only bigger effects will be detectable; the experiment-design skill walks the tradeoff. The failure mode to avoid: running the planned duration anyway with insufficient sample and treating the inconclusive result as a negative finding.
- How do I handle inconclusive results?
- Honest options: extend the run if power was the limiter and you can afford the time; accept the null result and do not ship if the MDE was meaningful and the data does not support it; ship with low confidence and post-launch monitoring if business pressure requires a decision and the risk is bounded. The failure mode to avoid: cherry-picking sub-segments where the result happened to look favorable. The experimentation-analytics skill covers the discipline.
- What's the difference between this walkthrough and the experiment-design skill alone?
- The experiment-design skill is the methodology for designing one experiment well: hypothesis, MDE, sample size, success criteria. This walkthrough is the orchestration: experiment-design produces the spec, experimentation-platform-orchestrator picks the tool, feature-flagging configures rollout, product-analytics-setup validates the underlying data, and experimentation-analytics interprets the result. The skill is a tool; the walkthrough is the workflow that uses several tools together.
- When should I NOT run an AB test?
- Several cases. When the required sample size exceeds your traffic by an order of magnitude (run a different research method instead). When the change is foundational and rolling back would be expensive (do a staged rollout with monitoring instead). When the change is primarily aesthetic without a clear business-impact hypothesis (define the hypothesis first, or skip the test). When you cannot identify any segment that would meaningfully care (the change may not warrant shipping at all).
- How does this walkthrough relate to the launch-a-feature walkthrough?
- AB tests are one tool inside a feature launch. A full feature launch may include an AB test as a stage gate (test the new flow at 5% before ramping); a small UI change may be a standalone AB test that ships or kills based on the result. The launch-a-feature walkthrough covers the broader product-launch orchestration; this walkthrough covers the experimentation discipline specifically.
Metrics shown are illustrative. Actual results vary by platform, methodology, and traffic volume.