Flagship Skill · Experiment design

The experiment design skill.

Hypothesis to decision, without the common traps.

A senior product manager's playbook for running experiments that produce trustworthy decisions. Codifies the discipline that prevents the most common failures: vague hypotheses, underpowered tests, novelty-effect ships, post-hoc segment mining, peeking creep, overlooked guardrails, and the directional-ship trap. Built for PMs and the AI agents working alongside them.

Audience: product managers and growth teams. Adjacent: data scientists who partner with PMs on experimentation.

What this skill is for

The discipline that prevents shipping the wrong thing.

The default state of experimentation in most companies is sloppy. PMs run tests against vague hypotheses, look at results too early, ignore guardrails, mine segments for noise, and ship features whose lift is mostly measurement error. The cost is real: ship the wrong thing, kill the right thing, learn the wrong lesson, repeat.

This skill is the discipline that prevents most of those mistakes. It assumes the team has a working experimentation platform; it does not advocate for one. It assumes the team can deliver real treatment changes; it does not cover engineering. The hard part is the thinking, and that is what is here.

The output is not statistics homework. The output is a defensible decision: ship, kill, or inconclusive, with a written rationale that survives scrutiny in a room of skeptical stakeholders six months later.

What is in the skill

Twelve considerations covered in the body.

The SKILL.md spans the full experiment lifecycle from hypothesis through decision. Each section names a common failure mode and the discipline that prevents it.

  01. Hypothesis discipline

    Cause, effect, magnitude, and mechanism. The hypothesis names what is being tested, what should move, by how much, and why. Vague hypotheses are the root cause of most experiment failures.

  02. Sample size and minimum detectable effect

    Whether the test has enough traffic to detect the effect at the chosen power. Refuse to run underpowered tests; they produce noise dressed up as evidence. A worked sample-size sketch follows this list.

  03. Test duration

    Run for the longer of the time needed to reach the required sample size and a full weekly cycle. UI/UX changes need at least 14 days regardless. Use holdouts for permanent feature changes, capped at 4 to 6 weeks before the world changes around the test.

  04. What NOT to A/B test

    UX bugs, legal-required changes, brand-philosophy questions, decisions already made, designs whose randomization cannot be clean. Some questions are not experiment-shaped.

  05. Segment analysis

    Pre-registered segments are evidence; post-hoc segments are noise mining. The multiple comparisons problem is real; with 20 segments at p=0.05, expect one false positive purely by chance. The arithmetic is sketched after this list.

  06. Interaction effects

    Concurrent tests on the same surface can interfere. Mutex enforcement or coordination required. Post-hoc detangling is hard, expensive, and usually inconclusive; coordinate up front.

  07. Ratio metrics and the delta method

    Naive variance estimators on ratios understate uncertainty, so confidence intervals come out too narrow and false positives too frequent. Confirm the platform uses a ratio-aware estimator (delta method, bootstrap). A delta-method sketch follows this list.

  08. Network effects and two-sided markets

    Treatment can leak into control via interference. Cluster randomization, switchback experiments, or geographic isolation when needed. Sometimes the right answer is qualitative research, not experimentation.

  09. Sequential testing and the peeking problem

    Daily peeking inflates false positive rates from 5 percent toward 30 percent on a four-week test. Use sequential testing methods when available; pre-commit to a single end-of-test analysis otherwise. A simulation of the inflation follows this list.

  10. Pre-commitment vs p-hacking

    Write down the primary metric, MDE, duration, segments, and decision rule before launch. Apply mechanically when results come in. Pre-commitment is the discipline; p-hacking is the absence of it.

  11. Reading results and making the call

    Three buckets: clear win (ship), clear loss (kill), inconclusive (the hardest case). The inconclusive bucket exists for a reason; resist the pull to ship anyway.

  12. Common failures and fixes

    A rapid-fire pattern catalog: novelty effect ships, post-hoc segment wins, peeking creep, guardrail violations, campaign confounds, directional ships, interaction blind spots, and more.
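
A minimal sketch of the sample-size arithmetic behind 02, using the standard two-proportion normal approximation. The baseline rate, relative MDE, alpha, and power below are illustrative assumptions, not recommendations from the skill; the sample-size-tables reference holds the pre-calculated values.

```python
# Per-arm sample size for a two-proportion test (normal approximation).
# Baseline, MDE, alpha, and power are illustrative assumptions.
from scipy.stats import norm

def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.8):
    p1 = baseline
    p2 = baseline * (1 + relative_mde)       # treatment rate at the MDE
    z_alpha = norm.ppf(1 - alpha / 2)        # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Example: 4% baseline conversion, 5% relative lift -> roughly 154k users per arm.
print(round(sample_size_per_arm(0.04, 0.05)))
```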
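
The arithmetic behind the segment warning in 05, assuming the segments are independent; real segments overlap, so treat these as rough figures.

```python
# False-positive arithmetic for 20 post-hoc segments at alpha = 0.05,
# assuming independent segments (an idealization; real segments overlap).
alpha, segments = 0.05, 20

expected_false_positives = segments * alpha      # 1.0 expected by chance alone
p_at_least_one = 1 - (1 - alpha) ** segments     # ~0.64 chance of at least one
print(expected_false_positives, round(p_at_least_one, 2))
```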
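
A sketch of the delta-method standard error for a ratio metric from 07, such as revenue per session computed from user-level totals. The data is synthetic and the helper is made up for illustration; it assumes the randomization unit is the user and users are independent.

```python
# Delta-method standard error for a ratio metric Y/X, where Y and X are
# per-user totals (e.g. revenue and sessions). Synthetic data; assumes
# independent users (i.e. the randomization unit is the user).
import numpy as np

def ratio_and_se(y, x):
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y)
    r = y.mean() / x.mean()                  # the ratio metric itself
    cov = np.cov(y, x, ddof=1)               # 2x2 sample covariance matrix
    var_r = (cov[0, 0] - 2 * r * cov[0, 1] + r ** 2 * cov[1, 1]) / (n * x.mean() ** 2)
    return r, np.sqrt(var_r)

rng = np.random.default_rng(0)
sessions = rng.poisson(5, size=10_000) + 1   # sessions per user
revenue = rng.gamma(2.0, 1.5, size=10_000) * sessions
r, se = ratio_and_se(revenue, sessions)
print(f"revenue per session = {r:.3f} +/- {1.96 * se:.3f} (95% CI half-width)")
```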
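
A small A/A simulation of the peeking problem from 09: checking a four-week test daily and stopping at the first significant look inflates the false positive rate well past the nominal 5 percent. Traffic and conversion numbers are illustrative.

```python
# A/A simulation: peek daily for 28 days and "ship" at the first p < 0.05.
# There is no real effect, so every significant stop is a false positive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_sims, days, users_per_day, p_conv = 2_000, 28, 1_000, 0.05

false_positives = 0
for _ in range(n_sims):
    a = rng.binomial(users_per_day, p_conv, size=days).cumsum()  # cumulative conversions, arm A
    b = rng.binomial(users_per_day, p_conv, size=days).cumsum()  # cumulative conversions, arm B
    n = users_per_day * np.arange(1, days + 1)                   # cumulative users per arm
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a / n - b / n) / se
    p = 2 * (1 - norm.cdf(np.abs(z)))
    if (p < 0.05).any():                     # stop at the first "significant" peek
        false_positives += 1

print(f"false positive rate with daily peeking: {false_positives / n_sims:.1%}")
```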

Reference files

Seven references that go alongside the SKILL.md.

The references hold the templates, tables, checklists, and pattern catalogs the SKILL.md cites. Each is a self-contained doc the PM can lift into a project without reading the rest.

  • references/hypothesis-templates.md

    Five concrete templates with worked examples across conversion, engagement, revenue, retention, and funnel-step metric types. Includes anti-templates that should be rewritten before launch.

  • references/sample-size-tables.md

    Pre-calculated sample size tables for common conversion-rate baselines and MDEs. How-to-use guidance and the common pitfalls in sample-size planning.

  • references/common-failures.md

    Fifteen anti-patterns that produce wrong shipping decisions. Each with symptom, root cause, fix, and prevention.

  • references/results-interpretation-checklist.md

    Eight-step checklist for reading results, the three-bucket decision matrix, and the 30-day post-launch monitoring rule.

  • references/platform-comparison.md

    Profiles of the seven major experimentation platforms with strengths, gotchas, and a decision matrix for choosing. Pairs with the per-platform integration microsites for MCP setup.

  • references/pre-experiment-readiness-checklist.md

    Ten-item go/no-go checklist run through before launching any experiment. If any item is no, delay the launch until it is yes.

  • references/post-experiment-decision-framework.md

    The moment-of-decision framework: confirm pre-commitment, apply rule mechanically, route to ship / kill / inconclusive paths, write the post-mortem within a week.

Browse all reference files on GitHub

Where to use it

The full experiment lifecycle.

Pre-experiment. Read the relevant sections before designing the test. The pre-experiment-readiness checklist is the gate; if any item is no, delay the launch. The hypothesis templates and sample-size tables are the tools.

During the experiment. The duration and peeking sections protect the discipline while the test runs. Pre-commit to the analysis date; use sequential testing if the platform supports it; do not make decisions on intermediate peeks.

Post-decision. The results-interpretation checklist and the post-experiment-decision-framework cover the moment of decision. Confirm pre-commitment was followed, apply the rule mechanically, route to ship / kill / inconclusive, write the post-mortem within a week.

Post-launch. Even shipped experiments need the 30-day rule: production behavior reviewed at +7, +14, and +30 days. Most successful tests behave the same way in production as in the test; the ones that diverge are usually telling you something is wrong.

Where this skill goes next

Flagship of the PM-experimentation suite.

Experiment design is the first of three foundational skills in the PM-experimentation suite. The other two are forthcoming.

feature-flagging covers the operational layer below: flag taxonomy, environment management, change request workflows, and stale flag cleanup. The skill experiment-design uses feature flags as the delivery mechanism for treatment variants; feature-flagging documents the flag side as its own discipline.

experimentation-analytics covers the analytical layer above: variance reduction techniques (CUPED, stratified sampling, control variates), Bayesian alternatives, sequential testing math, and the deeper interpretation of marginal results.

An optional fourth skill, experimentation-platform-orchestrator, may follow after the three foundational skills land. That skill schedules; this skill designs.

Cross-links to those skills land when their pages ship. For now, the names are placeholders so PMs know what is coming.

Open source under MIT

Read the SKILL.md on GitHub.

The skill source lives in the rampstackco/claude-skills repository alongside sixty-two other skills covering the full lifecycle of brand and product work. MIT licensed.

Frequently asked questions.

How is this different from a stats course?
A stats course teaches you how to compute a t-test or interpret a confidence interval. This skill teaches you when to run an experiment in the first place, how to write a hypothesis that survives the result, and how to make a defensible decision. The math is in the references where it matters; the discipline is in the body. Stats courses produce people who can compute correctly; this skill produces people who design correctly.
Does it depend on a specific experimentation platform?
No. The principles work on Statsig, PostHog, GrowthBook, Optimizely, Amplitude, Eppo, and Kameleoon equally. The platform-comparison reference helps with choosing if the team has not picked one yet. For platform-specific MCP commands and example prompts, pair this skill with the matching /integrations/{platform} microsite.
What about Bayesian experiments?
The skill is platform- and methodology-agnostic for the discipline. Most of the principles (hypothesis discipline, MDE planning, what NOT to test, segments, interactions, decision-making) apply to Bayesian and frequentist experiments equally. The variance-estimation and sequential-testing sections lean frequentist because that is what most platforms ship; the experimentation-analytics skill (forthcoming) will cover Bayesian alternatives in depth.
Why does the skill say not to A/B test some things?
Some questions are not experiment-shaped. UX bugs, legal-required changes, brand strategy questions, decisions already committed to. Running tests on those is theater that erodes trust in the discipline. The skill names them explicitly so the team has a vocabulary for declining.
What is the inconclusive bucket about?
Most teams treat experiment results as binary: shipped or not. Reality is three buckets: clear win, clear loss, inconclusive. The inconclusive bucket is the most common and the hardest because the temptation is to ship anyway since the team has invested in the hypothesis. The skill defends the inconclusive bucket as a valid outcome and provides a structured path for resolving it.