Flagship Skill · Experimentation analytics

The experimentation analytics skill.

Read result panels without fooling yourself.

A data-team mentor's playbook for interpreting experiment results. It codifies the discipline that prevents the most common interpretation failures: p-value worship, peek-driven early stops, post-hoc segment mining, launch-dilution misreads, CUPED misinterpretation, ratio-metric variance errors, and the directional-ship trap. Built for product managers and data analysts reading results together.

Audience: product managers and data analysts. Adjacent: growth teams and engineers who own the platform.

What this skill is for

The discipline that prevents misreading the result panel.

The result panel is the moment of truth for an experiment. The numbers on it determine whether you ship, kill, or iterate. They also expose every shortcut taken in the design phase: an underpowered test produces wide confidence intervals; a peeked test produces an artificially small p-value; a ratio metric without delta-method correction produces a too-narrow CI around the lift. Most ship-the-wrong-thing decisions trace back to misreading the result panel.

This skill is the discipline that prevents misreading. It assumes the experiment was designed well. It assumes the platform is technically correct (most modern platforms are). It assumes you can read a number off a screen. The hard part is knowing what each number actually means and what it does not, and that is what this skill covers.

The output is not statistics homework. The output is a defensible decision: ship, kill, or inconclusive, with a written rationale that survives scrutiny in a room of skeptical stakeholders six months later.

What is in the skill

Fourteen considerations covered in the body.

The SKILL.md spans the full result-interpretation lifecycle from reading the panel through communicating the decision. Each section names a common failure mode and the discipline that prevents it.

  1. The result panel

    What every modern platform should expose: variants, allocation, per-variant metrics with CIs, lift, significance, variance reduction, guardrails, segments, time series. Anything missing is a red flag.

  2. Confidence intervals

    The single most important number on the panel. Width matters more than center. Five practical decision rules cover all-positive, all-negative, narrow-around-zero, and wide-around-zero cases.

  3. P-values

    What they mean (the probability of data at least this extreme if the null were true) and what they do not (the probability the treatment works). The 0.05 cutoff is convention, not law. Always read alongside the CI.

  4. Multiple testing corrections

    With twenty comparisons at alpha 0.05, expect one false positive purely by chance. Use Bonferroni for family-wise error control, Benjamini-Hochberg for false discovery rate control. Pre-register primary metrics; treat the rest as exploratory. A short sketch follows this list.

  5. Sequential testing math

    Daily peeking inflates the false positive rate from 5 percent toward 30 percent on a four-week test. Always-valid p-values (mSPRT) and group sequential designs survive peeking. Statsig, Eppo, and GrowthBook support them; older platforms do not. A simulation sketch follows this list.

  6. CUPED variance reduction

    Same point estimate, narrower CI by 30 to 50 percent. Effective sample size roughly doubles. Use it whenever pre-experiment data is available and informative. CUPED reduces variance, not the lift; do not misread a smaller-looking lift after CUPED as a weaker treatment. A worked sketch follows this list.

  7. Heterogeneous treatment effects

    Treatment works differently for different segments. Pre-registered segment effects are evidence; post-hoc segments are noise mining. Segment-only shipping requires the segment, the targeting infrastructure, and the maintenance commitment.

  8. Ratio metrics and the delta method

    Conversion rate, click-through rate, and revenue per user are all ratio metrics. Naive variance estimators understate the uncertainty and produce false positives. Verify your platform uses the delta method or the bootstrap. A worked sketch follows this list.

  9. Bayesian vs frequentist panels

    Most platforms support both. Pick one per experiment and stick with it; switching mid-flight is the Bayesian-frequentist version of p-hacking. Both produce similar ship decisions when the experiment is designed correctly.

  10. Network effects and SUTVA violation

    Marketplaces, social products, supply-constrained features. Treatment leaks into control through interference, understating the true effect by 2x to 3x. Address it with cluster randomization, switchback designs, or geographic experiments.

  11. Dashboard vs experiment reconciliation

    The BI number rarely matches the experiment number. Different denominators, time windows, external effects, selection effects, definitions, or pipeline lag. Communicate the difference without losing stakeholder trust.

  12. Long-term effect estimation

    Most tests run two to four weeks; many decisions need 30 to 90 days. Holdout groups, geo experiments, and difference-in-differences cover the long-term measurement need. Set up the holdout at launch, not later.

  13. Common interpretation failures

    P-value worship, early stop on a peek, post-hoc segment mining, launch dilution misread, guardrail reframing, CUPED misinterpretation, opposite-segment shipping, directional ship, and eight more patterns the skill catalogs.

  14. The discipline of inconclusive

    Most results are not clean ship-or-kill. The hardest call is 'we do not have enough signal to ship.' The discipline of saying it is the discipline of caring about being right more than being decisive.
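
A few of the considerations above are mechanical enough to show in miniature. The sketches that follow are minimal Python illustrations written for this page, not code from the skill itself; the metric names, numbers, and simulated data in them are hypothetical.

For consideration 4: twenty secondary-metric p-values, with Bonferroni and Benjamini-Hochberg applied side by side. The step-up rule is implemented directly so it is visible (assumes numpy):

    import numpy as np

    def benjamini_hochberg(p_values, alpha=0.05):
        # Step-up rule: find the largest k with p_(k) <= (k / m) * alpha,
        # then reject the k smallest p-values. Controls the false discovery rate.
        p = np.asarray(p_values)
        m = len(p)
        order = np.argsort(p)
        passes = p[order] <= (np.arange(1, m + 1) / m) * alpha
        reject = np.zeros(m, dtype=bool)
        if passes.any():
            k = np.nonzero(passes)[0].max()
            reject[order[: k + 1]] = True
        return reject

    # Twenty hypothetical secondary-metric p-values from one experiment readout.
    p_values = np.array([0.001, 0.004, 0.019, 0.030, 0.041] + [0.2 + 0.04 * i for i in range(15)])
    bonferroni = p_values <= 0.05 / len(p_values)     # family-wise error control: strictest
    bh = benjamini_hochberg(p_values, alpha=0.05)     # false discovery rate control: less strict
    print(bonferroni.sum(), "rejections under Bonferroni,", bh.sum(), "under Benjamini-Hochberg")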
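
For consideration 5: an A/A simulation of the peeking problem. There is no true effect, so every significant read is a false positive. Reading once at the planned end stays near 5 percent; stopping the first day a naive fixed-horizon test crosses 1.96 does not (assumes numpy; runs in a few seconds):

    import numpy as np

    rng = np.random.default_rng(0)
    n_experiments, days, users_per_day, z_crit = 1000, 28, 200, 1.96

    def z_stat(control, treatment):
        se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
        return (treatment.mean() - control.mean()) / se

    fp_single_read, fp_daily_peek = 0, 0
    for _ in range(n_experiments):
        # A/A test: both arms draw from the same distribution, so the true lift is zero.
        control = rng.normal(size=(days, users_per_day))
        treatment = rng.normal(size=(days, users_per_day))
        daily_z = [z_stat(control[:d].ravel(), treatment[:d].ravel()) for d in range(1, days + 1)]
        fp_single_read += abs(daily_z[-1]) > z_crit              # one read at the planned end
        fp_daily_peek += any(abs(z) > z_crit for z in daily_z)   # stop the first day it "looks significant"

    print(f"false positive rate, single read at the end: {fp_single_read / n_experiments:.2f}")
    print(f"false positive rate, peeking daily:          {fp_daily_peek / n_experiments:.2f}")

On a run like this the daily-peeking rate typically lands in the 20-to-30-percent range; the always-valid methods named above exist to make intermediate reads safe.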
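
For consideration 6: a CUPED adjustment on simulated data. The pre-experiment covariate soaks up user-to-user variance the treatment did not cause, so the lift estimate barely moves while the CI narrows (assumes numpy; the +0.5 true lift and the spend numbers are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5_000
    # Hypothetical data: pre-experiment spend predicts in-experiment spend; the true lift is +0.5.
    pre_c, pre_t = rng.normal(10, 3, n), rng.normal(10, 3, n)
    y_c = 0.8 * pre_c + rng.normal(0, 2, n)
    y_t = 0.8 * pre_t + rng.normal(0, 2, n) + 0.5

    def lift_and_ci(a, b):
        diff = b.mean() - a.mean()
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        return round(diff, 3), (round(diff - 1.96 * se, 3), round(diff + 1.96 * se, 3))

    # CUPED: subtract the component of the metric explained by pre-experiment behavior.
    # theta is estimated on the pooled data so both arms receive the same adjustment.
    x, y = np.concatenate([pre_c, pre_t]), np.concatenate([y_c, y_t])
    theta = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
    adj_c = y_c - theta * (pre_c - x.mean())
    adj_t = y_t - theta * (pre_t - x.mean())

    print("unadjusted lift, 95% CI:", lift_and_ci(y_c, y_t))
    print("CUPED lift, 95% CI:     ", lift_and_ci(adj_c, adj_t))   # same estimate in expectation, visibly narrower CI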
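
For consideration 8: the delta-method standard error for a per-user ratio metric next to the naive one that treats every view as independent. With per-user heterogeneity in click propensity, the naive SE is visibly too small (assumes numpy; the traffic numbers are hypothetical):

    import numpy as np

    def ratio_metric_se(numerator, denominator):
        # Delta-method SE for sum(numerator) / sum(denominator), randomized at the user level.
        n = len(numerator)
        x_bar, y_bar = numerator.mean(), denominator.mean()
        r = x_bar / y_bar
        var_x, var_y = numerator.var(ddof=1), denominator.var(ddof=1)
        cov_xy = np.cov(numerator, denominator, ddof=1)[0, 1]
        return r, np.sqrt((var_x - 2 * r * cov_xy + r**2 * var_y) / (n * y_bar**2))

    rng = np.random.default_rng(2)
    views = rng.poisson(20, size=10_000) + 1          # exposures per user
    click_prob = rng.beta(2, 18, size=10_000)         # click propensity varies across users
    clicks = rng.binomial(views, click_prob)          # clicks per user

    ctr, se_delta = ratio_metric_se(clicks, views)
    p = clicks.sum() / views.sum()
    se_naive = np.sqrt(p * (1 - p) / views.sum())     # pretends every view is independent
    print(f"CTR {ctr:.4f}   delta-method SE {se_delta:.5f}   naive SE {se_naive:.5f}")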

Reference files

Seven references that go alongside the SKILL.md.

The references hold the cheatsheets, technical depth, communication templates, and pattern catalogs the SKILL.md cites. Each is a self-contained doc the team can lift into a project without reading the rest.

  • references/confidence-interval-cheatsheet.md

    How to read a CI, what to ignore, the five decision rules with worked examples for each. Includes the 'looks great' trap walkthrough.

  • references/p-value-interpretation-guide.md

    What the p-value means, what people pretend it means, the 0.05 convention, the peeking problem, multiple testing context, and Bayesian alternatives. Communication templates for non-technical stakeholders.

  • references/statistical-method-reference.md

    The technical reference for analysts. CUPED, delta method, sequential testing (mSPRT, group sequential, anytime-valid), HTE handling, multiple testing corrections, cluster randomization. Verification questions for each.

  • references/dashboard-vs-experiment-reconciliation.md

    Why the BI number rarely matches the experiment number. The blended-attribution trap. Six-step reconciliation checklist for the moment of stakeholder confusion.

  • references/result-presentation-templates.md

    Five templates for stakeholder communication: clear win, clear loss, inconclusive (the most common, hardest), mixed (positive primary with ambiguous guardrail), and long-term holdout report.

  • references/analytics-platform-comparison.md

    Profiles of seven platforms (Statsig, PostHog, Optimizely, GrowthBook, Eppo, Amplitude, Kameleoon) covering what each exposes, what each hides, and the gotchas in each. Quick comparison table.

  • references/common-interpretation-failures.md

    Sixteen failure patterns: name, symptom, root cause, fix, prevention. The pattern behind the patterns and the defense against motivated reasoning.

Browse all reference files on GitHub

Where to use it

The full result-interpretation lifecycle.

When results land. Read the relevant sections before making the call: confidence interval, p-value, multiple testing, sequential testing, CUPED, ratio-metric handling. The cheatsheets and the statistical-method reference are the tools.

When the segment story gets interesting. Pre-registered segments are evidence. Post-hoc segments are noise mining. The HTE section and the common-failures reference cover the line and how to stay on the right side of it.

When the dashboard disagrees with the experiment. The reconciliation reference covers the six common explanations and the checklist for the moment of stakeholder confusion. Communicate the difference without losing trust in either number.

When you need to communicate the result. The five presentation templates (clear win, clear loss, inconclusive, mixed, long-term holdout) cover the most common stakeholder messages. Use the right template for the actual result, not the one that flatters the team.

Where this skill goes next

Third skill in the PM-experimentation suite.

Experimentation analytics is the third of three foundational skills in the PM-experimentation suite. It pairs with experiment-design (pre-experiment thinking) and feature-flagging (operational mechanics). Read whichever one fits the current phase of the work.

experiment-design covers the pre-experiment layer: hypothesis writing, sample size and MDE planning, test duration, segment pre-registration, what NOT to A/B test, and the pre-commitment discipline that makes results trustworthy when they land. Read it before designing the test.

feature-flagging covers the operational layer below: flag taxonomy, environment management, change request workflows, and stale flag cleanup. This skill assumes the flag side is well managed; feature-flagging documents that side as its own discipline. Its skill page lands when feature-flagging ships.

An optional fourth skill, experimentation-platform-orchestrator, may follow after the three foundational skills land. That skill schedules; this skill interprets.

Open source under MIT

Read the SKILL.md on GitHub.

The skill source lives in the rampstackco/claude-skills repository alongside dozens of other skills covering the full lifecycle of brand and product work. MIT licensed.

Frequently asked questions

How is this different from a stats course?
A stats course teaches you what a confidence interval is. This skill teaches you how to read one on a result panel and decide whether to ship. The math is in the references where it matters; the discipline is in the body. Stats courses produce people who can compute correctly; this skill produces people who interpret correctly under shipping pressure.
Do I need to be a data scientist to use this?
No. The audience is product managers and data analysts working together. The skill explains each concept in terms a PM can act on without pretending the statistics are simpler than they are. The references go deeper for the analyst on the same team.
How does it pair with the experiment-design skill?
Experiment-design is the pre-experiment thinking: hypothesis, sample size, MDE, what NOT to test. Read it before designing the test. Experimentation-analytics is the during-and-after-experiment thinking: how to read the result panel, when the numbers are trustworthy, how to communicate the result. Read it when results land. Together with the feature-flagging skill, the three cover the full PM-experimentation lifecycle.
Why so much focus on confidence intervals over p-values?
Most teams over-rely on the p-value as a binary ship signal and under-read the CI. The CI tells you the magnitude of the effect; the p-value tells you the strength of evidence against the null. Both matter, but the CI is what determines whether the lift is large enough to justify the implementation cost. The width of the CI is the precision signal; the position of zero relative to the CI is the existence signal. Reading just the p-value loses the magnitude.
What is CUPED and why does the skill spend a section on it?
CUPED uses pre-experiment behavior of the same users to subtract out their baseline, leaving a cleaner signal of the treatment effect. Same point estimate, narrower CI, often by 30 to 50 percent. Effective sample size roughly doubles for free. Most modern platforms (Statsig, Eppo, GrowthBook, parts of PostHog and Amplitude) support it. Knowing how to read CUPED-adjusted results without misinterpreting them is one of the higher-impact statistical skills for PMs and analysts.
What happens when the dashboard and the experiment disagree?
Almost always one of: different denominators, different time windows, external effects the experiment correctly excludes, selection effects in enrollment, different metric definitions across systems, or pipeline lag. The dashboard-vs-experiment-reconciliation reference covers each pattern with how to diagnose and how to communicate the difference to stakeholders without losing trust in either number.