When is there enough traffic to A/B test?

Roughly 5,000 or more monthly conversions per variant. Below that, A/B testing will not produce reliable results, so iterate through design changes and qualitative research instead. The exact sample depends on the baseline conversion rate and the minimum detectable effect: at a 2 percent baseline, detecting a 20 percent relative lift needs around 19,000 visitors per variant, while at a 10 percent baseline the same 20 percent relative lift needs around 3,500 (both at the conventional 95 percent significance and 80 percent power). Run a sample-size calculator before committing.
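The sample figures above are close to what the standard two-proportion approximation returns. A minimal sketch, assuming a two-sided test at 95 percent significance and 80 percent power, and using the baseline variance for both arms; the function name and defaults are illustrative, not taken from any particular calculator:

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in each variant for a two-sided test.

    Uses the baseline variance for both arms, a common simplification.
    Calculators that use each arm's own variance return somewhat larger numbers.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    delta = baseline * relative_lift               # absolute difference to detect
    variance = 2 * baseline * (1 - baseline)
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


for baseline in (0.02, 0.10):
    n = sample_size_per_variant(baseline, relative_lift=0.20)
    print(f"baseline {baseline:.0%}: about {n:,} visitors per variant")
```

The required sample scales with the inverse square of the effect, so halving the minimum detectable lift roughly quadruples the traffic needed.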

How do I write a good hypothesis?

Use the structure: 'Because [observation from the audit], we believe that [specific change] will produce [measurable outcome with a target] for [user segment], because [reason it would work].' A good hypothesis names a specific change (not 'improve the design'), a measurable outcome, grounding in evidence, and a known mechanism. 'Make it better' is not a hypothesis. Prioritize candidates with ICE (impact, confidence, ease) or PIE (potential, importance, ease), scoring each and testing the highest combined scores first.
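ICE prioritization can be sketched in a few lines. This assumes a 1 to 10 scale and averages the three scores; some teams multiply them instead. The backlog entries and their scores are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # 1-10: how much the change could move the metric
    confidence: int  # 1-10: how strong the audit evidence is
    ease: int        # 1-10: how cheap it is to build and test

    @property
    def ice(self) -> float:
        # Averaging is one convention; multiplying the scores is another.
        return (self.impact + self.confidence + self.ease) / 3


def prioritize(backlog):
    """Return hypotheses ordered from highest to lowest ICE score."""
    return sorted(backlog, key=lambda h: h.ice, reverse=True)


# Hypothetical backlog for illustration only.
backlog = [
    Hypothesis("Shorten checkout form", impact=8, confidence=7, ease=5),
    Hypothesis("Rewrite hero headline", impact=5, confidence=4, ease=9),
    Hypothesis("Add trust badges at payment", impact=6, confidence=8, ease=9),
]

for h in prioritize(backlog):
    print(f"{h.ice:.1f}  {h.name}")
```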

Why should I not stop a test the moment it hits significance?

Because peeking and early stopping bias the result toward false positives: an early significant reading often regresses as more data arrives. Set the required sample size and a minimum duration (at least two weeks to cover a full business cycle) before launch, and run to them. Also avoid running multiple overlapping tests on the same flow, testing during atypical periods like holidays, and re-running a test until it 'wins,' which manufactures a false positive.
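The effect of peeking is easy to see in a simulation. The sketch below runs A/A tests, where both arms share the same true rate so every "significant" result is a false positive, and compares checking at every interim look with checking once at the planned sample. The traffic numbers, look count, and seed are arbitrary assumptions:

```python
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=400, visitors_per_arm=2000, looks=10,
                      rate=0.05, alpha=0.05, seed=7):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    step = visitors_per_arm // looks
    hit_at_any_look = 0
    hit_at_final_look = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        crossed = False
        for look in range(1, looks + 1):
            # Both arms draw from the same true conversion rate.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = two_sided_p_value(conv_a, look * step, conv_b, look * step)
            crossed = crossed or p < alpha
        hit_at_any_look += crossed
        hit_at_final_look += p < alpha  # p now holds the final look's value
    return hit_at_any_look / n_tests, hit_at_final_look / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first significant look: {peeking_rate:.1%} false positives")
print(f"single look at planned sample:  {fixed_rate:.1%} false positives")
```

With settings like these the peeking rate typically comes out several times higher than the single-look rate, even though nothing differs between the arms.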

What is the difference between 95% significance and a 95% chance of winning?

A 95 percent significance level means that if there were truly no difference between the variants, there would be only a 5 percent chance of seeing results at least this extreme. That is not the same as a 95 percent chance the variant is the winner. Many CRO tools instead report a Bayesian probability (a '95 percent chance of being best'), which is a different statement, so check the methodology your specific tool uses before treating its number as a guarantee.
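The two statements can be computed side by side on the same data. This is a sketch under stated assumptions: a pooled two-proportion z-test for the frequentist p-value, and uniform Beta(1, 1) priors with Monte Carlo sampling for the Bayesian probability. Real tools may use different priors, corrections, or sequential methods, and the counts are hypothetical:

```python
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Chance of a gap at least this large if the true rates were equal."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: control converts 200 of 10,000, variant 240 of 10,000.
p = two_sided_p_value(200, 10_000, 240, 10_000)
chance = prob_b_beats_a(200, 10_000, 240, 10_000)
print(f"p-value: {p:.3f}")
print(f"posterior probability variant is better: {chance:.1%}")
```

On these counts the p-value lands just above 0.05 while the posterior probability sits above 95 percent, so the same data can miss one threshold and clear the other.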

Why define guardrail metrics?

Because optimizing a single metric can quietly damage another. A test that lifts conversion but craters average order value is a net loss, and a single-metric obsession misses it. Define one primary metric plus guardrail metrics that must not go down, and analyze results by segment as well as overall, since a top-line lift can hide a negative impact on an important segment.
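A guardrail and segment check can be sketched as below. The data, the 2 percent tolerance on average order value, and the function names are all hypothetical, and the sketch compares raw rates only; a real analysis would also test whether each difference is statistically meaningful:

```python
def relative_change(control, variant):
    return (variant - control) / control


def totals(arm):
    """Sum (visitors, conversions, revenue) across segments."""
    visitors = sum(v for v, _, _ in arm.values())
    conversions = sum(c for _, c, _ in arm.values())
    revenue = sum(r for _, _, r in arm.values())
    return visitors, conversions, revenue


def report(control, variant, guardrail_tolerance=-0.02):
    c_vis, c_conv, c_rev = totals(control)
    v_vis, v_conv, v_rev = totals(variant)
    findings = {
        # Primary metric: overall conversion rate.
        "conversion_lift": relative_change(c_conv / c_vis, v_conv / v_vis),
        # Guardrail metric: average order value.
        "aov_change": relative_change(c_rev / c_conv, v_rev / v_conv),
        "losing_segments": [],
    }
    findings["guardrail_breached"] = findings["aov_change"] < guardrail_tolerance
    for segment in control:
        cv, cc, _ = control[segment]
        vv, vc, _ = variant[segment]
        if relative_change(cc / cv, vc / vv) < 0:
            findings["losing_segments"].append(segment)
    return findings


# Made-up results keyed by segment: (visitors, conversions, revenue).
control = {"desktop": (5000, 300, 24000.0), "mobile": (5000, 150, 9000.0)}
variant = {"desktop": (5000, 360, 25200.0), "mobile": (5000, 135, 7425.0)}

for key, value in report(control, variant).items():
    print(key, value)
```

Here the top-line conversion rate rises while average order value falls past the tolerance and the mobile segment converts worse, which is exactly the case a single-metric readout would miss.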

Why document losing tests?

So the same hypothesis is not re-tested three times, and so the program learns. CRO compounds across tests: capturing the lesson whether a test wins or loses is where programs actually move the needle over time. Treating each test in isolation throws that compounding away, and the documentation gap (wins captured, losses forgotten) is a common and expensive failure.

Skill · CRO optimization

CRO optimization.

Audit, hypothesize, test, decide.

Run conversion rate optimization as a structured discipline rather than a pile of guesses: audit the funnel, generate hypotheses grounded in evidence, design tests that produce unambiguous answers, then decide on the data. The work is for testing existing pages and flows that already carry traffic.

The discipline is statistical. Conversion testing needs more sample than people intuit, and the program that records its losses compounds where one-off tests do not.

Audience: growth and product teams optimizing a page or flow that converts below expectation, diagnosing a high-drop-off funnel step, or interpreting an ambiguous test result.


The framework

Diagnose before you treat.

CRO runs as a loop: audit, hypothesize, test, decide. Each phase feeds the next, and the lessons compound back into the audit.

01 Audit: quantitative (funnel drop-off, segmentation, performance), qualitative (session replay, heatmaps, surveys of abandoners, form analytics), and heuristic. The audit produces the friction points that become hypotheses.
02 Hypothesis: a testable statement grounded in the audit. 'Because [observation], we believe [change] will produce [outcome] for [segment], because [reason].' Prioritize with ICE or PIE.
03 Test design: a sample size from the baseline rate and minimum detectable effect, a minimum duration covering a full business cycle, one primary metric, guardrail metrics, and decision criteria set before launch.
04 Decide: ship a clear winner, kill a clear loser and keep the lesson, and do not ship a tied variant. Investigate when an overall win hides a losing segment.
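The decide step can be written down as a rule before launch so it is not renegotiated afterwards. A minimal sketch, assuming the inputs come from the finished test; the thresholds and the choice to treat a guardrail breach as a loss are illustrative assumptions, not a prescribed policy:

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship the winner"
    KILL = "kill and keep the lesson"
    NO_SHIP = "tied: do not ship"
    INVESTIGATE = "investigate the losing segment"


def decide(p_value, lift, guardrail_breached, losing_segments, alpha=0.05):
    """Map a finished test's results onto the four outcomes."""
    if p_value >= alpha:
        return Decision.NO_SHIP      # no clear difference, so nothing ships
    if lift < 0 or guardrail_breached:
        return Decision.KILL         # clear loser, or a win that costs more than it earns
    if losing_segments:
        return Decision.INVESTIGATE  # overall win hiding a losing segment
    return Decision.SHIP


print(decide(p_value=0.01, lift=0.12, guardrail_breached=False,
             losing_segments=["mobile"]).value)
```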

The discipline

More sample than you think, and no peeking.

Conversion testing needs more sample than people intuit. Run a sample-size calculation before launch from the baseline rate and the minimum effect worth caring about, then run to that sample and a minimum duration that covers a full business cycle, usually at least two weeks to capture weekends and weekly patterns. Below roughly 5,000 monthly conversions per variant, A/B testing will not produce reliable results, and the work shifts to design changes and qualitative research instead.
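Turning a required sample into a planned duration is simple arithmetic. The sketch below assumes traffic is split evenly across variants and rounds up to whole weeks so every weekday is represented equally; the rounding rule and the traffic figures are assumptions for illustration:

```python
from math import ceil


def planned_duration_days(sample_per_variant, variants, daily_visitors, min_days=14):
    """Days to run: enough for the sample, at least min_days, in whole weeks."""
    total_needed = sample_per_variant * variants
    days_for_sample = ceil(total_needed / daily_visitors)
    days = max(days_for_sample, min_days)
    return ceil(days / 7) * 7


# Hypothetical: 19,230 visitors per variant, two variants.
print(planned_duration_days(19_230, variants=2, daily_visitors=1_500))
print(planned_duration_days(19_230, variants=2, daily_visitors=10_000))
```

Even when traffic would fill the sample in a few days, the minimum duration still applies, because a short test can miss weekday and weekend differences.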

The fastest route to a false positive is peeking and stopping the moment significance appears. Set the sample size, the duration, and the decision criteria before launch, and hold to them. Define one primary metric and guardrail metrics that must not drop, because a variant that lifts conversion while it craters average order value is a net loss, and analyze by segment, since an overall win can hide real damage to a particular one.

Decide on the data, not the loudest opinion in the room. A 95 percent significance level means that if there were truly no difference, there would be only a 5 percent chance of seeing results at least this extreme, which is not the same as a 95 percent chance the variant wins. Capture the learning whether a test wins or loses, because CRO compounds across rounds: the program that records its losses stops re-testing the same hypothesis three times.

Pairs with these platforms

The A/B testing tools that run the experiments.

CRO needs an A/B testing tool to split traffic and measure variants. These platforms run the tests: Optimizely and VWO as dedicated experimentation suites, PostHog and GrowthBook for product-led testing, and Statsig for experimentation at scale.

Reference files

The reference that goes alongside the SKILL.md.

references/hypothesis-library.md
Common high-impact hypothesis patterns organized by funnel stage.


Bridges to other skills

What makes CRO possible, and what it tests.

CRO sits on a measurement foundation and tests changes the writing and form skills produce.

The foundation: analytics-strategy
Sets up the tracking and funnel data CRO depends on. Without the events and dashboards in place, there is nothing to audit or measure a test against.

Writes the variant: landing-page-copy
Writing a page from scratch, with hero, value proposition, and objection handling, is its job. CRO tests changes to a page that already exists.

Strategy-level why: ux-research
When the question is messaging or strategy rather than a tweak, qualitative research answers it first. CRO cannot optimize a fundamentally wrong direction.

The form in the funnel: form-strategy
Form analytics often surface the abandonment a CRO audit finds. That skill redesigns the form; CRO measures whether the redesign converts.

Open source under MIT

Read the SKILL.md on GitHub.

The skill source lives in the rampstackco/claude-skills repository alongside dozens of other skills covering the full lifecycle of brand and product work. This page is a structured overview; the SKILL.md is the source. MIT licensed.

