Flagship Skill · Data warehouse experimentation
The data warehouse experimentation skill.
Run experiments out of the warehouse: SQL assignment, dbt metrics, CUPED variance reduction.
A senior data scientist's playbook for running experiments natively out of BigQuery, Snowflake, or any modern data warehouse, with metric definitions in dbt and statistical analysis in SQL or Python. SQL-based assignment, exposure logging discipline, variance reduction with CUPED, sequential testing, and the operational tradeoffs vs platforms like Statsig and Optimizely.
Audience: data scientists, experimentation analysts, and data-savvy PMs at companies with mature data infrastructure who run experiments out of the warehouse instead of (or alongside) a dedicated experiment platform.
What this skill is for
The operational execution model when the warehouse IS the experiment platform.
Most companies that run experiments at scale use a dedicated platform: Statsig, Optimizely, LaunchDarkly with experimentation, PostHog, or Amplitude Experiment. These platforms are good; they handle assignment, instrumentation, and analysis in one product.
There is a different operational model that mature data teams increasingly choose: warehouse-native experimentation. Assignment happens in code or via feature flags. Exposure events fire to the warehouse like any other event. Metrics are defined as dbt models. Statistical analysis runs as SQL or in a Python notebook. The experiment platform is just the existing data stack.
This skill covers the operational execution: assignment patterns, exposure logging discipline, metric definitions in dbt, t-tests and CUPED in SQL or Python, sequential testing, and the pitfalls that take down homegrown setups. It composes with the methodology and interpretation skills (linked below); it does not replace them.
How this skill composes
Five experimentation skills, each with a distinct role.
The whole value of this skill comes from how it composes with the existing experimentation suite. Each skill below has a distinct role; together they cover the full experimentation lifecycle.
Methodology
experiment-design
The tool-agnostic methodology for designing experiments: hypotheses, sample size, randomization unit, primary metric, what NOT to test. Read it before designing the experiment, regardless of where it runs.
Interpretation
experimentation-analytics
The tool-agnostic interpretation discipline: confidence intervals, p-values, multiple testing, sequential testing, CUPED, ratio metrics with the delta method, dashboard reconciliation. Read it when results land.
Decision
experimentation-platform-orchestrator
The platform-vs-warehouse decision in detail. Read it first if you are still deciding. This skill assumes you have already decided warehouse-native is the right call.
Assignment infrastructure
feature-flagging
The flag-management discipline this skill assumes: flag types, naming, lifecycle, targeting, rollout, stale flag cleanup, governance. Read it for the assignment-infrastructure context that warehouse-native experimentation builds on.
Operational execution (this skill)
data-warehouse-experimentation
The operational execution model when no platform is involved. SQL assignment, exposure logging, dbt metric definitions, statistical analysis in SQL or Python, variance reduction with CUPED, sequential testing.
The distinction. The other skills tell you what to do (methodology, interpretation, decision, infrastructure). This skill tells you how to execute when no vendor platform is doing the work for you.
What is in the skill
Thirteen sections covered in the body.
The SKILL.md spans the full operational lifecycle from the platform-vs-warehouse decision through assignment, exposure logging, dbt metric definitions, statistical analysis, variance reduction, and sequential testing.
01
When warehouse-native is the right call
Six factors push toward warehouse-native: cost at volume, custom metrics, custom segmentation, trust requirements, existing data team strength, iteration speed on metric definitions. Five factors push toward a platform: frontend visual experiments, sub-week velocity, weak data infrastructure, mobile, sequential testing out of the box.
02
The architecture
Four components: assignment (hash, flag, or table), exposure logging (single discrete event), metric definitions (dbt models), analysis (SQL or Python). The exposure event pattern is critical for distinguishing exposed users from assigned-but-unexposed.
03
Assignment patterns
Deterministic hash assignment (default), feature flag assignment (when flag service is source of truth), randomized assignment table (compliance use case). Hash with a unique salt per experiment to prevent correlation across experiments.
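To make the default concrete, here is a minimal Python sketch of deterministic hash assignment (function and variable names are illustrative; the references ship the equivalent SQL templates for BigQuery, Snowflake, and Postgres):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   variants: tuple = ("control", "treatment")) -> str:
    # Hash the user_id together with a per-experiment salt so assignment is
    # deterministic within an experiment but uncorrelated across experiments.
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return variants[min(int(bucket * len(variants)), len(variants) - 1)]
```

The same user always lands in the same variant for a given salt; changing the salt reshuffles everyone, which is exactly why each experiment needs its own.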
04
The exposure log
The single most important discipline. Fire exposure when the variant-specific behavior is shown, not at page load. The delayed-exposure trap dilutes the real effect. Single-fire per user per experiment via client-side cache or server-side state.
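A minimal sketch of the single-fire guard, assuming a hypothetical `log_event` pipeline; in production the deduplication would live in a client-side cache or server-side state rather than an in-process set:

```python
from typing import Any

_fired: set[tuple[str, str]] = set()

def log_event(name: str, properties: dict[str, Any]) -> None:
    print(name, properties)  # stand-in for the team's real event pipeline

def log_exposure(user_id: str, experiment_id: str, variant: str) -> None:
    # Fire at most once per user per experiment, at the moment the
    # variant-specific behavior is shown -- never at page load.
    key = (user_id, experiment_id)
    if key in _fired:
        return
    _fired.add(key)
    log_event("experiment_exposure", {
        "user_id": user_id,
        "experiment_id": experiment_id,
        "variant": variant,
    })
```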
05
Metric definitions in dbt models
Same dbt model feeds the board dashboard and the experiment analysis. The exp_metrics namespace pattern. Versioning when semantics change. Aligned definitions prevent the experiment-vs-board disagreement problem.
06
Statistical analysis in SQL
Welch's t-test in SQL produces the t-statistic and standard error; convert to p-value via UDF or in Python. Fine for simple t-tests on continuous metrics. Switch to Python for anything more complex.
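Where the SQL stops at the t-statistic, the conversion is a few lines of Python; a sketch, assuming the query also returns per-group counts and sample variances:

```python
from scipy import stats

def welch_p_value(t_stat: float, n_a: int, var_a: float,
                  n_b: int, var_b: float) -> float:
    # Welch-Satterthwaite degrees of freedom from group sizes and variances.
    va, vb = var_a / n_a, var_b / n_b
    df = (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))
    return 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value
```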
07
Statistical analysis in Python
scipy and statsmodels for the full statistical ecosystem. One notebook per experiment, parameterized by experiment_id, version-controlled. The notebook produces a written-up decision that archives in a queryable repository.
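The analytical core of such a notebook is small. A sketch, assuming a frame with one row per exposed user and illustrative column names `variant` and `metric_value`:

```python
import pandas as pd
from scipy import stats

def primary_analysis(df: pd.DataFrame) -> dict:
    control = df.loc[df["variant"] == "control", "metric_value"]
    treatment = df.loc[df["variant"] == "treatment", "metric_value"]
    # Welch's t-test: no equal-variance assumption.
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    return {
        "lift": treatment.mean() - control.mean(),
        "t_stat": t_stat,
        "p_value": p_value,
    }
```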
08
Variance reduction: CUPED and beyond
CUPED reduces variance 30 to 50 percent on engagement metrics by subtracting the component of each user's metric that is predictable from pre-experiment behavior. Stratification, regression adjustment, and doubly robust estimation cover the edge cases. CUPED first; the rest later.
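The adjustment itself is short. A sketch, assuming `y` holds each user's in-experiment metric and `x_pre` the same user's pre-experiment value of a correlated covariate:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    # theta is the OLS slope of the metric on the pre-experiment covariate,
    # estimated on the pooled sample so it is independent of assignment.
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    # Subtract the predictable component; the mean (and the treatment
    # effect) is unchanged, only the variance shrinks.
    return y - theta * (x_pre - x_pre.mean())
```

Run the t-test on the adjusted values instead of the raw ones.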
09
Pre-experiment power analysis
Solve for sample size given MDE, or for MDE given sample size. Most underpowered experiments come from optimistic effect-size assumptions. Calibrate against the historical distribution of past observed effects, not against wished-for effects.
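With statsmodels, both directions are one call; the numbers below are placeholders, and the effect size is standardized (MDE in metric units divided by the metric's standard deviation):

```python
from statsmodels.stats.power import tt_ind_solve_power

# Solve for per-arm sample size given a standardized MDE of 0.05.
n_per_arm = tt_ind_solve_power(effect_size=0.05, alpha=0.05, power=0.8)

# Or solve for the detectable effect given 50,000 users per arm.
mde = tt_ind_solve_power(nobs1=50_000, alpha=0.05, power=0.8)
```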
10
Sequential testing patterns
mSPRT, confidence sequences, group sequential designs. Each requires implementation discipline. The honest middle ground: if you do not understand the math, do not peek. Move to sequential testing when the team has both expertise and operational pressure.
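For orientation, a sketch of the mSPRT always-valid p-value for a normal mean difference, following Johari, Pekelis, and Walsh; `tau` is a mixing-prior scale you have to tune, and the implementation notes live in the sequential-testing reference:

```python
import math

def msprt_p_value(mean_diff: float, se: float,
                  tau: float = 0.1, prev_p: float = 1.0) -> float:
    # Likelihood ratio of the observed difference under a N(0, tau^2)
    # mixture over alternatives versus the point null of zero effect.
    v = se ** 2
    lr = math.sqrt(v / (v + tau ** 2)) * math.exp(
        mean_diff ** 2 * tau ** 2 / (2 * v * (v + tau ** 2)))
    # Always-valid p-value: monotone non-increasing across peeks, so
    # checking it at every refresh does not inflate false positives.
    return min(prev_p, 1.0 / lr)
```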
11
Common pitfalls
Eleven failure patterns: page-load exposure, salt collision, p-hacked 0.04, suspect 30 percent lift, sample ratio mismatch, non-deterministic queries, board-vs-experiment metric drift, never-finished experiment, segment effect overclaiming, Simpson's paradox, statistical significance with tiny effect.
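Sample ratio mismatch in particular is cheap to catch. A sketch of the standard chi-square check; the strict alpha is a common convention, not a rule:

```python
from scipy import stats

def srm_detected(n_control: int, n_treatment: int,
                 expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    # Goodness-of-fit test of observed counts against the planned split.
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p < alpha  # True: investigate the assignment pipeline first
```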
12
The framework: 12 considerations
Platform vs warehouse decision, assignment unit, salt, exposure logging, SRM check, metric reuse, power analysis, variance reduction, statistical method, sequential testing, multiple comparisons, decision documentation. The output: a pre-registered experiment record.
13
The build-vs-buy decision is real
Warehouse-native is powerful but expensive in engineering time. New teams: platform. Mature teams with strong data infrastructure: hybrid or warehouse-native primary. The decision is not permanent; revisit annually. The platform fee that looks expensive is often cheaper than three months of a data engineer's time.
Reference files
Eight references that go alongside the SKILL.md.
The references hold the decision framework, the SQL templates, the dbt model patterns, the statistical analysis templates, the CUPED math, the power analysis math, the sequential testing implementations, and the failure pattern catalog. Each is a self-contained doc the team can lift into a project without reading the rest.
references/warehouse-vs-platform-decision.md
Side-by-side comparison: platform vs warehouse-native. Cost considerations at 10K, 100K, 1M, 10M+ MAU. Three hybrid patterns. Migration patterns in both directions. Decision tree. When the math is wrong.
references/assignment-and-exposure-patterns.md
Hash assignment SQL templates for BigQuery, Snowflake, Postgres. Three-arm split and sub-sampling templates. Salt naming conventions. Exposure event schema. The delayed-exposure trap with a worked example. SRM check.
references/metric-definitions-in-dbt.md
The exp_metrics namespace pattern. Four metric model patterns: revenue, engagement, retention, ratio. Versioning. Aligning experiment metrics with board metrics. dbt tests. Common metric-definition mistakes.
references/statistical-analysis-templates.md
SQL and Python templates for Welch's t-test, proportions test (Wilson interval), Mann-Whitney U test, bootstrap confidence interval. The recommended notebook structure: parameters, SRM check, data pull, primary analysis, secondary metrics, written-up decision.
references/variance-reduction-techniques.md
CUPED math and Python implementation. When CUPED works well vs poorly. Pre-experiment window length guidance. Stratification. Regression adjustment via OLS. Doubly robust estimation primer with academic references.
references/power-analysis-calculations.md
Solving for sample size given MDE, and for MDE given sample size. Three patterns for calibrating effect-size assumptions from historical experiments. The cost of underpowered experiments. What to do when the math says you cannot run the experiment, with the three escalation paths.
references/sequential-testing-patterns.md
mSPRT explanation and Python implementation sketch. Confidence sequences (Always-Valid Inference). Group sequential designs with O'Brien-Fleming boundaries. Three scenarios for when to peek and when not to. The honest recommendation.
references/common-pitfalls.md
Eleven failure patterns with name, symptom, root cause, fix, prevention. Page-load exposure, lift in control, p = 0.04 ship, suspect 30 percent lift, SRM, non-deterministic queries, board-vs-experiment drift, runs forever, iOS 3x lift, Simpson's paradox, significance with tiny effect.
Pairs with these platforms
Four data and analytics platforms in /integrations.
The skill is platform-agnostic for the discipline. Open the integration microsite for the platform you are running on to get the platform-specific MCP commands and SQL features. BigQuery and Snowflake host the data. dbt holds the metric definitions. Hex hosts the reproducible analysis notebooks.
Mixpanel is intentionally omitted from this skill's pairings. It is an event analytics platform, not warehouse-native infrastructure. Many teams use both warehouse-native experimentation and Mixpanel for product analytics; the two are complementary, not competitive. See the product-analytics-setup skill for the Mixpanel pairing.
Data teams, PMs, and analysts running warehouse-native experimentation and analytics
BigQuery
Google's official managed MCP for BigQuery
Data teams running enterprise warehousing and Cortex AI workloads
Snowflake
Snowflake-managed MCP with Cortex AI tools
Analytics engineers and data teams transforming warehouse data
dbt
Official dbt MCP for project metadata, lineage, and Semantic Layer
Data teams and PMs running exploratory analytics in collaborative notebooks
Hex
Hex's official MCP for projects, threads, and data discovery
Where this skill fits in the track
The second skill in the PM gap-closing track.
The PM gap-closing track covers three skills focused on the data and experimentation work product teams actually need to ship. data-warehouse-experimentation is the second; the third follows in a subsequent dispatch.
product-analytics-setup covers instrumentation execution: how to set up product analytics that produce trustworthy answers. Event taxonomy, property design, naming conventions, schema versioning, funnel design, cohort definitions, retention measurement.
data-warehouse-experimentation (this skill) covers running experiments out of the warehouse: SQL assignment, exposure logging, dbt metric definitions, statistical analysis, CUPED variance reduction, sequential testing.
feature-launch-playbook covers the launch lifecycle: pre-launch readiness, staged rollout, monitoring, decision rules, post-launch retrospective. Its landing page lands when the SKILL.md ships.
Together the three close the PM data gap: instrument correctly, experiment rigorously, launch with a plan. The /integrations catalog covers the platform-specific tactical layer underneath each.
Open source under MIT
Read the SKILL.md on GitHub.
The skill source lives in the rampstackco/claude-skills repository alongside dozens of other skills covering the full lifecycle of brand and product work. MIT licensed.
Frequently asked questions.
- How is this different from the experimentation methodology skills?
- experiment-design covers methodology (hypotheses, sample size, randomization). experimentation-analytics covers interpretation (confidence intervals, p-values, effect size). experimentation-platform-orchestrator covers the platform-vs-warehouse decision. feature-flagging covers assignment infrastructure. data-warehouse-experimentation (this skill) covers the operational execution model when no platform is involved: SQL assignment, exposure logging, dbt metrics, statistical analysis. The other skills tell you what to do; this one tells you how to do it without a vendor platform.
- When should a team use warehouse-native instead of a platform?
- When the team has strong data infrastructure (warehouse, dbt, data scientists), needs custom metrics the platform cannot express, runs experiments at high volume where the platform bill becomes a budget item, or operates in a regulated industry that needs full transparency into the math. New teams should almost always start on a platform; the warehouse-native infrastructure investment does not pay back at low volume. Read the experimentation-platform-orchestrator skill for the full decision framework.
- What is CUPED and why does the skill emphasize it?
- CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance in the metric by removing the component of each user's outcome that is predictable from their pre-experiment behavior. It typically reduces variance 30 to 50 percent on engagement metrics, equivalent to running the experiment roughly 1.5 to 2x longer for the same statistical power. Worth the engineering investment for any team running 5+ experiments per quarter. The skill covers the math, the Python implementation, and when CUPED works well vs poorly.
- Why does the skill emphasize exposure logging discipline?
- The single most common warehouse-native experimentation bug. Firing exposure at page load instead of when the variant-specific UI renders means users who never saw the variant are still counted as exposed. The control and treatment groups both include non-exposed users; the analysis dilutes the real effect. The skill covers the right firing moment, single-fire enforcement, and the SRM check that catches assignment bugs early.
- How does this pair with the data and analytics integrations?
- The skill is platform-agnostic for the discipline. The /integrations pages cover the platform-specific MCP commands and SQL features. Open the skill for the strategic decisions; open the integration page for the platform-specific tactics: BigQuery's GoogleSQL syntax, Snowflake's Cortex Analyst for natural-language metrics, dbt's Semantic Layer for shared metric definitions, Hex's notebook integration for reproducible analysis. Mixpanel is intentionally omitted from this skill's pairings: it is an event analytics platform, not warehouse-native infrastructure.
- What about sequential testing and peeking?
- Looking at experiment results before completion inflates the false-positive rate. mSPRT, confidence sequences, and group sequential designs all allow valid early stopping but require careful implementation. The honest framing in the skill: if you do not understand the math, do not peek. Pre-register the sample size, run to completion, analyze once. Move to sequential testing only when the team has the statistical expertise to implement correctly and the operational pressure to peek.