@alirezarezvani · Research

name: experiment-designer
description: Use when planning product experiments, writing testable hypotheses, estimating sample size, prioritizing tests, or interpreting A/B outcomes with practical statistical rigor.

Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for:

A/B and multivariate experiment planning
Hypothesis writing and success criteria definition
Sample size and minimum detectable effect planning
Experiment prioritization with ICE scoring
Reading statistical output for product decisions

Core Workflow

Write hypothesis in If/Then/Because format

If we change [intervention]
Then [metric] will change by [expected direction/magnitude]
Because [behavioral mechanism]

Define metrics before running test

Primary metric: single decision metric
Guardrail metrics: quality/risk protection
Secondary metrics: diagnostics only

Estimate sample size

Baseline conversion or baseline mean
Minimum detectable effect (MDE)
Significance level (alpha) and power

Use:

python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute

Prioritize experiments with ICE

Impact: potential upside
Confidence: evidence quality
Ease: cost/speed/complexity

ICE Score = (Impact * Confidence * Ease) / 10
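
As a rough illustration of the formula above, the sketch below scores and ranks a small backlog. The 1-10 rating scale, the `Idea` class, the `rank` helper, and the sample ideas are all assumptions made for this example; they are not part of this skill's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    """Hypothetical backlog item; each rating is assumed to be on a 1-10 scale."""

    name: str
    impact: int
    confidence: int
    ease: int

    @property
    def ice(self) -> float:
        # ICE Score = (Impact * Confidence * Ease) / 10, as defined above.
        return (self.impact * self.confidence * self.ease) / 10


def rank(ideas: list[Idea]) -> list[Idea]:
    """Return ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.ice, reverse=True)


# Made-up ideas and ratings, for illustration only.
backlog = [
    Idea("Shorter signup form", impact=7, confidence=6, ease=8),
    Idea("New pricing page", impact=9, confidence=4, ease=3),
    Idea("Checkout copy tweak", impact=3, confidence=7, ease=10),
]

for idea in rank(backlog):
    print(f"{idea.name}: {idea.ice:.1f}")
```

Because the three ratings are multiplied, a low score on any one dimension pulls the whole idea down, which is why the high-impact but hard and uncertain pricing page ranks last here.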

Launch with stopping rules

Decide fixed sample size or fixed duration in advance
Avoid repeated peeking unless you use a method designed for it (e.g., sequential testing)
Monitor guardrails continuously

Interpret results

Statistical significance is not business significance
Compare point estimate + confidence interval to decision threshold
Investigate novelty effects and segment heterogeneity
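
One way to compare a point estimate and confidence interval against a decision threshold is sketched below. It assumes a two-proportion test with a simple Wald (normal-approximation) interval; the function names, the three-way decision labels, and the sample counts are hypothetical and should be adapted to your own decision rule.

```python
import math
from statistics import NormalDist


def diff_with_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    """Return (estimate, lower, upper) for rate_b - rate_a using a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se


def decide(lower: float, upper: float, threshold: float) -> str:
    """Hypothetical rule: act only when the whole interval sits on one side of the threshold."""
    if lower >= threshold:
        return "ship"
    if upper < threshold:
        return "below threshold"
    return "inconclusive"


# Made-up counts: control 1000/10000, variant 1150/10000.
diff, lower, upper = diff_with_ci(1000, 10_000, 1150, 10_000)
print(f"lift: {diff:.4f}, 95% CI: [{lower:.4f}, {upper:.4f}]")
print(decide(lower, upper, threshold=0.005))
print(decide(lower, upper, threshold=0.010))
```

With these counts the interval excludes zero, so the result is statistically significant, yet it only clears a 0.5-point business threshold. Against a 1-point threshold the same data are inconclusive, which is the gap between statistical and business significance.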

Hypothesis Quality Checklist

Contains explicit intervention and audience
Specifies measurable metric change
States plausible causal reason
Includes expected minimum effect
Defines failure condition
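
The checklist can also be applied mechanically. The sketch below is a minimal, hypothetical structure (the `Hypothesis` class and its field names are assumptions, not part of this skill) that renders the If/Then/Because format and reports which checklist items are still empty. It only checks that each item is present, not that it is plausible.

```python
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    """Hypothetical container with one field per checklist item."""

    intervention: str = ""
    audience: str = ""
    metric: str = ""
    expected_change: str = ""
    mechanism: str = ""
    minimum_effect: str = ""
    failure_condition: str = ""

    def missing(self) -> list[str]:
        """Return the names of fields that are empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Format the hypothesis in If/Then/Because form."""
        return (
            f"If we change {self.intervention} for {self.audience}\n"
            f"Then {self.metric} will change by {self.expected_change}\n"
            f"Because {self.mechanism}"
        )


# Made-up draft that has not yet defined a failure condition.
draft = Hypothesis(
    intervention="the signup form from 6 fields to 3",
    audience="new mobile visitors",
    metric="signup completion rate",
    expected_change="+1.5 points",
    mechanism="fewer fields reduce effort on small screens",
    minimum_effect="+1.0 point",
)

print(draft.render())
print("Missing:", draft.missing())
```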

Common Experiment Pitfalls

Underpowered tests leading to false negatives
Running too many simultaneous changes without isolation
Changing targeting or implementation mid-test
Stopping early on random spikes
Ignoring sample ratio mismatch and instrumentation drift
Declaring success from p-value without effect-size context
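
Sample ratio mismatch in particular is cheap to check. The sketch below assumes a two-arm test and runs a chi-square goodness-of-fit test on the observed assignment counts using only the standard library. The `srm_p_value` name, the sample counts, and the 0.001 alert threshold are assumptions for illustration.

```python
import math


def srm_p_value(n_control: int, n_variant: int, expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm traffic split."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_variant - expected_variant) ** 2 / expected_variant
    )
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up counts for an intended 50/50 split.
for control, variant in [(5050, 4950), (5200, 4800)]:
    p = srm_p_value(control, variant)
    flag = "possible SRM, investigate" if p < 0.001 else "ok"
    print(f"{control} vs {variant}: p = {p:.5f} ({flag})")
```

A very small p-value suggests the assignment or logging pipeline is not delivering the intended split, in which case the experiment's results should not be trusted until the cause is found.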

Statistical Interpretation Guardrails

A p-value below alpha indicates evidence against the null hypothesis, not proof that the effect is real.
A confidence interval that crosses zero (no effect) means the direction of the effect is uncertain.
Wide intervals imply low precision, even when the result is significant.
Use practical significance thresholds tied to business impact.

See:

references/experiment-playbook.md
references/statistics-reference.md

Tooling

`scripts/sample_size_calculator.py`

Computes required sample size (per variant and total) from:

baseline rate
MDE (absolute or relative)
significance level (alpha)
statistical power

Example:

python3 scripts/sample_size_calculator.py \
  --baseline-rate 0.10 \
  --mde 0.015 \
  --mde-type absolute \
  --alpha 0.05 \
  --power 0.8
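
For reference, the script's source listed below computes the per-group size with a normal-approximation formula that averages the two rates. Written out, with $p_1$ the baseline rate and $p_2$ the target rate (the symbol names are chosen here for illustration):

```latex
n_{\text{per group}} = \left\lceil
  \frac{2\,\bar{p}\,(1-\bar{p})\,\left(z_{1-\alpha/2} + z_{\text{power}}\right)^{2}}{\delta^{2}}
\right\rceil,
\qquad
\bar{p} = \frac{p_1 + p_2}{2},
\qquad
\delta = \lvert p_2 - p_1 \rvert

% Worked through for the example above:
% p_1 = 0.10, p_2 = 0.115, \bar{p} = 0.1075, \delta = 0.015,
% z_{0.975} \approx 1.960, z_{0.80} \approx 0.842
n_{\text{per group}} \approx \left\lceil
  \frac{2 \times 0.1075 \times 0.8925 \times (2.8016)^{2}}{0.015^{2}}
\right\rceil
= \lceil 6693.8 \rceil = 6694
```

By this hand calculation the example should report roughly 6,694 users per variant, or about 13,388 in total. Because the denominator is $\delta^2$, halving the MDE roughly quadruples the required sample.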

#!/usr/bin/env python3
"""Calculate sample size for two-proportion A/B tests."""

import argparse
import math
import statistics


def clamp_rate(value: float, name: str) -> float:
    if value <= 0 or value >= 1:
        raise ValueError(f"{name} must be between 0 and 1 (exclusive).")
    return value


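def required_sample_size_per_group(
    baseline_rate: float,
    target_rate: float,
    alpha: float,
    power: float,
) -> int:
    """Return the per-group sample size for a two-sided two-proportion test."""
    delta = abs(target_rate - baseline_rate)
    if delta <= 0:
        raise ValueError("MDE resolves to zero; target and baseline must differ.")

    # Standard normal quantiles for the two-sided significance level and the power.
    z_alpha = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = statistics.NormalDist().inv_cdf(power)
    # Average of the two rates, used as a shared-variance approximation.
    pooled = (baseline_rate + target_rate) / 2

    # n = 2 * p * (1 - p) * (z_alpha + z_beta)^2 / delta^2, rounded up.
    numerator = 2 * pooled * (1 - pooled) * (z_alpha + z_beta) ** 2
    n = numerator / (delta ** 2)
    return math.ceil(n)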


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Compute sample size for two-proportion product experiments."
    )
    parser.add_argument("--baseline-rate", type=float, required=True)
    parser.add_argument(
        "--mde",
        type=float,
        required=True,
        help="Minimum detectable effect. Absolute points when --mde-type absolute, otherwise relative uplift.",
    )
    parser.add_argument("--mde-type", choices=["absolute", "relative"], default="relative")
    parser.add_argument("--alpha", type=float, default=0.05)
    parser.add_argument("--power", type=float, default=0.8)
    parser.add_argument(
        "--daily-samples",
        type=int,
        default=0,
        help="Optional total daily samples to estimate runtime in days.",
    )
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    baseline = clamp_rate(args.baseline_rate, "baseline-rate")

    if args.mde <= 0:
        raise ValueError("mde must be > 0")
    if args.alpha <= 0 or args.alpha >= 1:
        raise ValueError("alpha must be between 0 and 1")
    if args.power <= 0 or args.power >= 1:
        raise ValueError("power must be between 0 and 1")

    if args.mde_type == "absolute":
        target = baseline + args.mde
    else:
        target = baseline * (1 + args.mde)

    target = clamp_rate(target, "target-rate")

    n_per_group = required_sample_size_per_group(
        baseline_rate=baseline,
        target_rate=target,
        alpha=args.alpha,
        power=args.power,
    )
    total_n = n_per_group * 2

    print("A/B Test Sample Size Estimate")
    print(f"baseline_rate: {baseline:.6f}")
    print(f"target_rate: {target:.6f}")
    print(f"mde_type: {args.mde_type}")
    print(f"alpha: {args.alpha}")
    print(f"power: {args.power}")
    print(f"n_per_group: {n_per_group}")
    print(f"n_total: {total_n}")

    if args.daily_samples > 0:
        days = math.ceil(total_n / args.daily_samples)
        print(f"estimated_days_at_daily_samples_{args.daily_samples}: {days}")

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
