ab-test-setup

Design and implement statistically rigorous A/B tests and experiments. Covers hypothesis formulation, sample size calculation, metric selection, traffic allocation, implementation patterns (client-side and server-side), statistical analysis, and common pitfalls. Use when planning experiments, calculating sample sizes, designing test variants, analyzing results, or when someone says "let's test that."

Install this agent skill to your Project

npx add-skill https://github.com/borghei/Claude-Skills/tree/main/product-team/ab-test-setup

Metadata

Additional technical details for this skill

tags: ab-testing, experimentation, hypothesis, statistical-significance
author: borghei
domain: experimentation
updated: 1773014400
version: 1.0.0
category: product-team
frameworks: hypothesis-testing, statistical-significance, feature-flags

SKILL.md

A/B Test Setup - Experimentation Design & Analysis

Category: Product Team
Tags: A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing

Overview

A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.


The Experiment Lifecycle

1. HYPOTHESIZE  →  2. DESIGN  →  3. CALCULATE  →  4. IMPLEMENT
       ↑                                                    │
       │                                                    ▼
   (iterate)   ←  7. DOCUMENT  ←  6. ANALYZE  ←  5. RUN (to completion)

Step 1: Hypothesis Formulation

The Hypothesis Template

Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].

We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.

Good vs Bad Hypotheses

| Quality | Hypothesis | Problem |
| --- | --- | --- |
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |

Hypothesis Sources (Where to Find Test Ideas)

| Source | What to Look For | Example |
| --- | --- | --- |
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps/session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |

Step 2: Test Design

Test Types

| Type | Variants | Traffic Need | Best For |
| --- | --- | --- | --- |
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |

Default recommendation: Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.

What to Test (By Impact)

| Category | High Impact | Medium Impact | Low Impact |
| --- | --- | --- | --- |
| Copy | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| Design | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| UX | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| Pricing | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| Social Proof | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |

Metric Selection

Every test needs three types of metrics:

Primary Metric (1 only)

  • The single metric that determines success
  • Directly tied to the hypothesis
  • Must be measurable within the test duration
  • Examples: signup rate, click-through rate, purchase rate

Secondary Metrics (2-3)

  • Explain why the primary metric moved
  • Provide context for decision-making
  • Examples: time on page, scroll depth, feature adoption rate

Guardrail Metrics (1-3)

  • Things that must NOT get worse
  • Stop the test if significantly negative
  • Examples: error rate, support ticket volume, page load time, refund rate

Step 3: Sample Size Calculation

Quick Reference Table

Minimum visitors PER VARIANT needed (95% confidence, 80% power):

| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
| --- | --- | --- | --- | --- | --- |
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
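
For baselines and lifts that are not in the table, the figures can be approximated with the standard normal-approximation formula for a two-sided two-proportion test. The sketch below is not the repository's `sample_size_calculator.py`; the function name and the choice of the unpooled-variance form of the formula are assumptions made for illustration, so its output lands in the same range as the table rather than matching it digit for digit.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant (two-sided two-proportion z-test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)             # expected variant rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)        # unpooled variance of the difference
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline, 10% relative lift (5.0% -> 5.5%)
print(sample_size_per_variant(0.05, 0.10))
```

For a 5% baseline and a 10% lift this comes out at roughly 31,000 visitors per variant, the same order of magnitude as the 29,500 listed above. Halving the detectable lift roughly quadruples the requirement, which is why small lifts on low baselines need so much traffic.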

Duration Calculation

Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page

Minimum duration: 7 days (to capture day-of-week effects)
Maximum recommended: 6 weeks (beyond this, external factors contaminate results)
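
The duration formula and the 7-day and 6-week bounds can be combined into a small helper. This is a minimal sketch; the function names are hypothetical, and it assumes all daily traffic to the page is eligible for the test.

```python
import math

MIN_DAYS = 7     # floor: capture day-of-week effects
MAX_DAYS = 42    # ceiling: six weeks


def estimate_duration_days(sample_per_variant: int, variants: int,
                           daily_traffic: int) -> int:
    """Days needed to collect the full sample, never less than one week."""
    raw_days = math.ceil(sample_per_variant * variants / daily_traffic)
    return max(raw_days, MIN_DAYS)


def is_feasible(days: int) -> bool:
    """True when the test fits inside the recommended six-week maximum."""
    return days <= MAX_DAYS


days = estimate_duration_days(29_500, variants=2, daily_traffic=5_000)
print(days, is_feasible(days))
```

With 29,500 visitors per variant, two variants, and 5,000 visitors a day, the raw figure is 59,000 / 5,000 = 11.8, which rounds up to 12 days and sits comfortably inside the six-week limit.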

What If You Don't Have Enough Traffic?

| Situation | Solution |
| --- | --- |
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |

Step 4: Implementation

Client-Side Implementation

JavaScript modifies the page after initial render.

Pros: Quick to implement, no deploy needed
Cons: Can cause flicker (flash of original content), blocked by ad blockers
Tools: PostHog, Optimizely, VWO (Google Optimize was discontinued in September 2023)

Anti-flicker pattern:

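html
<!-- Add to <head> before any rendering -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>
  var root = document.documentElement;
  root.classList.add('ab-test-hide');
  // Failsafe (assumed 2 s timeout): reveal the page even if the test script never runs
  setTimeout(function () { root.classList.remove('ab-test-hide'); }, 2000);
</script>

<!-- In your test script (runs after variant assignment): -->
<script>document.documentElement.classList.remove('ab-test-hide');</script>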

Server-Side Implementation

Variant determined before page renders. No flicker, no client-side dependency.

Pros: No flicker, not blocked by ad blockers, works for logged-in features
Cons: Requires engineering work, deploy needed
Tools: PostHog, LaunchDarkly, Split, Unleash, custom feature flags

Basic feature flag pattern:

python
import hashlib

# Server-side variant assignment
def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures same user always sees same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    # First 8 hex digits -> integer -> bucket in the range 0-99
    bucket = int(hash_value[:8], 16) % 100

    # Buckets 0-49 get control, 50-99 get the variant (50/50 split)
    return "control" if bucket < 50 else "variant"

Traffic Allocation

| Strategy | Split | When to Use |
| --- | --- | --- |
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |

Critical rules:

  • Users must see the same variant on every visit (sticky assignment by user ID or cookie)
  • Allocation must be balanced across time of day and day of week
  • Never change allocation mid-test
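
The sticky-assignment rule and the non-50/50 splits from the table can be handled by one hash-based function. This is a sketch under assumptions: the function name and the `control_share` parameter are hypothetical, and SHA-256 is used here as one reasonable choice of hash, not as a requirement.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, control_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'variant' for one experiment."""
    # Including the experiment name keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # First 32 bits of the hash, scaled to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 0x100000000
    return "control" if fraction < control_share else "variant"


# Same inputs always give the same answer, so assignment is sticky across visits.
print(assign_variant("user-123", "cta-size"))
print(assign_variant("user-123", "cta-size", control_share=0.9))
```

Because the assignment depends only on the user ID and the experiment name, it needs no stored state and does not drift with time of day. Changing `control_share` while a test is running would move some users between variants, which is one concrete reason the allocation should stay fixed.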

Step 5: Running the Test

Pre-Launch Checklist

  • Hypothesis documented with primary metric and minimum detectable effect
  • Sample size calculated, expected duration estimated
  • Both variants implemented and QA'd on all device types
  • Tracking verified (events fire correctly for both variants)
  • No other tests running on the same page/feature
  • Stakeholders informed of test duration and "no peeking" rule
  • External factor calendar checked (no major launches, holidays, press)

During the Test

DO:

  • Monitor for technical errors (variant not rendering, tracking broken)
  • Check that traffic split is balanced daily
  • Document any external events that might affect results

DO NOT:

  • Look at results before reaching sample size ("peeking problem")
  • Make changes to either variant
  • Add traffic from new sources mid-test
  • Stop the test early because one variant "looks like it's winning"
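
Checking that the traffic split is balanced is one monitoring step that does not involve looking at conversion results. A common way to do it is a chi-square goodness-of-fit test on the visitor counts, often called a sample ratio mismatch (SRM) check. The sketch below uses only the standard library; the function name is hypothetical, and the 0.001 alert threshold in the comment is a common convention assumed here, not something this skill specifies.

```python
import math


def srm_p_value(control_n: int, variant_n: int,
                expected_control_share: float = 0.5) -> float:
    """p-value for 'observed split matches the planned split' (chi-square, 1 df)."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


p = srm_p_value(control_n=5_000, variant_n=5_500)
# Assumed convention: a p-value below 0.001 suggests broken assignment or tracking.
print(p, "investigate" if p < 0.001 else "ok")
```

A very small p-value here points to a technical problem such as a redirect dropping users or a tracking event firing in only one variant, and the test results should not be trusted until it is fixed.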

The Peeking Problem (Critical)

Checking results before reaching the planned sample size and stopping as soon as one variant looks better can push the false positive rate to 25-40% (vs the intended 5%). The inflation grows with the number of checks.

Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.

Solutions:

  1. Pre-commit to sample size and do not check results until reached
  2. If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
  3. Set calendar reminder for expected completion date -- that is when you look
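
The effect of peeking can be demonstrated with a small simulation of A/A tests, where both arms share the same true conversion rate and every "significant" result is a false positive. This is an illustrative sketch: the function names, the default of 10 evenly spaced looks, and the other parameters are assumptions, and the exact rates depend on them and on the random seed.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a: int, conv_b: int, n: int) -> float:
    """Two-sided p-value for two arms of equal size n (pooled two-proportion z-test)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rates(experiments=400, n_per_arm=1000, looks=10,
                                  rate=0.10, alpha=0.05, seed=1):
    """Return (rate when stopping at any significant look, rate at final look only)."""
    rng = random.Random(seed)
    chunk = n_per_arm // looks
    any_look_hits = final_look_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        significant_at_some_look = False
        for look in range(1, looks + 1):
            # Both arms draw from the same rate: there is no real effect to find.
            conv_a += sum(rng.random() < rate for _ in range(chunk))
            conv_b += sum(rng.random() < rate for _ in range(chunk))
            p = z_test_p_value(conv_a, conv_b, look * chunk)
            significant_at_some_look = significant_at_some_look or p < alpha
        any_look_hits += significant_at_some_look
        final_look_hits += p < alpha   # p from the last look only
    return any_look_hits / experiments, final_look_hits / experiments
```

Calling `simulate_false_positive_rates()` returns two rates: the first counts an experiment as a false positive if any of the interim looks crossed the threshold, the second only if the final look did. The second should sit near the nominal 5%, while the first is noticeably higher, and raising `looks` pushes it higher still.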

Step 6: Analysis

Analysis Checklist

  1. Did we reach planned sample size? If not, results are preliminary only.
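  2. Is it statistically significant? p < 0.05 means that, if there were no real difference, a result at least this extreme would show up less than 5% of the time. It is not a 95% probability that the difference is real.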
  3. What's the confidence interval? Tells you the range of likely true effect.
  4. Is the effect size meaningful? A 0.1% lift that's "significant" may not be worth implementing.
  5. Are secondary metrics consistent? Do they support the primary result?
  6. Any guardrail violations? Did anything get worse?
  7. Segment analysis: Different results for mobile vs desktop? New vs returning?
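
Items 2 to 4 of the checklist (significance, confidence interval, effect size) can be computed together with a two-proportion z-test. The sketch below is not the repository's `experiment_analyzer.py`; the function name, the returned keys, and the example counts are hypothetical. It assumes a non-zero control conversion rate and samples large enough for the normal approximation.

```python
import math
from statistics import NormalDist


def analyze(control_n: int, control_conv: int,
            variant_n: int, variant_conv: int, alpha: float = 0.05) -> dict:
    """Two-sided two-proportion z-test plus a CI for the absolute difference."""
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    diff = p_v - p_c

    # Pooled standard error for the hypothesis test (assumes no difference under H0).
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))

    # Unpooled standard error for the confidence interval on the difference.
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se

    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift": diff / p_c,
        "p_value": p_value,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
        "significant": p_value < alpha,
    }


# Hypothetical counts: 500/10,000 conversions in control, 580/10,000 in the variant.
print(analyze(10_000, 500, 10_000, 580))
```

Reporting `ci_low` and `ci_high` next to the p-value makes the effect-size question concrete: an interval whose lower end is barely above zero is statistically significant but may not justify the implementation cost.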

Interpreting Results

| Result | Primary Metric | Confidence | Action |
| --- | --- | --- | --- |
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |

Common Analysis Mistakes

| Mistake | Consequence | Prevention |
| --- | --- | --- |
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed flow | Include all users from assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
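
The Bonferroni correction listed for A/B/n tests divides the significance level by the number of comparisons. A minimal sketch, assuming each variant is compared only against control, so a test with k variants including control makes k - 1 comparisons (the function name is hypothetical):

```python
def bonferroni_alpha(alpha: float, variants: int) -> float:
    """Per-comparison significance level when each variant is tested against control."""
    comparisons = variants - 1          # control is not compared with itself
    if comparisons < 1:
        raise ValueError("need at least control plus one variant")
    return alpha / comparisons


# Control plus two variants: each comparison must reach p < 0.025.
print(bonferroni_alpha(0.05, variants=3))
```

The stricter threshold also raises the required sample size per variant, which is part of why A/B/n tests need more traffic than a plain A/B test.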

Step 7: Documentation

Every test must be documented, regardless of outcome.

Test Documentation Template

EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]

HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].

VARIANTS:
- Control: [description]
- Variant: [description + screenshot]

METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]

RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]

DECISION: [Ship variant / Keep control / Iterate]

LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]

Experiment Prioritization Framework

ICE Scoring

| Factor | Question | What Scores a 10 (scale 1-10) |
| --- | --- | --- |
| Impact | How much will this move the metric? | Big change to primary KPI |
| Confidence | How sure are we it will work? | Strong data supporting hypothesis |
| Ease | How easy is it to implement and measure? | Can ship in a day |

ICE Score = (Impact + Confidence + Ease) / 3

Rank all test ideas by ICE score. Run highest first.
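
The scoring and ranking step is simple enough to automate for a backlog. In the sketch below the `Idea` class, the function names, and the individual factor scores are hypothetical; the scores were chosen only so that the averages come out at 8.3, 7.0, and 6.7.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    hypothesis: str
    impact: int        # 1-10
    confidence: int    # 1-10
    ease: int          # 1-10


def ice_score(idea: Idea) -> float:
    """Average of the three factors, rounded to one decimal place."""
    for value in (idea.impact, idea.confidence, idea.ease):
        if not 1 <= value <= 10:
            raise ValueError("each factor must be between 1 and 10")
    return round((idea.impact + idea.confidence + idea.ease) / 3, 1)


def rank(ideas: list[Idea]) -> list[Idea]:
    """Highest ICE score first."""
    return sorted(ideas, key=ice_score, reverse=True)


backlog = [
    Idea("Shorter onboarding increases activation", impact=8, confidence=6, ease=6),
    Idea("Larger CTA increases signups", impact=9, confidence=8, ease=8),
    Idea("Social proof on pricing increases conversion", impact=7, confidence=7, ease=7),
]
for idea in rank(backlog):
    print(ice_score(idea), idea.hypothesis)
```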

Test Backlog Template

| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |

Proactive Triggers

  • Someone debates between two design options: propose an A/B test instead of opinionating
  • Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
  • Pricing page changes discussed: always test pricing changes with guardrail metrics
  • Post-launch of any feature: propose follow-up experiment to optimize
  • "Let's just try it and see": redirect to structured hypothesis before implementation

Related Skills

| Skill | Use When |
| --- | --- |
| analytics-tracking | Setting up event tracking that feeds experiment metrics |
| campaign-analytics | Folding experiment results into broader attribution |
| launch-strategy | Testing within a product launch sequence |
| prompt-engineer-toolkit | A/B testing AI prompts in production |

Tool Reference

sample_size_calculator.py

Calculates required sample size per variant using the normal approximation to the two-proportion z-test. Includes Bonferroni correction for multi-variant tests and duration estimation.

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --baseline, -b | float | (required) | Baseline conversion rate (e.g. 0.05 for 5%) |
| --mde, -m | float | (required) | Minimum detectable effect as relative lift (e.g. 0.10 for 10%) |
| --alpha, -a | float | 0.05 | Significance level |
| --power, -p | float | 0.80 | Statistical power |
| --variants, -v | int | 2 | Number of variants including control |
| --daily-traffic, -d | int | 0 | Daily eligible traffic for duration estimation |
| --one-tailed | flag | False | Use one-tailed test instead of two-tailed |
| --json | flag | False | Output as JSON |

bash
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10
python scripts/sample_size_calculator.py --baseline 0.12 --mde 0.15 --power 0.9 --daily-traffic 5000
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10 --variants 3 --json

experiment_analyzer.py

Analyzes A/B test results using the two-proportion z-test with confidence intervals and segment breakdown.

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| input | positional | (required) | CSV file with results or "sample" to create sample |
| --alpha, -a | float | 0.05 | Significance level |
| --json | flag | False | Output as JSON |

CSV format: variant,visitors,conversions,segment

bash
python scripts/experiment_analyzer.py sample
python scripts/experiment_analyzer.py results.csv
python scripts/experiment_analyzer.py results.csv --alpha 0.01 --json

experiment_planner.py

Generates a structured experiment plan from a hypothesis text, including metric selection, sample size, timeline, risks, and documentation template.

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --hypothesis, -H | string | (required) | Experiment hypothesis text |
| --baseline, -b | float | 0.05 | Baseline conversion rate |
| --mde, -m | float | 0.10 | Minimum detectable effect as relative lift |
| --daily-traffic, -d | int | 0 | Daily eligible traffic |
| --variants, -v | int | 2 | Number of variants including control |
| --json | flag | False | Output as JSON |

bash
python scripts/experiment_planner.py --hypothesis "Larger CTA will increase signups by 15%"
python scripts/experiment_planner.py -H "Simplified checkout boosts conversions" -b 0.08 -m 0.15 -d 3000
python scripts/experiment_planner.py -H "New pricing page" --json

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| Sample size is unrealistically large | MDE too small or baseline too low | Increase MDE (test bolder changes) or target a higher-traffic page |
| Test duration exceeds 6 weeks | Insufficient daily traffic | Consider qualitative methods, test bigger changes, or combine traffic from multiple pages |
| p-value hovers around 0.05 | Borderline significance | Do not stop early; run to planned sample size or extend 20% |
| Results significant but lift is tiny (<1%) | Overpowered test | Check practical significance alongside statistical significance |
| Segment results contradict overall | Simpson's paradox | Investigate segment composition; report both overall and segment results |
| Variant performs differently on mobile vs desktop | Device-specific UX issues | Design device-specific variants; increase per-segment sample size |
| Calculator produces negative CI | Very small samples or extreme rates | Ensure sufficient sample size; check data integrity |

Success Criteria

| Criterion | Target | How to Measure |
| --- | --- | --- |
| Tests reach planned sample size | 100% of tests | Compare actual vs planned sample at conclusion |
| False positive rate | <5% | Track post-implementation lift vs test prediction |
| Test velocity | 2+ tests per team per month | Count experiments documented per sprint |
| Documentation completeness | 100% of tests documented | Audit experiment records quarterly |
| Average test duration | <4 weeks | Measure start-to-conclusion calendar days |
| Decision quality | >80% of shipped variants hold gains at 90 days | Post-ship metric tracking |

Scope & Limitations

In scope:

  • Hypothesis formulation and validation
  • Sample size and power calculations
  • Frequentist two-proportion z-tests
  • A/B, A/B/n, and split URL test planning
  • Segment-level analysis
  • Pre/post test documentation

Out of scope:

  • Bayesian A/B testing methods (use dedicated Bayesian tools)
  • Multi-armed bandit algorithms (require real-time allocation infrastructure)
  • Multivariate testing (MVT) analysis (combinatorial explosion requires specialized tools)
  • Server-side feature flag implementation (see engineering skills)
  • Revenue-based metrics requiring transaction-level data
  • Sequential testing / always-valid p-values (use Optimizely Stats Engine or similar)

Integration Points

| Tool / Platform | Integration Method | Use Case |
| --- | --- | --- |
| PostHog / Amplitude | JSON export from experiment_analyzer | Feed results into product analytics |
| Jira / Linear | experiment_planner JSON output | Create experiment tickets with metadata |
| Google Sheets | CSV export from experiment_analyzer | Share results with non-technical stakeholders |
| LaunchDarkly / Unleash | experiment_planner checklist | Pre-launch validation before feature flag rollout |
| Slack / Notion | Copy human-readable output | Async experiment status updates |
| CI/CD pipelines | --json flag on all scripts | Automated experiment health checks |
