Running an A/B test without calculating sample size first is like measuring with a broken ruler. You'll get a number, but it won't mean anything. Most deep link A/B tests require more traffic than teams expect, and stopping early leads to false conclusions. This guide explains how to calculate the right sample size, what affects it, and how to work around low-traffic situations.
For understanding statistical significance, see Statistical Significance for A/B Tests: What It Means. For measuring A/B test results, see Measuring A/B Test Results for Deep Link Campaigns.
Figure: The A/B tests list page, showing test names, status, types, and variant counts.
The Formula
Standard Sample Size Calculation
For a two-proportion z-test (comparing conversion rates between two variants):
function calculateSampleSize(baselineRate, minimumDetectableEffect, significance, power) {
  // Defaults: alpha = 0.05 (95% confidence) and 80% power (beta = 0.20).
  // `significance` is the alpha level, so pass 0.05, not 0.95.
  const alpha = significance || 0.05;
  const targetPower = power || 0.80;
  // Z-scores for a two-sided test
  const zAlpha = getZScore(1 - alpha / 2); // 1.96 for alpha = 0.05
  const zBeta = getZScore(targetPower); // 0.842 for 80% power
  // The MDE is relative: a 10% MDE on a 5% baseline means detecting 5.5%
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minimumDetectableEffect);
  const pAvg = (p1 + p2) / 2;
  const numerator = Math.pow(
    zAlpha * Math.sqrt(2 * pAvg * (1 - pAvg)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  const denominator = Math.pow(p2 - p1, 2);
  // Required visitors per variant
  return Math.ceil(numerator / denominator);
}
function getZScore(percentile) {
  // Lookup table for common percentiles of the standard normal distribution
  const table = [
    [0.80, 0.842],
    [0.90, 1.282],
    [0.95, 1.645],
    [0.975, 1.96],
    [0.995, 2.576],
  ];
  // Compare with a tolerance because inputs such as 1 - 0.05 / 2 are floating point
  const match = table.find(([p]) => Math.abs(p - percentile) < 1e-9);
  if (!match) {
    // Fail loudly instead of silently falling back to 1.96
    throw new RangeError(`No z-score for percentile ${percentile}; use a statistics library`);
  }
  return match[1];
}
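The lookup above covers only a handful of confidence and power levels. As one possible way to handle other percentiles without a statistics library, the sketch below approximates the standard normal CDF with the Abramowitz and Stegun erf formula (7.1.26) and inverts it by bisection. The names `normalCdf` and `inverseNormalCdf` are illustrative, not part of any library, and the result is an approximation that should be good to a few decimal places for typical confidence levels.

```javascript
// Standard normal CDF via the Abramowitz-Stegun erf approximation (formula 7.1.26).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Invert the CDF by bisection: find z such that normalCdf(z) equals the percentile.
function inverseNormalCdf(percentile) {
  if (!(percentile > 0 && percentile < 1)) {
    throw new RangeError('percentile must be strictly between 0 and 1');
  }
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < percentile) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2;
}

console.log(inverseNormalCdf(0.975).toFixed(2)); // 1.96
console.log(inverseNormalCdf(0.80).toFixed(2)); // 0.84
```

Bisection is slower than a closed-form rational approximation, but it keeps the code short and easy to check. For production use, a maintained statistics library is the safer choice.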
What the Variables Mean
| Variable | What It Is | Typical Value |
|---|---|---|
| Baseline rate | Your current conversion rate | Varies (e.g., 5% CTR) |
| Minimum detectable effect (MDE) | The smallest improvement worth detecting | 10-20% relative lift |
| Significance level (alpha) | Probability of a false positive | 0.05 (95% confidence) |
| Power (1 – beta) | Probability of detecting a real effect | 0.80 (80% power) |
Sample Size Reference Table
Per Variant (Not Total)
These are the samples needed per variant, not in total. For a 2-variant test, double the number. Values come from the formula above at 95% confidence and 80% power, with the MDE expressed as a relative lift.
| Baseline Rate | 5% MDE | 10% MDE | 15% MDE | 20% MDE | 30% MDE |
|---|---|---|---|---|---|
| 1% | 637K | 163K | 74K | 43K | 20K |
| 2% | 315K | 81K | 37K | 21K | 9.8K |
| 5% | 122K | 31K | 14K | 8.2K | 3.8K |
| 10% | 58K | 15K | 6.7K | 3.8K | 1.8K |
| 20% | 26K | 6.5K | 2.9K | 1.7K | 770 |
| 30% | 15K | 3.8K | 1.7K | 960 | 440 |
| 50% | 6.3K | 1.6K | 690 | 390 | 170 |
Key takeaway: lower baseline conversion rates require dramatically more traffic. At a fixed relative MDE, halving a low baseline rate roughly doubles the required sample.
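To see where this scaling comes from, the sketch below uses a common simplification of the sample size formula: n ≈ 2(zα + zβ)² · p(1 − p) / δ², where δ is the absolute difference implied by the relative MDE. `approxSamplePerVariant` is a hypothetical helper for illustration only. Its results differ slightly from `calculateSampleSize` because it uses the baseline variance for both variants.

```javascript
// Rule-of-thumb approximation of the per-variant sample size.
// delta is the absolute lift: baselineRate * relativeMde.
function approxSamplePerVariant(baselineRate, relativeMde, zAlpha = 1.96, zBeta = 0.842) {
  const delta = baselineRate * relativeMde;
  const variance = baselineRate * (1 - baselineRate);
  return Math.ceil(2 * Math.pow(zAlpha + zBeta, 2) * variance / (delta * delta));
}

// Same 10% relative MDE, different baselines
for (const percent of [1, 2, 5, 10]) {
  const n = approxSamplePerVariant(percent / 100, 0.10);
  console.log(`${percent}% baseline: ~${n} per variant`);
}
```

Because δ shrinks in proportion to the baseline while the variance p(1 − p) shrinks only linearly, the required sample grows roughly as (1 − p) / p for a fixed relative MDE.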
Deep Link Scenario Examples
| Scenario | Baseline Rate | MDE | Sample Per Variant | Days at 1K clicks/day (2 variants) |
|---|---|---|---|---|
| Smart banner tap rate | 3% | 15% | ~24K | ~49 days |
| Landing page install rate | 8% | 10% | ~19K | ~38 days |
| Email deep link CTR | 15% | 10% | ~9.3K | ~19 days |
| CTA button click rate | 25% | 15% | ~2.2K | ~5 days |
| In-app upgrade prompt | 2% | 20% | ~21K | ~43 days |
Common Mistakes
1. Stopping Too Early
The most common mistake. You see one variant "winning" after 3 days and declare victory.
// BAD: Checking results daily and stopping when significant
function checkResultsDaily(experiment) {
const results = getResults(experiment.id);
if (results.pValue < 0.05) {
declareWinner(results.leadingVariant); // Don't do this
}
}
// GOOD: Pre-calculate when to check
function setupExperiment(experiment) {
const sampleSize = calculateSampleSize(
experiment.baselineRate,
experiment.mde,
0.05,
0.80
);
return {
...experiment,
requiredSamplePerVariant: sampleSize,
estimatedDays: Math.ceil(sampleSize * experiment.variants.length / experiment.dailyTraffic),
checkpoints: [0.5, 0.75, 1.0].map(p => Math.ceil(sampleSize * p)),
};
}
Why early stopping is dangerous: with enough "peeking," random fluctuations will appear significant. If you check results after every 100 visitors, there's up to a 30% chance of a false positive (not 5%).
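The effect of peeking can be checked with a small A/A simulation, where both variants share the same true conversion rate and any declared winner is a false positive. The sketch below is illustrative only: the parameters (10% conversion rate, 2,000 visitors per variant, a check every 100 visitors, 1,000 simulated tests) are assumptions, and `seededRandom`, `zStatistic`, and `simulatePeeking` are hypothetical helper names. Exact rates will vary with these choices.

```javascript
// Small seeded generator (mulberry32-style) so the simulation is repeatable.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z statistic.
function zStatistic(convA, nA, convB, nB) {
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return se === 0 ? 0 : (convA / nA - convB / nB) / se;
}

// Run many A/A tests and count how often each strategy declares a winner.
// visitorsPerVariant should be a multiple of peekEvery so the final look is also a peek.
function simulatePeeking({ runs, visitorsPerVariant, peekEvery, rate, seed }) {
  const random = seededRandom(seed);
  let winsWithPeeking = 0;
  let winsAtEndOnly = 0;
  for (let run = 0; run < runs; run++) {
    let convA = 0;
    let convB = 0;
    let significantAtAnyPeek = false;
    for (let n = 1; n <= visitorsPerVariant; n++) {
      if (random() < rate) convA++;
      if (random() < rate) convB++;
      if (n % peekEvery === 0 && Math.abs(zStatistic(convA, n, convB, n)) > 1.96) {
        significantAtAnyPeek = true;
      }
    }
    if (significantAtAnyPeek) winsWithPeeking++;
    if (Math.abs(zStatistic(convA, visitorsPerVariant, convB, visitorsPerVariant)) > 1.96) {
      winsAtEndOnly++;
    }
  }
  return {
    falsePositiveRateWithPeeking: winsWithPeeking / runs,
    falsePositiveRateAtEndOnly: winsAtEndOnly / runs,
  };
}

console.log(simulatePeeking({
  runs: 1000, visitorsPerVariant: 2000, peekEvery: 100, rate: 0.1, seed: 42,
}));
```

The end-only rate should land near the nominal 5%, while the stop-at-first-significance rate comes out higher, since every extra look is another chance for noise to cross the threshold.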
2. Not Accounting for Multiple Variants
Testing 4 variants instead of 2 requires a correction:
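function adjustedSampleSize(baseSize, numVariants) {
  // With fewer than two variants there is nothing to compare
  if (numVariants < 2) return baseSize;
  // Bonferroni correction: divide alpha by the number of pairwise comparisons
  const numComparisons = numVariants * (numVariants - 1) / 2;
  const adjustedAlpha = 0.05 / numComparisons;
  // Rough heuristic: scale the base size by the ratio of log(1 / alpha) values.
  // It runs slightly above the multipliers in the table below. For exact numbers,
  // rerun calculateSampleSize with adjustedAlpha and a z-score function that supports it.
  return Math.ceil(baseSize * Math.log(1 / adjustedAlpha) / Math.log(1 / 0.05));
}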
| Variants | Comparisons | Sample Size Multiplier |
|---|---|---|
| 2 | 1 | 1.0x |
| 3 | 3 | ~1.3x |
| 4 | 6 | ~1.5x |
| 5 | 10 | ~1.7x |
3. Using the Wrong Baseline Rate
Your baseline should come from at least 2-4 weeks of data. Don't use a single day's data or an estimate from a different product.
async function getReliableBaseline(metric, minDays = 14) {
const data = await getMetricData(metric, { days: minDays });
return {
rate: data.totalConversions / data.totalImpressions,
sampleSize: data.totalImpressions,
dateRange: data.dateRange,
dayOfWeekVariation: data.stdDevByDayOfWeek,
isReliable: data.totalImpressions > 1000 && data.days >= minDays,
};
}
4. Ignoring Day-of-Week Effects
Deep link traffic patterns differ between weekdays and weekends. Always run tests for full weeks:
function calculateMinimumRuntime(dailyTraffic, requiredSample, variants) {
const totalRequired = requiredSample * variants;
const daysNeeded = Math.ceil(totalRequired / dailyTraffic);
// Round up to complete weeks
const weeksNeeded = Math.ceil(daysNeeded / 7);
const adjustedDays = weeksNeeded * 7;
// Minimum 14 days regardless
return Math.max(adjustedDays, 14);
}
Working with Low Traffic
Reduce the Number of Variants
Fewer variants means faster tests:
// Instead of testing 4 CTA variants at once
const slowTest = {
variants: ['Download Free', 'Get Started', 'Try It Now', 'Install'],
samplePerVariant: 38000,
totalRequired: 152000, // At 500/day = 304 days
};
// Test 2 at a time, iterate faster
const fastTest = {
round1: { variants: ['Download Free', 'Get Started'], totalRequired: 76000 }, // 152 days
round2: { variants: ['round1_winner', 'Try It Now'], totalRequired: 76000 },
// Total time is longer but you get actionable results sooner
};
Increase the MDE
If you only care about large improvements (20%+ lift), you need fewer samples:
// Detecting a 10% lift on 5% baseline: ~31K per variant
const conservativeTest = calculateSampleSize(0.05, 0.10, 0.05, 0.80);
// Detecting a 20% lift on 5% baseline: ~8.2K per variant
const aggressiveTest = calculateSampleSize(0.05, 0.20, 0.05, 0.80);
// Detecting a 30% lift on 5% baseline: ~3.8K per variant
const largeEffectTest = calculateSampleSize(0.05, 0.30, 0.05, 0.80);
The tradeoff: you might miss real improvements of 10-19% because you weren't powered to detect them.
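To put a rough number on that tradeoff, the sample size formula can be inverted to estimate power for a given sample. The sketch below is an approximation under stated assumptions: a two-sided test at alpha = 0.05, a 5% baseline, and an assumed 8,200 visitors per variant, chosen so that power for a 20% lift is about 80%. `normalCdf` (Abramowitz and Stegun erf approximation 7.1.26) and `approximatePower` are illustrative helper names.

```javascript
// Standard normal CDF via the Abramowitz-Stegun erf approximation (formula 7.1.26).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Approximate power of a two-sided two-proportion z-test with
// samplePerVariant visitors in each arm (the far tail is ignored).
function approximatePower(baselineRate, relativeLift, samplePerVariant, zAlpha = 1.96) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const pAvg = (p1 + p2) / 2;
  const seNull = Math.sqrt(2 * pAvg * (1 - pAvg));
  const seAlt = Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  const shift = Math.abs(p2 - p1) * Math.sqrt(samplePerVariant);
  return normalCdf((shift - zAlpha * seNull) / seAlt);
}

const n = 8200; // assumed sample per variant, sized for a 20% lift
console.log(approximatePower(0.05, 0.20, n).toFixed(2)); // 0.80
console.log(approximatePower(0.05, 0.10, n).toFixed(2)); // 0.30
```

Under these assumptions, a real 10% lift would be detected only about 30% of the time, so a "no significant difference" result says little about improvements smaller than the MDE.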
Lower Confidence Level
Use 90% confidence instead of 95% for exploratory tests:
// 95% confidence: ~31K per variant
const standard = calculateSampleSize(0.05, 0.10, 0.05, 0.80);
// 90% confidence: ~25K per variant (about 21% fewer)
const exploratory = calculateSampleSize(0.05, 0.10, 0.10, 0.80);
Use 90% for initial discovery (which variants are worth testing further) and 95% for final decisions.
Combine Small Tests into Larger Ones
If you have multiple low-traffic deep link campaigns, pool them:
// Assume each test needs 38K visitors in total
// Bad: Run separate tests on 5 campaigns with 200 clicks/day each
// 38K / 200 = 190 days per test
// Better: Run one test across all campaigns with 1000 clicks/day total
// 38K / 1000 = 38 days
// But only if the effect is expected to be consistent across campaigns
Pre-Test Checklist
Before starting any A/B test:
async function preTestChecklist(config) {
const checks = [];
// 1. Baseline rate available? (getReliableBaseline from above is async, so await it)
const baseline = await getReliableBaseline(config.metric, 14);
checks.push({
name: 'Reliable baseline',
pass: baseline.isReliable,
value: baseline.rate,
});
// 2. Sample size calculated?
const sampleSize = calculateSampleSize(baseline.rate, config.mde, 0.05, 0.80);
checks.push({
name: 'Sample size calculated',
pass: true,
value: sampleSize,
});
// 3. Enough traffic?
const dailyTraffic = getDailyTraffic(config.route);
const daysNeeded = Math.ceil(sampleSize * config.variants.length / dailyTraffic);
checks.push({
name: 'Runtime feasible',
pass: daysNeeded <= 60,
value: `${daysNeeded} days at ${dailyTraffic}/day`,
});
// 4. Full weeks?
const runtime = Math.max(Math.ceil(daysNeeded / 7) * 7, 14);
checks.push({
name: 'Full week runtime',
pass: true,
value: `${runtime} days (${runtime / 7} weeks)`,
});
// 5. No conflicting tests?
const conflicts = getConflictingExperiments(config.route);
checks.push({
name: 'No conflicts',
pass: conflicts.length === 0,
value: conflicts.length === 0 ? 'None' : conflicts.map(c => c.id).join(', '),
});
return checks;
}
Runtime Calculator
function estimateRuntime(params) {
const {
baselineRate,
minimumDetectableEffect,
dailyTraffic,
numVariants = 2,
significance = 0.05,
power = 0.80,
} = params;
const samplePerVariant = calculateSampleSize(
baselineRate, minimumDetectableEffect, significance, power
);
const totalSample = samplePerVariant * numVariants;
const rawDays = Math.ceil(totalSample / dailyTraffic);
const fullWeeks = Math.max(Math.ceil(rawDays / 7) * 7, 14);
return {
samplePerVariant,
totalSample,
rawDays,
recommendedDays: fullWeeks,
recommendedWeeks: fullWeeks / 7,
};
}
// Example: Testing smart banner tap rate
const estimate = estimateRuntime({
baselineRate: 0.03, // 3% tap rate
minimumDetectableEffect: 0.15, // Detect 15% relative lift
dailyTraffic: 2000, // 2000 banner impressions/day
numVariants: 2,
});
console.log(estimate);
// {
// samplePerVariant: ~24200,
// totalSample: ~48400,
// rawDays: 25,
// recommendedDays: 28,
// recommendedWeeks: 4,
// }
When Not to Use Sample Size Calculations
Bandit Algorithms
For optimization (not measurement), multi-armed bandit algorithms don't need fixed sample sizes. They continuously allocate more traffic to better-performing variants:
function banditAllocation(variants) {
  // Total observations across all variants; at least 1 so Math.log never sees 0
  const total = Math.max(
    variants.reduce((sum, v) => sum + v.successes + v.failures, 0),
    1
  );
  return variants.map(v => {
    // Add-one smoothing so a variant with no data yet still gets a finite score
    const successes = v.successes + 1;
    const failures = v.failures + 1;
    const pulls = successes + failures;
    const mean = successes / pulls;
    return {
      variantId: v.id,
      // UCB1 score: observed mean plus an exploration bonus that shrinks with more data.
      // Serve the next impression to the variant with the highest score.
      weight: mean + Math.sqrt(2 * Math.log(total) / pulls),
    };
  });
}
Use bandits when you want to minimize regret (lost conversions during the test) rather than measure a precise lift.
Very High Traffic
If you have millions of daily impressions, any reasonable test will reach its required sample size quickly. Focus on preventing false positives from multiple comparisons rather than worrying about sample size.
Best Practices
- Calculate sample size before starting: Never run a test without knowing how long it needs to run.
- Use your own baseline data: Don't use industry benchmarks. Measure your actual conversion rate for 2+ weeks.
- Plan for realistic MDE: If a 5% lift wouldn't change your business, don't power for it. Use 10-20% MDE for most deep link tests.
- Run for full weeks: Day-of-week effects are real. Always complete full 7-day cycles.
- Don't peek constantly: Pre-set checkpoints at 50%, 75%, and 100% of required sample. Only evaluate at those points.
- Document everything: Record your baseline, MDE, sample size, and runtime before the test starts.
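One lightweight way to document a test is to write the plan down as an immutable record before any traffic is served. The sketch below is only an illustration: `createTestPlan` and its field names are hypothetical, not tied to any particular tool, and the numbers are placeholders.

```javascript
// Build a frozen record of the test plan so it cannot be edited after launch.
function createTestPlan({ name, baselineRate, mde, samplePerVariant, runtimeDays, variants }) {
  return Object.freeze({
    name,
    baselineRate,
    mde,
    samplePerVariant,
    runtimeDays,
    variants: Object.freeze([...variants]),
    plannedAt: new Date().toISOString(),
  });
}

// Placeholder values for illustration
const plan = createTestPlan({
  name: 'smart-banner-copy',
  baselineRate: 0.03,
  mde: 0.15,
  samplePerVariant: 25000,
  runtimeDays: 28,
  variants: ['control', 'variant_b'],
});
console.log(JSON.stringify(plan, null, 2));
```

Storing this record alongside the results makes it easy to confirm later that the test ran to its planned sample size rather than stopping when the numbers looked good.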
For A/B testing features, see Tolinku A/B testing. For understanding test results, see the A/B testing results documentation.