A/B Testing Sample Size Calculator for Deep Links

Running an A/B test without calculating sample size first is like measuring with a broken ruler. You'll get a number, but it won't mean anything. Most deep link A/B tests require more traffic than teams expect, and stopping early leads to false conclusions. This guide explains how to calculate the right sample size, what affects it, and how to work around low-traffic situations.

For understanding statistical significance, see Statistical Significance for A/B Tests: What It Means. For measuring A/B test results, see Measuring A/B Test Results for Deep Link Campaigns.

Figure: Tolinku A/B testing dashboard for smart banners. The A/B tests list page shows test names, status, types, and variant counts.

The Formula

Standard Sample Size Calculation

For a two-proportion z-test (comparing conversion rates between two variants):

function calculateSampleSize(baselineRate, minimumDetectableEffect, significance, power) {
  // Defaults: alpha = 0.05 (95% confidence), power = 0.80 (beta = 0.20)
  const alpha = significance || 0.05;
  const beta = 1 - (power || 0.80);

  // Z-scores
  const zAlpha = getZScore(1 - alpha / 2); // 1.96 for 95%
  const zBeta = getZScore(1 - beta);       // 0.84 for 80%

  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minimumDetectableEffect);
  const pAvg = (p1 + p2) / 2;

  const numerator = Math.pow(
    zAlpha * Math.sqrt(2 * pAvg * (1 - pAvg)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  const denominator = Math.pow(p2 - p1, 2);

  return Math.ceil(numerator / denominator);
}

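function getZScore(percentile) {
  // Lookup table of common standard-normal quantiles (not a general inverse CDF)
  const table = [
    [0.80, 0.842], [0.90, 1.282], [0.95, 1.645], [0.975, 1.96], [0.995, 2.576],
  ];
  // Compare with a tolerance: inputs such as 1 - 0.05 / 2 carry floating-point error
  const match = table.find(([p]) => Math.abs(p - percentile) < 1e-9);
  if (!match) {
    // Fail loudly rather than silently falling back to 1.96
    throw new RangeError(`No z-score for percentile ${percentile}; use a statistics library`);
  }
  return match[1];
}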
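For percentiles outside the lookup table, one option is to invert the normal CDF numerically. The sketch below is an assumption-laden stand-in, not part of the original calculator: `erf`, `normalCdf`, and `inverseNormalCdf` are hypothetical helper names, the error function uses the Abramowitz-Stegun polynomial approximation (accurate to roughly seven decimal places), and the inverse is found by plain bisection. That is adequate for typical alpha and power values but not for extreme tails.

```javascript
// Error function via the Abramowitz-Stegun 7.1.26 polynomial approximation.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Inverse normal CDF by bisection on [-10, 10].
function inverseNormalCdf(percentile) {
  if (!(percentile > 0 && percentile < 1)) {
    throw new RangeError('percentile must be strictly between 0 and 1');
  }
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < percentile) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

console.log(inverseNormalCdf(0.975).toFixed(3)); // about 1.960
```

In production code a maintained statistics library is usually the safer choice; the sketch only shows that nothing beyond basic arithmetic is needed.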

What the Variables Mean

Variable	What It Is	Typical Value
Baseline rate	Your current conversion rate	Varies (e.g., 5% CTR)
Minimum detectable effect (MDE)	The smallest improvement worth detecting	10-20% relative lift
Significance level (alpha)	Probability of a false positive	0.05 (95% confidence)
Power (1 – beta)	Probability of detecting a real effect	0.80 (80% power)
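Written out as a formula, the calculation performed by calculateSampleSize looks like this. The symbols are shorthand introduced here: p_1 is the baseline rate, p_2 = p_1(1 + MDE) is the rate you hope to detect, and n is the sample size per variant.

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
      + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_2 - p_1)^2},
\qquad \bar{p} = \frac{p_1 + p_2}{2}

% Worked example: p_1 = 0.05, MDE = 10\%, so p_2 = 0.055 and \bar{p} = 0.0525
n \approx \frac{\left( 1.96 \times 0.31542 + 0.842 \times 0.31540 \right)^2}{0.005^2}
  = \frac{0.88378^2}{0.000025}
  \approx 31{,}243
```

Because the denominator is the squared absolute difference between the two rates, halving the MDE roughly quadruples the required sample.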

Sample Size Reference Table

Per Variant (Not Total)

These are samples needed per variant, not total, computed with the formula above at alpha = 0.05 and 80% power. MDE columns are relative lifts. For a 2-variant test, double the number.

Baseline Rate	5% MDE	10% MDE	15% MDE	20% MDE	30% MDE
1%	637K	163K	74K	43K	20K
2%	315K	81K	37K	21K	9.8K
5%	122K	31K	14K	8.2K	3.8K
10%	58K	15K	6.7K	3.8K	1.8K
20%	26K	6.5K	2.9K	1.7K	770
30%	15K	3.8K	1.7K	960	440
50%	6.3K	1.6K	690	390	170

Key takeaway: lower baseline conversion rates require dramatically more traffic.

Deep Link Scenario Examples

Scenario	Baseline Rate	MDE	Sample Per Variant	Days at 1K clicks/day (2 variants)
Smart banner tap rate	3%	15%	~24K	~49 days
Landing page install rate	8%	10%	~19K	~38 days
Email deep link CTR	15%	10%	~9.3K	~19 days
CTA button click rate	25%	15%	~2.2K	~5 days
In-app upgrade prompt	2%	20%	~21K	~43 days

Common Mistakes

1. Stopping Too Early

The most common mistake. You see one variant "winning" after 3 days and declare victory.

// BAD: Checking results daily and stopping when significant
function checkResultsDaily(experiment) {
  const results = getResults(experiment.id);
  if (results.pValue < 0.05) {
    declareWinner(results.leadingVariant); // Don't do this
  }
}

// GOOD: Pre-calculate when to check
function setupExperiment(experiment) {
  const sampleSize = calculateSampleSize(
    experiment.baselineRate,
    experiment.mde,
    0.05,
    0.80
  );

  return {
    ...experiment,
    requiredSamplePerVariant: sampleSize,
    estimatedDays: Math.ceil(sampleSize * experiment.variants.length / experiment.dailyTraffic),
    checkpoints: [0.5, 0.75, 1.0].map(p => Math.ceil(sampleSize * p)),
  };
}

Why early stopping is dangerous: every extra look at the data is another chance for a random fluctuation to cross the significance threshold. If you check after every 100 visitors and stop at the first p < 0.05, the chance of a false positive can reach roughly 30% instead of the nominal 5%.
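The inflation from peeking can be reproduced with a simulated A/A test, where both variants share the same true conversion rate and every "significant" result is a false positive by construction. The sketch below is illustrative only: `mulberry32`, `isSignificant`, and `simulatePeeking` are names made up for this example, the 10% rate and 2,000 visitors per variant are arbitrary assumptions, and the seeded generator exists purely to make runs repeatable.

```javascript
// Small seeded pseudo-random generator (mulberry32-style) for reproducible runs.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-sided two-proportion z-test at alpha = 0.05 (|z| > 1.96), pooled variance.
function isSignificant(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt(2 * pooled * (1 - pooled) / n);
  if (se === 0) return false;
  return Math.abs(convA / n - convB / n) / se > 1.96;
}

// Run many A/A tests and compare two stopping rules on the same data:
// "peeking" stops at the first significant look, "fixed" looks once at the end.
function simulatePeeking({ rate = 0.1, perVariant = 2000, peekEvery = 100, runs = 1000, seed = 42 } = {}) {
  const rand = mulberry32(seed);
  let peekingHits = 0;
  let fixedHits = 0;
  for (let r = 0; r < runs; r++) {
    let convA = 0;
    let convB = 0;
    let sawSignificant = false;
    for (let n = 1; n <= perVariant; n++) {
      if (rand() < rate) convA++;
      if (rand() < rate) convB++;
      if (n % peekEvery === 0 && isSignificant(convA, convB, n)) sawSignificant = true;
    }
    if (sawSignificant) peekingHits++;
    if (isSignificant(convA, convB, perVariant)) fixedHits++;
  }
  return { peeking: peekingHits / runs, fixed: fixedHits / runs };
}

const result = simulatePeeking();
console.log(result); // false-positive rate under each stopping rule
```

With these settings the fixed rule should land near the nominal 5%, while the peeking rule, which gets 20 chances per test, should come out well above it.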

2. Not Accounting for Multiple Variants

Testing 4 variants instead of 2 requires a correction:

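function adjustedSampleSize(baseSize, numVariants) {
  // With 2 variants there is a single comparison, so no correction is needed
  if (numVariants <= 2) return baseSize;

  // Bonferroni correction: divide alpha by the number of pairwise comparisons
  const numComparisons = numVariants * (numVariants - 1) / 2;
  const adjustedAlpha = 0.05 / numComparisons;

  // Shortcut: scale the base size by a log ratio instead of recomputing z-scores.
  // This slightly overestimates the exact increase (about 1.37x, 1.60x, and 1.77x
  // for 3, 4, and 5 variants).
  return Math.ceil(baseSize * Math.log(1 / adjustedAlpha) / Math.log(1 / 0.05));
}

Exact multipliers at 80% power, when the sample size is recalculated with the Bonferroni-adjusted alpha:

Variants	Comparisons	Sample Size Multiplier
2	1	1.0x
3	3	~1.3x
4	6	~1.5x
5	10	~1.7x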
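The multipliers can also be derived from z-scores directly, since the required sample size scales with the square of (z for alpha + z for power). The sketch below assumes alpha = 0.05 overall and 80% power; `bonferroniMultiplier` and the constant names are hypothetical, and the z-values are standard normal table entries for the Bonferroni-adjusted alphas with 1, 3, 6, and 10 comparisons.

```javascript
// z at 1 - 0.05 / (2 * comparisons), keyed by number of pairwise comparisons.
const Z_TWO_SIDED = { 1: 1.960, 3: 2.394, 6: 2.638, 10: 2.807 };
// z for 80% power.
const Z_POWER_80 = 0.842;

// Ratio of the Bonferroni-adjusted sample size to the unadjusted one.
function bonferroniMultiplier(numVariants) {
  const comparisons = numVariants * (numVariants - 1) / 2;
  const zAdjusted = Z_TWO_SIDED[comparisons];
  if (zAdjusted === undefined) {
    throw new RangeError(`No tabulated z-score for ${numVariants} variants`);
  }
  const unadjusted = Z_TWO_SIDED[1] + Z_POWER_80;
  return ((zAdjusted + Z_POWER_80) / unadjusted) ** 2;
}

console.log([2, 3, 4, 5].map(k => bonferroniMultiplier(k).toFixed(2)));
// [ '1.00', '1.33', '1.54', '1.70' ]
```

Comparing every variant only against the control, rather than all pairs, would mean fewer comparisons and a smaller multiplier.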

3. Using the Wrong Baseline Rate

Your baseline should come from at least two weeks of data, ideally four. Don't use a single day's data or an estimate from a different product.

async function getReliableBaseline(metric, minDays = 14) {
  const data = await getMetricData(metric, { days: minDays });

  return {
    rate: data.totalConversions / data.totalImpressions,
    sampleSize: data.totalImpressions,
    dateRange: data.dateRange,
    dayOfWeekVariation: data.stdDevByDayOfWeek,
    isReliable: data.totalImpressions > 1000 && data.days >= minDays,
  };
}

4. Ignoring Day-of-Week Effects

Deep link traffic patterns differ between weekdays and weekends. Always run tests for full weeks:

function calculateMinimumRuntime(dailyTraffic, requiredSample, variants) {
  const totalRequired = requiredSample * variants;
  const daysNeeded = Math.ceil(totalRequired / dailyTraffic);

  // Round up to complete weeks
  const weeksNeeded = Math.ceil(daysNeeded / 7);
  const adjustedDays = weeksNeeded * 7;

  // Minimum 14 days regardless
  return Math.max(adjustedDays, 14);
}

Working with Low Traffic

Reduce the Number of Variants

Fewer variants means faster tests:

// Instead of testing 4 CTA variants at once
const slowTest = {
  variants: ['Download Free', 'Get Started', 'Try It Now', 'Install'],
  samplePerVariant: 8160, // 5% baseline, 20% MDE, before any multiple-comparison correction
  totalRequired: 32640, // At 500/day = 66 days
};

// Test 2 at a time, iterate faster
const fastTest = {
  round1: { variants: ['Download Free', 'Get Started'], totalRequired: 16320 }, // 33 days
  round2: { variants: ['round1_winner', 'Try It Now'], totalRequired: 16320 },
  // Running every round takes longer in total, but each round yields an actionable result sooner
};

Increase the MDE

If you only care about large improvements (20%+ lift), you need fewer samples:

// Detecting a 10% lift on 5% baseline: about 31K per variant
const conservativeTest = calculateSampleSize(0.05, 0.10, 0.05, 0.80);

// Detecting a 20% lift on 5% baseline: about 8.2K per variant
const aggressiveTest = calculateSampleSize(0.05, 0.20, 0.05, 0.80);

// Detecting a 30% lift on 5% baseline: about 3.8K per variant
const largeEffectTest = calculateSampleSize(0.05, 0.30, 0.05, 0.80);

The tradeoff: you might miss real improvements of 10-19% because you weren't powered to detect them.
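To put a number on that tradeoff, the sample-size formula can be inverted to estimate the power a test actually has against a smaller true lift. The sketch below is an approximation under the same two-proportion z-test assumptions and ignores the negligible opposite tail; `achievedPower` and its helpers are hypothetical names, and the error function again uses the Abramowitz-Stegun polynomial approximation.

```javascript
// Error function via the Abramowitz-Stegun 7.1.26 polynomial approximation.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Approximate power of a two-sided test with samplePerVariant visitors per arm
// when the true relative lift is trueRelativeLift.
function achievedPower(baselineRate, trueRelativeLift, samplePerVariant, zAlpha = 1.96) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + trueRelativeLift);
  const pAvg = (p1 + p2) / 2;
  const seNull = Math.sqrt(2 * pAvg * (1 - pAvg));
  const seAlt = Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  const zBeta = (Math.abs(p2 - p1) * Math.sqrt(samplePerVariant) - zAlpha * seNull) / seAlt;
  return normalCdf(zBeta);
}

// 8160 per variant is what the formula gives for a 20% lift on a 5% baseline.
console.log(achievedPower(0.05, 0.20, 8160).toFixed(2)); // about 0.80
console.log(achievedPower(0.05, 0.10, 8160).toFixed(2)); // about 0.30
```

In other words, a test sized for a 20% lift would catch a real 10% lift only about three times in ten.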

Lower Confidence Level

Use 90% confidence instead of 95% for exploratory tests:

// 95% confidence: about 31K per variant
const standard = calculateSampleSize(0.05, 0.10, 0.05, 0.80);

// 90% confidence: about 25K per variant (roughly 21% fewer)
const exploratory = calculateSampleSize(0.05, 0.10, 0.10, 0.80);

Use 90% for initial discovery (which variants are worth testing further) and 95% for final decisions.

Combine Small Tests into Larger Ones

If you have multiple low-traffic deep link campaigns, pool them:

// Bad: run separate 2-variant tests on 5 campaigns with 200 clicks/day each
// 8.2K per variant (5% baseline, 20% MDE) x 2 variants / 200 = about 82 days per test

// Better: run one test across all campaigns with 1000 clicks/day total
// 8.2K per variant x 2 variants / 1000 = about 17 days
// But only if the effect is expected to be consistent across campaigns
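The arithmetic in those comments can be wrapped in a small helper for comparing separate and pooled runtimes. This is a minimal sketch: `daysToComplete` and `comparePooling` are hypothetical names, and it assumes traffic is split evenly across variants and that pooling is statistically appropriate for the campaigns involved.

```javascript
// Days needed to collect the full sample, rounded up to whole days.
function daysToComplete(samplePerVariant, numVariants, dailyTraffic) {
  return Math.ceil(samplePerVariant * numVariants / dailyTraffic);
}

// Compare running one test per campaign against one pooled test.
function comparePooling(campaignDailyClicks, samplePerVariant, numVariants = 2) {
  const pooledTraffic = campaignDailyClicks.reduce((sum, clicks) => sum + clicks, 0);
  return {
    separateDays: campaignDailyClicks.map(
      clicks => daysToComplete(samplePerVariant, numVariants, clicks)
    ),
    pooledDays: daysToComplete(samplePerVariant, numVariants, pooledTraffic),
  };
}

console.log(comparePooling([200, 200, 200, 200, 200], 8160));
// { separateDays: [ 82, 82, 82, 82, 82 ], pooledDays: 17 }
```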

Pre-Test Checklist

Before starting any A/B test:

async function preTestChecklist(config) {
  const checks = [];

  // 1. Baseline rate available? (reuses getReliableBaseline from above, which is async)
  const baseline = await getReliableBaseline(config.metric, 14);
  checks.push({
    name: 'Reliable baseline',
    pass: baseline.isReliable,
    value: baseline.rate,
  });

  // 2. Sample size calculated?
  const sampleSize = calculateSampleSize(baseline.rate, config.mde, 0.05, 0.80);
  checks.push({
    name: 'Sample size calculated',
    pass: true,
    value: sampleSize,
  });

  // 3. Enough traffic?
  const dailyTraffic = getDailyTraffic(config.route);
  const daysNeeded = Math.ceil(sampleSize * config.variants.length / dailyTraffic);
  checks.push({
    name: 'Runtime feasible',
    pass: daysNeeded <= 60,
    value: `${daysNeeded} days at ${dailyTraffic}/day`,
  });

  // 4. Full weeks?
  const runtime = Math.max(Math.ceil(daysNeeded / 7) * 7, 14);
  checks.push({
    name: 'Full week runtime',
    pass: true,
    value: `${runtime} days (${runtime / 7} weeks)`,
  });

  // 5. No conflicting tests?
  const conflicts = getConflictingExperiments(config.route);
  checks.push({
    name: 'No conflicts',
    pass: conflicts.length === 0,
    value: conflicts.length === 0 ? 'None' : conflicts.map(c => c.id).join(', '),
  });

  return checks;
}

Runtime Calculator

function estimateRuntime(params) {
  const {
    baselineRate,
    minimumDetectableEffect,
    dailyTraffic,
    numVariants = 2,
    significance = 0.05,
    power = 0.80,
  } = params;

  const samplePerVariant = calculateSampleSize(
    baselineRate, minimumDetectableEffect, significance, power
  );
  const totalSample = samplePerVariant * numVariants;
  const rawDays = Math.ceil(totalSample / dailyTraffic);
  const fullWeeks = Math.max(Math.ceil(rawDays / 7) * 7, 14);

  return {
    samplePerVariant,
    totalSample,
    rawDays,
    recommendedDays: fullWeeks,
    recommendedWeeks: fullWeeks / 7,
  };
}

// Example: Testing smart banner tap rate
const estimate = estimateRuntime({
  baselineRate: 0.03,       // 3% tap rate
  minimumDetectableEffect: 0.15, // Detect 15% relative lift
  dailyTraffic: 2000,       // 2000 banner impressions/day
  numVariants: 2,
});

console.log(estimate);
// {
//   samplePerVariant: ~24200,
//   totalSample: ~48400,
//   rawDays: 25,
//   recommendedDays: 28,
//   recommendedWeeks: 4,
// }

When Not to Use Sample Size Calculations

Bandit Algorithms

For optimization (not measurement), multi-armed bandit algorithms don't need fixed sample sizes. They continuously allocate more traffic to better-performing variants:

function banditAllocation(variants) {
  // Total observations across all variants (floor of 1 avoids a negative log)
  const total = Math.max(1, variants.reduce((sum, v) => sum + v.successes + v.failures, 0));

  return variants.map(v => {
    // Add-one smoothing so unseen variants still get a finite score
    const successes = v.successes + 1;
    const failures = v.failures + 1;
    const trials = successes + failures;
    const mean = successes / trials;

    return {
      variantId: v.id,
      // UCB1 score: observed mean plus an exploration bonus that shrinks with more trials.
      // Send the next visitor to the variant with the highest score.
      weight: mean + Math.sqrt(2 * Math.log(total) / trials),
    };
  });
}

Use bandits when you want to minimize regret (lost conversions during the test) rather than measure a precise lift.

Very High Traffic

If you have millions of daily impressions, almost any reasonable test will reach its required sample size within days. Still run for full weeks, and focus on preventing false positives from multiple comparisons rather than worrying about sample size.

Best Practices

Calculate sample size before starting: Never run a test without knowing how long it needs to run.
Use your own baseline data: Don't use industry benchmarks. Measure your actual conversion rate for 2+ weeks.
Plan for realistic MDE: If a 5% lift wouldn't change your business, don't power for it. Use 10-20% MDE for most deep link tests.
Run for full weeks: Day-of-week effects are real. Always complete full 7-day cycles.
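Don't peek constantly: Pre-set checkpoints at 50%, 75%, and 100% of required sample, and evaluate only at those points. Even a few interim looks raise the false-positive rate, so treat the interim checkpoints as safety checks or apply a stricter threshold there.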
Document everything: Record your baseline, MDE, sample size, and runtime before the test starts.

For A/B testing features, see Tolinku A/B testing. For understanding test results, see the A/B testing results documentation.
