
A/B Testing Sample Size Calculator for Deep Links

By Tolinku Staff

Running an A/B test without calculating sample size first is like measuring with a broken ruler. You'll get a number, but it won't mean anything. Most deep link A/B tests require more traffic than teams expect, and stopping early leads to false conclusions. This guide explains how to calculate the right sample size, what affects it, and how to work around low-traffic situations.

For background on statistical significance, see "Statistical Significance for A/B Tests: What It Means." For guidance on reading outcomes, see "Measuring A/B Test Results for Deep Link Campaigns."

Figure: Tolinku A/B testing dashboard for smart banners. The A/B tests list page shows test names, status, types, and variant counts.

The Formula

Standard Sample Size Calculation

For a two-proportion z-test (comparing conversion rates between two variants):

function calculateSampleSize(baselineRate, minimumDetectableEffect, significance, power) {
  // Defaults: alpha = 0.05 (95% confidence), power = 0.80 (beta = 0.20)
  const alpha = significance || 0.05;
  const beta = 1 - (power || 0.80);

  // Z-scores
  const zAlpha = getZScore(1 - alpha / 2); // 1.96 for 95%
  const zBeta = getZScore(1 - beta);       // 0.84 for 80%

  const p1 = baselineRate;
  const p2 = baselineRate * (1 + minimumDetectableEffect);
  const pAvg = (p1 + p2) / 2;

  const numerator = Math.pow(
    zAlpha * Math.sqrt(2 * pAvg * (1 - pAvg)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  const denominator = Math.pow(p2 - p1, 2);

  return Math.ceil(numerator / denominator);
}

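function getZScore(percentile) {
  // Lookup table of common standard normal quantiles (not a general inverse CDF)
  const table = [
    [0.80, 0.842],
    [0.90, 1.282],
    [0.95, 1.645],
    [0.975, 1.96],
    [0.995, 2.576],
  ];
  // Compare with a tolerance: inputs like 1 - 0.05 / 2 carry floating-point error
  const match = table.find(([p]) => Math.abs(p - percentile) < 1e-9);
  if (!match) {
    // Fail loudly instead of silently falling back to 1.96
    throw new RangeError(`No z-score for percentile ${percentile}; use a statistics library`);
  }
  return match[1];
}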
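
The lookup above covers only a handful of percentiles. One way to handle arbitrary values without a library is sketched below. It is an illustration rather than part of the original calculator, and the names `normalCdf` and `inverseNormalCdf` are hypothetical. The first uses the Abramowitz-Stegun polynomial approximation of the error function; the second inverts it by bisection.

```javascript
// Standard normal CDF using the Abramowitz-Stegun 7.1.26 approximation of erf
// (absolute error on the order of 1e-7)
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Inverse CDF by bisection: find z such that normalCdf(z) is approximately p
function inverseNormalCdf(p) {
  if (!(p > 0 && p < 1)) {
    throw new RangeError('p must be strictly between 0 and 1');
  }
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < p) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return (lo + hi) / 2;
}

console.log(inverseNormalCdf(0.975).toFixed(2)); // 1.96
```

Bisection is slower than a closed-form rational approximation, but it is short and easy to check, which matters more for a calculation that runs once per experiment.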

What the Variables Mean

| Variable | What It Is | Typical Value |
|---|---|---|
| Baseline rate | Your current conversion rate | Varies (e.g., 5% CTR) |
| Minimum detectable effect (MDE) | The smallest improvement worth detecting | 10-20% relative lift |
| Significance level (alpha) | Probability of a false positive | 0.05 (95% confidence) |
| Power (1 – beta) | Probability of detecting a real effect | 0.80 (80% power) |
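
Written out, the calculation in `calculateSampleSize` is the following per-variant formula, where $p_1$ is the baseline rate, $p_2 = p_1(1 + \text{MDE})$ is the rate after a relative lift, and $\bar{p} = (p_1 + p_2)/2$:

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
      + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}
     {(p_2 - p_1)^2}
```

As a worked case with the defaults ($z_{1-\alpha/2} = 1.96$, $z_{1-\beta} = 0.842$): a 5% baseline with a 10% relative MDE gives $p_2 = 0.055$ and $\bar{p} = 0.0525$, and $n$ comes out to roughly 31,200 per variant. Because the denominator is the squared absolute difference, halving the MDE roughly quadruples the required sample.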

Sample Size Reference Table

Per Variant (Not Total)

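These are samples needed per variant, not total; for a 2-variant test, double the number. Values come from the formula above at 95% confidence and 80% power, with MDE as a relative lift.

| Baseline Rate | 5% MDE | 10% MDE | 15% MDE | 20% MDE | 30% MDE |
|---|---|---|---|---|---|
| 1% | 637K | 163K | 74K | 43K | 20K |
| 2% | 315K | 81K | 37K | 21K | 9.8K |
| 5% | 122K | 31K | 14K | 8.2K | 3.8K |
| 10% | 58K | 15K | 6.7K | 3.8K | 1.8K |
| 20% | 26K | 6.5K | 2.9K | 1.7K | 770 |
| 30% | 15K | 3.8K | 1.7K | 960 | 440 |
| 50% | 6.3K | 1.6K | 690 | 390 | 170 |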

Key takeaway: lower baseline conversion rates require dramatically more traffic.
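
To check a reference table like this one, or to build one for your own baselines, the grid can be generated directly from the formula. The sketch below is self-contained and assumes fixed z-scores of 1.96 and 0.842 (95% confidence, 80% power); `sampleSizePerVariant` is a hypothetical name for a compact version of the calculation shown earlier.

```javascript
// Per-variant sample size for a two-sided, two-proportion z-test.
// relativeLift is the MDE expressed as a relative change (0.10 = 10% lift).
function sampleSizePerVariant(baselineRate, relativeLift, zAlpha = 1.96, zBeta = 0.842) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const pAvg = (p1 + p2) / 2;
  const root =
    zAlpha * Math.sqrt(2 * pAvg * (1 - pAvg)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil((root * root) / ((p2 - p1) ** 2));
}

const baselines = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.50];
const lifts = [0.05, 0.10, 0.15, 0.20, 0.30];

// One row per baseline rate, one column per MDE
const rows = baselines.map(baseline => ({
  baseline,
  sizes: lifts.map(lift => sampleSizePerVariant(baseline, lift)),
}));

for (const row of rows) {
  console.log(`${(row.baseline * 100).toFixed(0)}%`, row.sizes.join('  '));
}
```

Each row shrinks from left to right (larger lifts are easier to detect) and each column shrinks from top to bottom (higher baselines need less traffic).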

| Scenario | Baseline Rate | MDE | Sample Per Variant | Days at 1K clicks/day (2 variants) |
|---|---|---|---|---|
| Smart banner tap rate | 3% | 15% | ~24K | 49 days |
| Landing page install rate | 8% | 10% | ~19K | 38 days |
| Email deep link CTR | 15% | 10% | ~9.3K | 19 days |
| CTA button click rate | 25% | 15% | ~2.2K | 5 days |
| In-app upgrade prompt | 2% | 20% | ~21K | 43 days |

Common Mistakes

1. Stopping Too Early

The most common mistake. You see one variant "winning" after 3 days and declare victory.

// BAD: Checking results daily and stopping when significant
function checkResultsDaily(experiment) {
  const results = getResults(experiment.id);
  if (results.pValue < 0.05) {
    declareWinner(results.leadingVariant); // Don't do this
  }
}

// GOOD: Pre-calculate when to check
function setupExperiment(experiment) {
  const sampleSize = calculateSampleSize(
    experiment.baselineRate,
    experiment.mde,
    0.05,
    0.80
  );

  return {
    ...experiment,
    requiredSamplePerVariant: sampleSize,
    estimatedDays: Math.ceil(sampleSize * experiment.variants.length / experiment.dailyTraffic),
    checkpoints: [0.5, 0.75, 1.0].map(p => Math.ceil(sampleSize * p)),
  };
}

Why early stopping is dangerous: with enough "peeking," random fluctuations will appear significant. If you check results after every 100 visitors, there's up to a 30% chance of a false positive (not 5%).
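
The effect of peeking is easy to see in a simulated A/A test, where both variants share the same true rate and every "significant" result is a false positive by construction. The sketch below is an illustration under assumed settings (a 20% conversion rate, 2,000 visitors per variant, a check every 100 visitors); the function names and the small seeded random generator are not part of any Tolinku API.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-sided two-proportion z-test at alpha = 0.05, equal sample sizes
function isSignificant(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt((2 * pooled * (1 - pooled)) / n);
  if (se === 0) return false;
  return Math.abs(convA / n - convB / n) / se > 1.96;
}

// Share of A/A runs that look "significant" at one or more checkpoints
function falsePositiveRate({ runs, visitorsPerVariant, peekEvery, rate, seed }) {
  const random = mulberry32(seed);
  let falsePositives = 0;
  for (let run = 0; run < runs; run++) {
    let convA = 0;
    let convB = 0;
    let flagged = false;
    for (let n = 1; n <= visitorsPerVariant; n++) {
      if (random() < rate) convA++;
      if (random() < rate) convB++;
      if (!flagged && n % peekEvery === 0 && isSignificant(convA, convB, n)) {
        flagged = true;
      }
    }
    if (flagged) falsePositives++;
  }
  return falsePositives / runs;
}

const settings = { runs: 1000, visitorsPerVariant: 2000, rate: 0.2, seed: 42 };
const peeking = falsePositiveRate({ ...settings, peekEvery: 100 });     // 20 looks
const singleLook = falsePositiveRate({ ...settings, peekEvery: 2000 }); // 1 look
console.log({ peeking, singleLook });
```

Both calls replay the same simulated traffic, so the only difference is how often the result is checked. The single-look rate should sit near the nominal 5%, while the peeking rate lands several times higher.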

2. Not Accounting for Multiple Variants

Testing 4 variants instead of 2 requires a correction:

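function adjustedSampleSize(baseSize, numVariants) {
  // Bonferroni correction: divide alpha by the number of pairwise comparisons
  const numComparisons = numVariants * (numVariants - 1) / 2;
  const adjustedAlpha = 0.05 / numComparisons;

  // Rough heuristic: scale by the ratio of log(1 / alpha) instead of recomputing z-scores.
  // It lands a few percent above the exact recalculation; for a precise figure,
  // rerun calculateSampleSize with adjustedAlpha and a real inverse normal CDF.
  return Math.ceil(baseSize * Math.log(1 / adjustedAlpha) / Math.log(1 / 0.05));
}

Multipliers from an exact recalculation at 80% power:

| Variants | Comparisons | Sample Size Multiplier |
|---|---|---|
| 2 | 1 | 1.0x |
| 3 | 3 | ~1.3x |
| 4 | 6 | ~1.5x |
| 5 | 10 | ~1.7x |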

3. Using the Wrong Baseline Rate

Your baseline should come from at least 2-4 weeks of data. Don't use a single day's data or an estimate from a different product.

async function getReliableBaseline(metric, minDays = 14) {
  const data = await getMetricData(metric, { days: minDays });

  return {
    rate: data.totalConversions / data.totalImpressions,
    sampleSize: data.totalImpressions,
    dateRange: data.dateRange,
    dayOfWeekVariation: data.stdDevByDayOfWeek,
    isReliable: data.totalImpressions > 1000 && data.days >= minDays,
  };
}

4. Ignoring Day-of-Week Effects

Deep link traffic patterns differ between weekdays and weekends. Always run tests for full weeks:

function calculateMinimumRuntime(dailyTraffic, requiredSample, variants) {
  const totalRequired = requiredSample * variants;
  const daysNeeded = Math.ceil(totalRequired / dailyTraffic);

  // Round up to complete weeks
  const weeksNeeded = Math.ceil(daysNeeded / 7);
  const adjustedDays = weeksNeeded * 7;

  // Minimum 14 days regardless
  return Math.max(adjustedDays, 14);
}

Working with Low Traffic

Reduce the Number of Variants

Fewer variants means faster tests:

// Instead of testing 4 CTA variants at once
const slowTest = {
  variants: ['Download Free', 'Get Started', 'Try It Now', 'Install'],
  samplePerVariant: 8200, // 5% baseline, 20% MDE
  totalRequired: 32800, // At 500/day = 66 days
};

// Test 2 at a time, iterate faster
const fastTest = {
  round1: { variants: ['Download Free', 'Get Started'], totalRequired: 16400 }, // 33 days
  round2: { variants: ['round1_winner', 'Try It Now'], totalRequired: 16400 },
  // Total time is longer but you get actionable results sooner
};

Increase the MDE

If you only care about large improvements (20%+ lift), you need fewer samples:

// Detecting a 10% lift on 5% baseline: ~31K per variant
const conservativeTest = calculateSampleSize(0.05, 0.10, 0.05, 0.80);

// Detecting a 20% lift on 5% baseline: ~8.2K per variant
const aggressiveTest = calculateSampleSize(0.05, 0.20, 0.05, 0.80);

// Detecting a 30% lift on 5% baseline: ~3.8K per variant
const largeEffectTest = calculateSampleSize(0.05, 0.30, 0.05, 0.80);

The tradeoff: you might miss real improvements of 10-19% because you weren't powered to detect them.
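
To put a number on that tradeoff, you can compute the power a test actually has against a smaller lift than it was sized for. The sketch below is a hedged illustration: `achievedPower` is a hypothetical helper that inverts the sample size formula, and `normalCdf` uses the Abramowitz-Stegun approximation of the error function. The sample size of 8,161 is what the formula returns for a 5% baseline and a 20% relative MDE.

```javascript
// Standard normal CDF (Abramowitz-Stegun 7.1.26 approximation of erf)
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Approximate power of a two-sided two-proportion z-test
// for a given per-variant sample size (alpha = 0.05 by default)
function achievedPower(baselineRate, relativeLift, samplePerVariant, zAlpha = 1.96) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const pAvg = (p1 + p2) / 2;
  const nullSd = Math.sqrt(2 * pAvg * (1 - pAvg));
  const altSd = Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  const shift = Math.abs(p2 - p1) * Math.sqrt(samplePerVariant);
  return normalCdf((shift - zAlpha * nullSd) / altSd);
}

const n = 8161; // sized for a 20% relative lift on a 5% baseline
const powerAt20 = achievedPower(0.05, 0.20, n); // about 0.80 by construction
const powerAt10 = achievedPower(0.05, 0.10, n); // about 0.30
console.log({ powerAt20, powerAt10 });
```

In other words, a test sized for a 20% lift would catch a real 10% lift only about three times in ten.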

Lower Confidence Level

Use 90% confidence instead of 95% for exploratory tests:

// 95% confidence: ~31K per variant
const standard = calculateSampleSize(0.05, 0.10, 0.05, 0.80);

// 90% confidence: ~25K per variant (about 21% fewer)
const exploratory = calculateSampleSize(0.05, 0.10, 0.10, 0.80);

Use 90% for initial discovery (which variants are worth testing further) and 95% for final decisions.

Combine Small Tests into Larger Ones

If you have multiple low-traffic deep link campaigns, pool them:

// Bad: Run separate tests on 5 campaigns with 200 clicks/day each
// ~16.4K total (2 variants x ~8.2K) / 200 = 82 days per test

// Better: Run one test across all campaigns with 1000 clicks/day total
// ~16.4K / 1000 = 17 days
// But only if the effect is expected to be consistent across campaigns

Pre-Test Checklist

Before starting any A/B test:

async function preTestChecklist(config) {
  const checks = [];

  // 1. Baseline rate available? (reuses getReliableBaseline from above)
  const baseline = await getReliableBaseline(config.metric, 14);
  checks.push({
    name: 'Reliable baseline',
    pass: baseline.isReliable,
    value: baseline.rate,
  });

  // 2. Sample size calculated?
  const sampleSize = calculateSampleSize(baseline.rate, config.mde, 0.05, 0.80);
  checks.push({
    name: 'Sample size calculated',
    pass: true,
    value: sampleSize,
  });

  // 3. Enough traffic?
  const dailyTraffic = getDailyTraffic(config.route);
  const daysNeeded = Math.ceil(sampleSize * config.variants.length / dailyTraffic);
  checks.push({
    name: 'Runtime feasible',
    pass: daysNeeded <= 60,
    value: `${daysNeeded} days at ${dailyTraffic}/day`,
  });

  // 4. Full weeks?
  const runtime = Math.max(Math.ceil(daysNeeded / 7) * 7, 14);
  checks.push({
    name: 'Full week runtime',
    pass: true,
    value: `${runtime} days (${runtime / 7} weeks)`,
  });

  // 5. No conflicting tests?
  const conflicts = getConflictingExperiments(config.route);
  checks.push({
    name: 'No conflicts',
    pass: conflicts.length === 0,
    value: conflicts.length === 0 ? 'None' : conflicts.map(c => c.id).join(', '),
  });

  return checks;
}

Runtime Calculator

function estimateRuntime(params) {
  const {
    baselineRate,
    minimumDetectableEffect,
    dailyTraffic,
    numVariants = 2,
    significance = 0.05,
    power = 0.80,
  } = params;

  const samplePerVariant = calculateSampleSize(
    baselineRate, minimumDetectableEffect, significance, power
  );
  const totalSample = samplePerVariant * numVariants;
  const rawDays = Math.ceil(totalSample / dailyTraffic);
  const fullWeeks = Math.max(Math.ceil(rawDays / 7) * 7, 14);

  return {
    samplePerVariant,
    totalSample,
    rawDays,
    recommendedDays: fullWeeks,
    recommendedWeeks: fullWeeks / 7,
  };
}

// Example: Testing smart banner tap rate
const estimate = estimateRuntime({
  baselineRate: 0.03,       // 3% tap rate
  minimumDetectableEffect: 0.15, // Detect 15% relative lift
  dailyTraffic: 2000,       // 2000 banner impressions/day
  numVariants: 2,
});

console.log(estimate);
// {
//   samplePerVariant: ~24200,
//   totalSample: ~48400,
//   rawDays: 25,
//   recommendedDays: 28,
//   recommendedWeeks: 4,
// }

When Not to Use Sample Size Calculations

Bandit Algorithms

For optimization (not measurement), multi-armed bandit algorithms don't need fixed sample sizes. They continuously allocate more traffic to better-performing variants:

function banditAllocation(variants) {
  const total = variants.reduce((sum, v) => sum + v.successes + v.failures, 0);

  return variants.map(v => {
    // Add one pseudo-success and one pseudo-failure (uniform Beta(1, 1) prior)
    // so a variant with no data yet still gets a finite score
    const successes = v.successes + 1;
    const failures = v.failures + 1;
    const pulls = successes + failures;
    const mean = successes / pulls;

    // UCB1 score: smoothed mean plus an exploration bonus that shrinks with more data.
    // Math.max guards against Math.log(0) before any traffic arrives.
    const bonus = Math.sqrt(2 * Math.log(Math.max(total, 1)) / pulls);

    return { variantId: v.id, weight: mean + bonus };
  });
}

Use bandits when you want to minimize regret (lost conversions during the test) rather than measure a precise lift.

Very High Traffic

If you have millions of daily impressions, any reasonable test will reach its required sample size quickly. Focus on preventing false positives from multiple comparisons rather than worrying about sample size.

Best Practices

  1. Calculate sample size before starting: Never run a test without knowing how long it needs to run.
  2. Use your own baseline data: Don't use industry benchmarks. Measure your actual conversion rate for 2+ weeks.
  3. Plan for realistic MDE: If a 5% lift wouldn't change your business, don't power for it. Use 10-20% MDE for most deep link tests.
  4. Run for full weeks: Day-of-week effects are real. Always complete full 7-day cycles.
  5. Don't peek constantly: Pre-set checkpoints at 50%, 75%, and 100% of required sample. Only evaluate at those points.
  6. Document everything: Record your baseline, MDE, sample size, and runtime before the test starts.

For A/B testing features, see Tolinku A/B testing. For understanding test results, see the A/B testing results documentation.
