Scheduling A/B Tests: When to Start and Stop

Timing determines whether your A/B test produces actionable insights or misleading noise. Starting too soon means you lack the baseline to detect real differences. Stopping too early means you're acting on random fluctuations. Running too long wastes traffic you could spend on the next experiment. This guide covers when to start, how long to run, and when to stop A/B tests for deep link campaigns.

For statistical foundations, see Statistical Significance for A/B Tests: What It Means. For calculating traffic requirements before you begin, see A/B Testing Sample Size Calculator for Deep Links.

Figure: Tolinku A/B testing dashboard for smart banners. The A/B tests list page shows test names, status, types, and variant counts.

Prerequisites: Before You Start

Establish a Baseline

Never launch an A/B test without at least two weeks of baseline data. You need to understand your current conversion rate and its natural variance before you can detect meaningful changes. If your deep link click-to-install rate fluctuates between 2.8% and 3.5% on a normal week, a test result of 3.3% tells you nothing without that context.

Collect these baselines first:

Primary metric: the conversion rate you're testing (click-through, install, purchase)
Traffic volume: daily and weekly click counts for the route you're testing
Variance patterns: how much your metric fluctuates day-over-day and week-over-week
Day-of-week patterns: most apps see significant traffic differences between weekdays and weekends
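
The baselines listed above can be summarized from daily logs with a few lines of code. The following is a minimal sketch: the input shape (`date`, `clicks`, `conversions`) and the function name `summarizeBaseline` are assumptions made for illustration, not part of any Tolinku API. It needs at least two days of data for the standard deviation to be defined.

```javascript
// Hypothetical input: one record per day, e.g.
// { date: '2026-05-04', clicks: 520, conversions: 16 }
function summarizeBaseline(dailyRecords) {
  const rates = dailyRecords.map(d => d.conversions / d.clicks);
  const totalClicks = dailyRecords.reduce((sum, d) => sum + d.clicks, 0);
  const totalConversions = dailyRecords.reduce((sum, d) => sum + d.conversions, 0);

  // Day-over-day variability of the daily conversion rate (sample std dev)
  const meanRate = rates.reduce((sum, r) => sum + r, 0) / rates.length;
  const variance =
    rates.reduce((sum, r) => sum + (r - meanRate) ** 2, 0) / (rates.length - 1);

  // Average clicks per weekday (0 = Sunday ... 6 = Saturday, evaluated in UTC)
  const byWeekday = Array.from({ length: 7 }, () => ({ clicks: 0, days: 0 }));
  for (const d of dailyRecords) {
    const weekday = new Date(`${d.date}T00:00:00Z`).getUTCDay();
    byWeekday[weekday].clicks += d.clicks;
    byWeekday[weekday].days += 1;
  }

  return {
    overallRate: totalConversions / totalClicks,
    dailyRateStdDev: Math.sqrt(variance),
    minDailyRate: Math.min(...rates),
    maxDailyRate: Math.max(...rates),
    avgDailyClicks: totalClicks / dailyRecords.length,
    avgClicksByWeekday: byWeekday.map(w => (w.days ? w.clicks / w.days : null))
  };
}
```

The minimum and maximum daily rates give the "normal week" range described earlier, and the per-weekday averages make weekday versus weekend differences visible before the test starts.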

Calculate Required Sample Size

Use your baseline data to determine how many visitors each variant needs. The calculation depends on your baseline conversion rate, the minimum effect size you want to detect, and your chosen significance level.

function estimateTestDuration(dailyTraffic, requiredSamplePerVariant, numVariants) {
  const totalRequired = requiredSamplePerVariant * numVariants;
  const rawDays = Math.ceil(totalRequired / dailyTraffic);

  // Round up to the nearest full week
  const fullWeeks = Math.ceil(rawDays / 7);
  return {
    minimumDays: rawDays,
    recommendedDays: fullWeeks * 7,
    fullWeeks,
    totalSampleNeeded: totalRequired
  };
}

// Example: 500 daily visitors, need 3,800 per variant, 2 variants
const estimate = estimateTestDuration(500, 3800, 2);
// { minimumDays: 16, recommendedDays: 21, fullWeeks: 3, totalSampleNeeded: 7600 }

For detailed sample size calculations, see the A/B Testing Sample Size Calculator for Deep Links.
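
As a rough illustration of how the three inputs combine, the sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `estimateSamplePerVariant` and the defaults (two-sided significance level of 0.05, 80% power) are assumptions for this example; a dedicated calculator may use a slightly different formula.

```javascript
// Approximate per-variant sample size for comparing two conversion rates
// (two-sided test, normal approximation).
// Defaults assume alpha = 0.05 (z = 1.96) and 80% power (z = 0.8416).
function estimateSamplePerVariant(baselineRate, relativeMde, zAlpha = 1.96, zBeta = 0.8416) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
  const effect = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * varianceSum) / effect ** 2);
}

// Example: 10% baseline, detect a 20% relative lift (10% -> 12%)
const perVariant = estimateSamplePerVariant(0.10, 0.20);
// About 3,840 visitors per variant under these assumptions
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why small expected lifts translate into much longer tests.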

Verify Technical Readiness

Before starting, confirm these items:

Analytics tracking fires correctly for both variants
Deep link routing works for all test paths (verify with Tolinku's testing tools)
No upcoming deployments will change the pages or flows under test
Traffic allocation splits correctly (50/50 or your chosen ratio)
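
One way to check the last item is a sample ratio mismatch test. The sketch below is an assumption-laden illustration, not a Tolinku feature: it runs a chi-square goodness-of-fit check on observed visitor counts, and the default critical value of 3.841 only applies to two variants (one degree of freedom) at the 0.05 level. The function name `checkTrafficSplit` is hypothetical.

```javascript
// Chi-square goodness-of-fit check for the traffic split.
// criticalValue 3.841 = chi-square, 1 degree of freedom, alpha = 0.05
// (valid for exactly two variants; use a different value for more).
function checkTrafficSplit(observedCounts, targetWeights, criticalValue = 3.841) {
  const total = observedCounts.reduce((sum, n) => sum + n, 0);
  const weightSum = targetWeights.reduce((sum, w) => sum + w, 0);

  let chiSquare = 0;
  observedCounts.forEach((observed, i) => {
    const expected = total * (targetWeights[i] / weightSum);
    chiSquare += (observed - expected) ** 2 / expected;
  });

  return { chiSquare, mismatch: chiSquare > criticalValue };
}

// Example: a 50/50 test that received 5,000 and 5,400 visitors
const splitCheck = checkTrafficSplit([5000, 5400], [50, 50]);
// splitCheck.mismatch is true: the imbalance is larger than chance would explain
```

A flagged mismatch usually points to a routing or tracking problem rather than user behavior, so it is worth investigating before trusting any conversion numbers.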

How Long to Run Tests

The Full-Week Rule

Always run tests in complete weeks. User behavior varies dramatically by day of week. A test that starts Monday and ends Thursday captures none of the weekend pattern, which could differ by 30% or more in conversion rate.

Scenario	Minimum Duration	Recommended Duration
High traffic (5,000+ daily clicks)	7 days	14 days
Medium traffic (1,000-5,000 daily)	14 days	21 days
Low traffic (200-1,000 daily)	21 days	28 days
Very low traffic (<200 daily)	28+ days	Consider alternative methods
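
The table can be encoded as a simple lookup. This is a sketch only: the function name `recommendDuration` is hypothetical, and because 5,000 and 1,000 each appear in two rows, the code assumes boundary values belong to the higher-traffic tier.

```javascript
// Maps average daily clicks to the durations in the table above.
// Assumption: boundary values (1,000 and 5,000) fall into the higher tier.
function recommendDuration(dailyClicks) {
  if (dailyClicks >= 5000) return { tier: 'high', minimumDays: 7, recommendedDays: 14 };
  if (dailyClicks >= 1000) return { tier: 'medium', minimumDays: 14, recommendedDays: 21 };
  if (dailyClicks >= 200) return { tier: 'low', minimumDays: 21, recommendedDays: 28 };
  // Very low traffic: no recommended duration, consider alternative methods
  return { tier: 'very-low', minimumDays: 28, recommendedDays: null };
}
```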

Minimum Duration Floor

Even if you reach your required sample size in three days, do not stop the test. Short tests are vulnerable to novelty effects, day-of-week bias, and temporary traffic anomalies. The absolute minimum for any test is seven days, regardless of traffic volume.

function shouldTestContinue(test) {
  const now = new Date();
  const startDate = new Date(test.startedAt);
  const daysElapsed = (now - startDate) / (1000 * 60 * 60 * 24);
  const fullWeeksElapsed = Math.floor(daysElapsed / 7);

  // Enforce minimum duration
  if (fullWeeksElapsed < 1) {
    return { continue: true, reason: 'Minimum 1 full week not yet elapsed' };
  }

  // Check if we have enough samples
  const controlSample = test.variants[0].visitors;
  const treatmentSample = test.variants[1].visitors;

  if (controlSample < test.requiredSampleSize || treatmentSample < test.requiredSampleSize) {
    return { continue: true, reason: 'Required sample size not reached' };
  }

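  // Must end on a complete week boundary. daysElapsed is fractional, so
  // compare whole days; a strict check on the raw value would almost never be 0.
  if (Math.floor(daysElapsed) % 7 !== 0) {
    return { continue: true, reason: 'Waiting for current week to complete' };
  }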

  return { continue: false, reason: 'Test is eligible to stop' };
}

Maximum Duration Cap

Tests should not run indefinitely. Set a maximum duration upfront, typically four to six weeks. Beyond that point, external factors (app updates, competitor launches, seasonal shifts) erode the validity of your comparison. If you haven't reached significance by your cap, any real effect is probably smaller than the minimum effect you planned to detect, and likely too small to matter for your business.

When to Stop a Test

Significance Reached

The primary stop condition is reaching your pre-defined significance threshold (typically p < 0.05) after the minimum duration has passed. Both conditions must be true: statistical significance AND minimum duration completed.
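
To make the two-part condition concrete, here is a minimal sketch that computes a two-sided p-value with a pooled two-proportion z-test and combines it with the duration check. The function names are hypothetical, the normal CDF uses the Abramowitz and Stegun approximation to the error function (formula 7.1.26), and your testing platform may use a different statistical method.

```javascript
// Standard normal CDF via the Abramowitz-Stegun erf approximation (7.1.26)
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Pooled two-proportion z-test; returns the z statistic and two-sided p-value.
// Assumes both variants have visitors and at least one conversion overall.
function twoProportionPValue(control, treatment) {
  const p1 = control.conversions / control.visitors;
  const p2 = treatment.conversions / treatment.visitors;
  const pooled =
    (control.conversions + treatment.conversions) / (control.visitors + treatment.visitors);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.visitors + 1 / treatment.visitors)
  );
  const z = (p2 - p1) / standardError;
  return { z, pValue: 2 * (1 - normalCdf(Math.abs(z))) };
}

// Both conditions must hold: significance AND minimum duration
function canDeclareWinner(control, treatment, daysElapsed, minimumDays, alpha = 0.05) {
  const { pValue } = twoProportionPValue(control, treatment);
  return pValue < alpha && daysElapsed >= minimumDays;
}
```

With 3,800 visitors per variant, 380 control conversions, and 456 treatment conversions, the p-value is well below 0.05, yet `canDeclareWinner` still returns false on day three of a test with a seven-day minimum.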

Stop Condition	Action
Significant result + minimum duration met	Stop test, declare winner
Significant result + minimum duration NOT met	Continue until minimum duration
Not significant + maximum duration reached	Stop test, declare inconclusive
Not significant + maximum duration not reached	Continue running
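
The four rows of the table map directly onto a small decision function. This is a sketch; the function name `decideStopAction` and the shape of its input object are assumptions for illustration.

```javascript
// Encodes the stop-condition table. `significant` is whether the
// pre-defined significance threshold has been reached.
function decideStopAction({ significant, daysElapsed, minimumDays, maximumDays }) {
  if (significant) {
    return daysElapsed >= minimumDays
      ? { stop: true, action: 'Stop test, declare winner' }
      : { stop: false, action: 'Continue until minimum duration' };
  }
  return daysElapsed >= maximumDays
    ? { stop: true, action: 'Stop test, declare inconclusive' }
    : { stop: false, action: 'Continue running' };
}
```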

Futility Analysis

Sometimes you can tell early that a test will never reach significance. If the observed effect is trending in the opposite direction or is very close to zero once at least 50% of the planned sample has been collected, you can perform a futility check.

function checkFutility(test) {
  const progressRatio = test.currentSample / test.totalRequiredSample;

  // Only check futility after 50% of planned sample is collected
  if (progressRatio < 0.5) {
    return { futile: false, reason: 'Too early for futility analysis' };
  }

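  const controlRate = test.variants[0].conversions / test.variants[0].visitors;
  const treatmentRate = test.variants[1].conversions / test.variants[1].visitors;

  // Relative lift is undefined while the control has no conversions
  if (controlRate === 0) {
    return { futile: false, reason: 'Control has no conversions yet; lift is undefined' };
  }
  const observedLift = (treatmentRate - controlRate) / controlRate;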

  // If the observed effect is in the wrong direction after 50%+ data
  if (observedLift < 0 && test.expectedDirection === 'positive') {
    return {
      futile: true,
      reason: `Treatment is performing ${(observedLift * 100).toFixed(1)}% worse after ${(progressRatio * 100).toFixed(0)}% of data collected`
    };
  }

  // If more than 70% of the data is in and the observed effect is still
  // less than 20% of the minimum detectable effect
  const mdeRatio = Math.abs(observedLift) / test.minimumDetectableEffect;
  if (mdeRatio < 0.2 && progressRatio > 0.7) {
    return {
      futile: true,
      reason: 'Observed effect is too small to reach significance within planned duration'
    };
  }

  return { futile: false, reason: 'Test may still reach significance' };
}

External Event Interruptions

Some events invalidate your test entirely. When they occur, you should stop the test, discard the data, and plan to rerun later.

Events that require stopping:

A major app update changes the flow under test
A server outage causes tracking gaps
A viral event causes a sudden, abnormal traffic spike
A platform policy change affects link behavior (e.g., iOS or Android updates to deep link handling)

Events that require noting but not necessarily stopping:

A minor marketing campaign launches (segment this traffic if possible)
A holiday that was accounted for in planning
Normal seasonal fluctuation within expected ranges

Common Timing Mistakes

Stopping Too Early ("Peeking")

The most common mistake. You check results on day three, see a 15% lift with p = 0.04, and declare a winner. The problem: with small samples, random variation produces large swings. If you peek repeatedly and stop whenever significance appears, your actual false positive rate can exceed 30%, far above the 5% threshold you think you're using.

Solutions:

Set a minimum duration and do not check results before it passes
Use sequential testing methods that account for multiple looks
Pre-register your stopping criteria before the test begins
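
The inflation caused by peeking is easy to see in an A/A simulation, where both variants share the same true conversion rate and every "significant" result is a false positive. The sketch below is illustrative: the function names, parameters, and the small seeded generator (mulberry32, used only to make runs reproducible) are choices made for this example. It compares checking once at the end with checking after every day and stopping at the first |z| > 1.96.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z statistic; returns 0 when it is undefined
function zScore(a, b) {
  const pooled = (a.conversions + b.conversions) / (a.visitors + b.visitors);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / a.visitors + 1 / b.visitors));
  return se === 0 ? 0 : (b.conversions / b.visitors - a.conversions / a.visitors) / se;
}

// A/A simulation: both arms convert at the same true rate
function simulatePeeking({ simulations, days, visitorsPerDay, rate, seed }) {
  const random = mulberry32(seed);
  let fixedHorizonHits = 0;
  let peekingHits = 0;

  for (let s = 0; s < simulations; s++) {
    const a = { visitors: 0, conversions: 0 };
    const b = { visitors: 0, conversions: 0 };
    let significantAtAnyLook = false;
    let significantAtEnd = false;

    for (let day = 1; day <= days; day++) {
      for (const arm of [a, b]) {
        for (let v = 0; v < visitorsPerDay; v++) {
          arm.visitors++;
          if (random() < rate) arm.conversions++;
        }
      }
      const significant = Math.abs(zScore(a, b)) > 1.96;
      if (significant) significantAtAnyLook = true; // a peeker would stop here
      if (day === days) significantAtEnd = significant;
    }

    if (significantAtAnyLook) peekingHits++;
    if (significantAtEnd) fixedHorizonHits++;
  }

  return {
    fixedHorizonRate: fixedHorizonHits / simulations,
    peekingRate: peekingHits / simulations
  };
}

const result = simulatePeeking({
  simulations: 500, days: 14, visitorsPerDay: 200, rate: 0.1, seed: 42
});
console.log(result);
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is typically several times higher. The exact figures depend on the seed and the number of looks.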

Running Too Long

The opposite problem. A test that runs for three months captures seasonal shifts, app updates, and user behavior changes that have nothing to do with your variant. Long-running tests also carry an opportunity cost: every day spent on an inconclusive test is a day not spent on the next hypothesis.

Ignoring Day-of-Week Effects

Starting a test on Wednesday and stopping it ten days later on Saturday counts Wednesday through Friday twice and every other day only once. Weekend users often behave differently from weekday users (different intent, different devices, different conversion rates), so uneven day coverage skews the result. Always start and stop on the same day of the week.

Overlapping Tests

Running multiple tests on the same route or user segment creates interaction effects. Variant A in test one might boost conversions, but only when combined with variant B in test two. The result: both tests show significance, but neither result holds when deployed independently.
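
A simple guard is to check planned tests for route and date overlap before scheduling them. The sketch below assumes each test is described by a hypothetical object with `name`, `route`, `startDate`, and `endDate` (ISO `YYYY-MM-DD` strings); it only compares routes, so overlap by user segment would need an additional field.

```javascript
// Returns pairs of test names that share a route and overlap in time.
// Assumes ISO date strings (YYYY-MM-DD), which compare correctly as strings.
function findOverlappingTests(tests) {
  const conflicts = [];
  for (let i = 0; i < tests.length; i++) {
    for (let j = i + 1; j < tests.length; j++) {
      const a = tests[i];
      const b = tests[j];
      const sameRoute = a.route === b.route;
      const datesOverlap = a.startDate <= b.endDate && b.startDate <= a.endDate;
      if (sameRoute && datesOverlap) conflicts.push([a.name, b.name]);
    }
  }
  return conflicts;
}

const planned = [
  { name: 'CTA copy', route: '/promo/summer', startDate: '2026-06-01', endDate: '2026-06-21' },
  { name: 'Banner color', route: '/promo/summer', startDate: '2026-06-15', endDate: '2026-07-05' },
  { name: 'Onboarding link', route: '/welcome', startDate: '2026-06-01', endDate: '2026-06-21' }
];
const conflicts = findOverlappingTests(planned);
// conflicts contains one pair: ['CTA copy', 'Banner color']
```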

Scheduling Around Calendar Events

Holiday and Seasonal Planning

Block out periods when user behavior is abnormal. These windows vary by industry, but common ones include:

Period	Impact	Recommendation
Black Friday / Cyber Monday	2x-10x traffic spikes, different user intent	Do not start tests; pause running tests
Christmas / New Year (Dec 20 – Jan 5)	Lower engagement, gift-related behavior	Avoid starting new tests
Back to school (Aug – Sep)	Traffic shifts in education and retail apps	Account for in baseline, or avoid
App store feature / Product Hunt launch	Abnormal traffic source mix	Pause tests or segment traffic
Major OS releases (Sep – Oct)	New deep link behaviors, browser changes	Pause link behavior tests

Building a Test Calendar

Plan tests quarterly. Map out your experiment roadmap against known events, product launches, and marketing campaigns. This prevents conflicts and ensures each test gets a clean window.

function findNextTestWindow(calendarEvents, testDurationDays) {
  // Work at UTC midnight so weekday math matches the ISO dates returned below
  const today = new Date();
  const candidateStart = new Date(
    Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate())
  );

  // Start on the next Monday (a full week ahead if today is already Monday)
  const dayOfWeek = candidateStart.getUTCDay();
  const daysUntilMonday = (8 - dayOfWeek) % 7 || 7;
  candidateStart.setUTCDate(candidateStart.getUTCDate() + daysUntilMonday);

  while (true) {
    // End is the day the test stops, testDurationDays after the start
    const candidateEnd = new Date(candidateStart);
    candidateEnd.setUTCDate(candidateEnd.getUTCDate() + testDurationDays);

    // Date-only strings such as '2026-11-25' are parsed as UTC midnight
    const hasConflict = calendarEvents.some(event => {
      const eventStart = new Date(event.start);
      const eventEnd = new Date(event.end);
      return candidateStart <= eventEnd && candidateEnd >= eventStart;
    });

    if (!hasConflict) {
      return {
        start: candidateStart.toISOString().split('T')[0],
        end: candidateEnd.toISOString().split('T')[0],
        durationDays: testDurationDays
      };
    }

    // Try the next Monday
    candidateStart.setUTCDate(candidateStart.getUTCDate() + 7);
  }
}

// Example usage
const blockedPeriods = [
  { start: '2026-11-25', end: '2026-12-02', name: 'Black Friday / Cyber Monday' },
  { start: '2026-12-20', end: '2027-01-05', name: 'Holiday season' }
];

// Named nextWindow because `window` is already a global in browsers
const nextWindow = findNextTestWindow(blockedPeriods, 21);
// Returns the next clean 21-day window starting on a Monday

Automated Test Scheduling

Configuring Stop Criteria

When creating A/B tests in Tolinku, define your stop criteria upfront. This removes the temptation to peek and make emotional decisions.

const testConfig = {
  name: 'Homepage deep link CTA test',
  route: '/promo/summer',
  variants: [
    { name: 'Control', weight: 50 },
    { name: 'New CTA copy', weight: 50 }
  ],
  schedule: {
    startDate: '2026-06-01',       // Must be a Monday
    minimumDurationDays: 14,        // At least 2 full weeks
    maximumDurationDays: 42,        // Hard stop at 6 weeks
    significanceThreshold: 0.05,    // p < 0.05
    minimumSamplePerVariant: 3800
  },
  stopConditions: {
    significanceReached: true,      // Auto-stop when significant + min duration met
    futilityCheck: true,            // Auto-stop if futility detected after 50%
    maxDurationReached: true        // Auto-stop at max duration
  }
};
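
Because several of these settings depend on each other, it can help to validate a schedule before a test is created. The sketch below is an illustration only: `validateSchedule` is a hypothetical helper, not a Tolinku function, and it checks the rules described in this guide (Monday start, full-week durations, a seven-day floor, and a maximum that is not shorter than the minimum).

```javascript
// Validates a schedule object shaped like testConfig.schedule above.
function validateSchedule(schedule) {
  const errors = [];

  // Parse as UTC midnight so the weekday does not depend on the local time zone
  const start = new Date(`${schedule.startDate}T00:00:00Z`);
  if (Number.isNaN(start.getTime())) {
    errors.push('startDate is not a valid YYYY-MM-DD date');
  } else if (start.getUTCDay() !== 1) {
    errors.push('startDate is not a Monday');
  }

  if (schedule.minimumDurationDays < 7) {
    errors.push('minimumDurationDays is below the 7-day floor');
  }
  if (schedule.minimumDurationDays % 7 !== 0) {
    errors.push('minimumDurationDays is not a whole number of weeks');
  }
  if (schedule.maximumDurationDays % 7 !== 0) {
    errors.push('maximumDurationDays is not a whole number of weeks');
  }
  if (schedule.maximumDurationDays < schedule.minimumDurationDays) {
    errors.push('maximumDurationDays is shorter than minimumDurationDays');
  }

  return { valid: errors.length === 0, errors };
}
```

Running this against the schedule above reports no errors, since 2026-06-01 falls on a Monday and both durations are whole weeks.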

Monitoring Without Peeking

The goal is to automate monitoring so you don't need to check manually. Set up alerts for operational issues (tracking failures, traffic drops) without exposing intermediate results.

function configureTestAlerts(testId) {
  return {
    testId, // Ties the alert configuration to the test it monitors
    operational: [
      { type: 'traffic_drop', threshold: 0.5, message: 'Traffic dropped 50%+ from baseline' },
      { type: 'tracking_error', threshold: 0.01, message: 'Error rate exceeds 1%' },
      { type: 'variant_imbalance', threshold: 0.1, message: 'Traffic split deviates 10%+ from target' }
    ],
    results: [
      { type: 'test_complete', message: 'Test reached stop criteria' },
      { type: 'futility_detected', message: 'Test stopped for futility' }
    ]
  };
}

Best Practices Summary

Collect two weeks of baseline data before starting any test.
Calculate sample size first, then estimate duration. Never guess.
Always run in full-week increments, starting and ending on the same day.
Set minimum and maximum durations before the test begins.
Do not peek at results before the minimum duration passes.
Use futility analysis to save traffic on tests that clearly won't reach significance.
Block out holidays and abnormal periods in your test calendar.
Avoid overlapping tests on the same routes or user segments.
Automate stop criteria to remove subjective decision-making.
Document everything: hypothesis, start date, stop criteria, and results. Future you will thank present you.

For a broader look at testing strategies for deep links, see A/B Testing Deep Links and Landing Pages. To set up your first test, follow the A/B testing guide in the Tolinku docs.
