{"id":1100,"date":"2026-05-16T17:00:00","date_gmt":"2026-05-16T22:00:00","guid":{"rendered":"https:\/\/tolinku.com\/blog\/?p=1100"},"modified":"2026-03-07T03:34:43","modified_gmt":"2026-03-07T08:34:43","slug":"sequential-testing","status":"publish","type":"post","link":"https:\/\/tolinku.com\/blog\/sequential-testing\/","title":{"rendered":"Sequential Testing for Deep Link Experiments"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">You&#39;ve set up a deep link A\/B test comparing two landing pages. After three days, variant B is converting 18% better. Should you stop the test and ship it?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With a fixed-horizon test, the answer is no. You committed to a sample size upfront, and stopping early inflates your false positive rate. But what if you could monitor results continuously, stop as soon as you have enough evidence, and still maintain statistical validity? That&#39;s exactly what sequential testing provides.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For background on designing A\/B tests for deep links, see <a href=\"https:\/\/tolinku.com\/blog\/ab-testing-deep-links-landing-pages\/\">A\/B Testing Deep Links and Landing Pages<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/tolinku.com\/blog\/wp-content\/uploads\/2026\/03\/platform-ab-tests.png\" alt=\"Tolinku A\/B testing dashboard for smart banners\">\n<em>The A\/B tests list page showing test names, status, types, and variant counts.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Fixed-Horizon vs. Sequential Testing<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In a fixed-horizon test, you calculate a required sample size before starting, run the experiment until you reach it, then analyze the results once. It works, but it has a significant drawback: you cannot peek at the data. Every interim check inflates your Type I error rate (false positives). 
Checking a fixed-horizon test 5 times at the unadjusted 0.05 threshold pushes your actual false positive rate from 5% to roughly 14%, and the rate keeps climbing with every additional look.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Sequential testing solves this by building interim analysis directly into the statistical framework. You define a set of checkpoints (or continuously monitor) and use adjusted significance thresholds at each look. The total Type I error across all analyses stays at your target level (typically 5%).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This matters for deep link experiments because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Traffic is variable.<\/strong> Deep link campaigns spike during promotions and drop off between them. Waiting for a fixed sample size can take weeks during low-traffic periods.<\/li>\n<li><strong>Opportunity cost is real.<\/strong> If variant B is genuinely 30% better, running the losing variant for two more weeks costs conversions.<\/li>\n<li><strong>Multiple campaigns run in parallel.<\/strong> Faster decisions free up testing capacity for the next experiment.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">How Alpha Spending Functions Work<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Sequential testing uses <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sequential_analysis\" rel=\"nofollow noopener\" target=\"_blank\">alpha spending functions<\/a> to distribute the total allowable Type I error across multiple interim analyses. Instead of spending all 0.05 at one final look, you &quot;spend&quot; portions of alpha at each checkpoint.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Two spending functions dominate in practice:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">O&#39;Brien-Fleming<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Conservative early, aggressive late. O&#39;Brien-Fleming uses very strict thresholds for early looks and gradually relaxes them as the experiment approaches the target sample size. 
This makes it nearly impossible to stop very early (which is appropriate since early data is noisy), but barely penalizes you at the final analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pocock<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Equal thresholds at every look. Pocock uses the same adjusted significance level at each interim analysis. This makes early stopping easier but uses a stricter threshold at the final analysis compared to a fixed-horizon test.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Spending Function<\/th>\n<th>Early Stopping<\/th>\n<th>Final Analysis Threshold<\/th>\n<th>Best For<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>O&#39;Brien-Fleming<\/td>\n<td>Hard (strict thresholds)<\/td>\n<td>~0.048 (nearly unchanged)<\/td>\n<td>Large effects, patient teams<\/td>\n<\/tr>\n<tr>\n<td>Pocock<\/td>\n<td>Easier (uniform thresholds)<\/td>\n<td>~0.031 (stricter)<\/td>\n<td>Time-sensitive decisions<\/td>\n<\/tr>\n<\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For most deep link experiments, O&#39;Brien-Fleming is the better default. 
You get the option to stop early for truly dramatic effects while preserving almost full power at the final analysis.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Implementing Sequential Testing in JavaScript<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Here&#39;s a practical implementation of a group sequential design with an O&#39;Brien-Fleming spending function.<\/p>\n\n\n\n<pre><code class=\"language-javascript\">\/**\n * O&#39;Brien-Fleming alpha spending function.\n * Returns cumulative alpha spent at fraction t of the max sample size.\n *\/\nfunction obrienFlemingSpend(t, alphaTotal = 0.05) {\n  \/\/ Approximation: alpha(t) = 2 - 2 * Phi(z_{alpha\/2} \/ sqrt(t))\n  const z = getZScore(1 - alphaTotal \/ 2);\n  const adjusted = z \/ Math.sqrt(t);\n  return 2 * (1 - normalCDF(adjusted));\n}\n\n\/**\n * Pocock alpha spending function.\n * Returns cumulative alpha spent at fraction t.\n *\/\nfunction pocockSpend(t, alphaTotal = 0.05) {\n  return alphaTotal * Math.log(1 + (Math.E - 1) * t);\n}\n\n\/**\n * Standard normal CDF approximation (Abramowitz and Stegun).\n *\/\nfunction normalCDF(x) {\n  const a1 = 0.254829592, a2 = -0.284496736, a3 = 1.421413741;\n  const a4 = -1.453152027, a5 = 1.061405429, p = 0.3275911;\n  const sign = x &lt; 0 ? 
-1 : 1;\n  x = Math.abs(x) \/ Math.SQRT2;\n  const t = 1.0 \/ (1.0 + p * x);\n  const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);\n  return 0.5 * (1 + sign * y);\n}\n\nfunction getZScore(percentile) {\n  \/\/ Invert normalCDF by bisection so any alphaTotal works, not only 0.05 and 0.01\n  let lo = -10, hi = 10;\n  for (let i = 0; i &lt; 60; i++) {\n    const mid = (lo + hi) \/ 2;\n    if (normalCDF(mid) &lt; percentile) lo = mid; else hi = mid;\n  }\n  return (lo + hi) \/ 2;\n}\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Setting Up a Group Sequential Experiment<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Define interim analysis checkpoints and calculate the boundary z-scores at each look:<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function createSequentialExperiment({\n  maxSamplePerVariant,\n  numLooks,\n  alphaTotal = 0.05,\n  spendingFunction = &#39;obrien-fleming&#39;,\n}) {\n  const spend = spendingFunction === &#39;pocock&#39; ? pocockSpend : obrienFlemingSpend;\n  const fractions = Array.from(\n    { length: numLooks },\n    (_, i) =&gt; (i + 1) \/ numLooks\n  );\n\n  let prevAlpha = 0;\n  const boundaries = fractions.map((t) =&gt; {\n    const cumulativeAlpha = spend(t, alphaTotal);\n    const incrementalAlpha = cumulativeAlpha - prevAlpha;\n    prevAlpha = cumulativeAlpha;\n\n    \/\/ Convert incremental alpha to a two-sided z-score boundary.\n    \/\/ Testing each look at its incremental alpha is a conservative simplification:\n    \/\/ exact group sequential boundaries account for the correlation between looks\n    \/\/ and come out somewhat lower at later looks.\n    const zBoundary = inverseCDF(1 - incrementalAlpha \/ 2);\n    const sampleAtLook = Math.ceil(maxSamplePerVariant * t);\n\n    return { fraction: t, sampleAtLook, zBoundary, incrementalAlpha };\n  });\n\n  return { maxSamplePerVariant, boundaries, alphaTotal, spendingFunction };\n}\n\nfunction inverseCDF(p) {\n  \/\/ Rational approximation for the inverse normal CDF (accurate to about 4.5e-4)\n  if (p &gt;= 1) return Infinity;\n  if (p &lt;= 0) return -Infinity;\n  if (p &lt; 0.5) return -inverseCDF(1 - p);\n  const t = Math.sqrt(-2 * Math.log(1 - p));\n  const c0 = 2.515517, c1 = 0.802853, c2 = 0.010328;\n  const d1 = 1.432788, d2 = 0.189269, d3 = 0.001308;\n  return t - (c0 + c1 * t + c2 * t * t) \/ (1 + d1 * t + d2 * t * t + d3 * t * t * t);\n}\n<\/code><\/pre>\n\n\n\n<h3 
class=\"wp-block-heading\">Analyzing Results at Each Checkpoint<\/h3>\n\n\n\n<pre><code class=\"language-javascript\">function analyzeAtCheckpoint(experiment, lookIndex, controlData, variantData) {\n  const boundary = experiment.boundaries[lookIndex];\n  const { sampleAtLook, zBoundary } = boundary;\n\n  \/\/ Ensure we have enough data for this look\n  if (controlData.count &lt; sampleAtLook || variantData.count &lt; sampleAtLook) {\n    return { decision: &#39;wait&#39;, reason: &#39;Insufficient sample size for this checkpoint&#39; };\n  }\n\n  \/\/ Two-proportion z-test\n  const p1 = controlData.conversions \/ controlData.count;\n  const p2 = variantData.conversions \/ variantData.count;\n  const pPooled = (controlData.conversions + variantData.conversions)\n    \/ (controlData.count + variantData.count);\n  const se = Math.sqrt(pPooled * (1 - pPooled) * (1 \/ controlData.count + 1 \/ variantData.count));\n  const zScore = (p2 - p1) \/ se;\n\n  const isSignificant = Math.abs(zScore) &gt; zBoundary;\n  const isFinalLook = lookIndex === experiment.boundaries.length - 1;\n\n  if (isSignificant) {\n    return {\n      decision: &#39;stop&#39;,\n      winner: zScore &gt; 0 ? 
&#39;variant&#39; : &#39;control&#39;,\n      zScore,\n      zBoundary,\n      controlRate: p1,\n      variantRate: p2,\n      relativeLift: ((p2 - p1) \/ p1 * 100).toFixed(1) + &#39;%&#39;,\n    };\n  }\n\n  if (isFinalLook) {\n    return { decision: &#39;no-difference&#39;, zScore, zBoundary };\n  }\n\n  return { decision: &#39;continue&#39;, zScore, zBoundary, nextLookAt: experiment.boundaries[lookIndex + 1].sampleAtLook };\n}\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Putting It Together for a Deep Link Test<\/h3>\n\n\n\n<pre><code class=\"language-javascript\">\/\/ A\/B test: two landing page variants for a smart banner deep link\nconst experiment = createSequentialExperiment({\n  maxSamplePerVariant: 15000,\n  numLooks: 4,          \/\/ Analyze at 25%, 50%, 75%, 100%\n  alphaTotal: 0.05,\n  spendingFunction: &#39;obrien-fleming&#39;,\n});\n\nconsole.log(&#39;Boundaries:&#39;);\nexperiment.boundaries.forEach((b) =&gt; {\n  console.log(\n    `  Look at ${b.sampleAtLook} per variant: |z| &gt; ${b.zBoundary.toFixed(3)} to reject`\n  );\n});\n\n\/\/ At the second interim look (50% of data collected)\nconst result = analyzeAtCheckpoint(experiment, 1,\n  { count: 7500, conversions: 525 },   \/\/ Control: 7.0% conversion\n  { count: 7500, conversions: 637 },   \/\/ Variant: 8.5% conversion\n);\n\nconsole.log(result);\n\/\/ { decision: &#39;stop&#39;, winner: &#39;variant&#39;, relativeLift: &#39;21.3%&#39;, ... }\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">For more on the statistical foundations, see <a href=\"https:\/\/tolinku.com\/blog\/statistical-significance-ab-tests\/\">Statistical Significance for A\/B Tests<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Always-Valid Inference: Continuous Monitoring<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Group sequential designs require you to pre-specify the number of looks. 
Always-valid inference (also called &quot;anytime-valid&quot; testing) goes a step further: you can check results after every single observation without any alpha inflation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The key idea is to use <a href=\"https:\/\/en.wikipedia.org\/wiki\/Confidence_sequence\" rel=\"nofollow noopener\" target=\"_blank\">confidence sequences<\/a> instead of confidence intervals. A confidence sequence is valid at all sample sizes simultaneously, not just at the final analysis.<\/p>\n\n\n\n<pre><code class=\"language-javascript\">\/**\n * Mixture sequential probability ratio test (mSPRT), normal approximation.\n * Returns a &quot;wealth&quot; value; reject H0 when wealth &gt; 1\/alpha.\n * tau is the variance of the N(0, tau) mixing prior on the true difference.\n *\/\nfunction msprtStatistic(controlCount, controlConversions, variantCount, variantConversions, tau = 0.001) {\n  const pC = controlConversions \/ controlCount;\n  const pV = variantConversions \/ variantCount;\n  const n = controlCount + variantCount;\n  const pPool = (controlConversions + variantConversions) \/ n;\n\n  const variance = pPool * (1 - pPool) * (1 \/ controlCount + 1 \/ variantCount);\n  const delta = pV - pC;\n\n  \/\/ No conversions (or all conversions) in both arms: no evidence either way\n  if (variance === 0) return 1;\n\n  \/\/ Log of the mixture likelihood ratio: N(0, variance + tau) vs. N(0, variance)\n  const logLR = 0.5 * Math.log(variance \/ (variance + tau))\n    + (delta * delta * tau) \/ (2 * variance * (variance + tau));\n\n  return Math.exp(logLR);\n}\n\nfunction checkAlwaysValid(controlData, variantData, alpha = 0.05) {\n  const stat = msprtStatistic(\n    controlData.count, controlData.conversions,\n    variantData.count, variantData.conversions\n  );\n\n  return {\n    statistic: stat,\n    threshold: 1 \/ alpha,   \/\/ 20 for alpha = 0.05\n    reject: stat &gt; 1 \/ alpha,\n  };\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Always-valid methods are ideal for deep link experiments where traffic is unpredictable. You don&#39;t need to estimate a maximum sample size upfront. The tradeoff is slightly lower power compared to group sequential designs at any fixed sample size. 
In practice, you may need 10-30% more observations to reach the same conclusion.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">When to Use Sequential vs. Fixed-Horizon<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Scenario<\/th>\n<th>Recommended Approach<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Predictable, high traffic (10K+ clicks\/day)<\/td>\n<td>Fixed-horizon (simplest, most powerful)<\/td>\n<\/tr>\n<tr>\n<td>Moderate traffic, want early stopping for large effects<\/td>\n<td>Group sequential (O&#39;Brien-Fleming)<\/td>\n<\/tr>\n<tr>\n<td>Low or unpredictable traffic, continuous monitoring needed<\/td>\n<td>Always-valid inference<\/td>\n<\/tr>\n<tr>\n<td>Testing during a time-limited campaign (product launch, holiday)<\/td>\n<td>Group sequential (Pocock for easier early stopping)<\/td>\n<\/tr>\n<tr>\n<td>Multiple concurrent tests across many deep link routes<\/td>\n<td>Always-valid (no need to pre-plan sample sizes)<\/td>\n<\/tr>\n<\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For multivariate experiments with more than two variants, see <a href=\"https:\/\/tolinku.com\/blog\/multivariate-testing-deep-links\/\">Multivariate Testing for Deep Link Campaigns<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Practical Decision Rules<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Sequential testing gives you statistical rigor. But real decisions require more than a p-value. Here are practical rules for deep link experiments:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>1. Set a minimum effect size before starting.<\/strong> If a 2% lift in conversion isn&#39;t worth the engineering effort to ship variant B, don&#39;t run the test to detect 2% lifts. Set your minimum detectable effect at 10% or higher and size the experiment accordingly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>2. Define a maximum duration.<\/strong> Even with sequential testing, set a time limit. 
Deep link behavior changes with seasonality, app updates, and user composition. A test running for 6 months is measuring a moving target.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>3. Use futility boundaries.<\/strong> In addition to checking whether the variant is significantly better, check whether it&#39;s &quot;probably not better.&quot; If the z-score is close to zero at 50% of the planned sample size, consider stopping for futility and moving on to a different test.<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function checkFutility(zScore, fractionComplete, futilityThreshold = 0.5) {\n  \/\/ If the observed effect holds, the z-score grows with the square root of\n  \/\/ the sample size, so the projected final z is zScore \/ sqrt(fractionComplete).\n  \/\/ A projection below the threshold (including any negative z-score) means\n  \/\/ the variant is unlikely to reach significance.\n  const projectedFinalZ = zScore \/ Math.sqrt(fractionComplete);\n  return projectedFinalZ &lt; futilityThreshold;\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>4. Account for novelty effects.<\/strong> New deep link experiences often show inflated engagement in the first 48 hours. Exclude the first day or two from your sequential analysis, or set the first checkpoint at no less than 25% of the planned sample.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>5. Log every decision.<\/strong> Record the checkpoint, z-score, boundary, and outcome. When you review experiments quarterly, you want a clear audit trail showing that each decision was statistically grounded, not just a gut call.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Applying This in Tolinku<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Tolinku&#39;s <a href=\"https:\/\/tolinku.com\/features\/ab-testing\">A\/B testing feature<\/a> lets you split traffic across deep link variants and track conversion metrics per variant. 
When setting up an experiment, you can configure sequential testing parameters in your Appspace:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Choose your alpha spending function (O&#39;Brien-Fleming or Pocock)<\/li>\n<li>Set the number of interim analyses (3-5 looks is typical)<\/li>\n<li>Define your maximum sample size per variant<\/li>\n<li>Set futility boundaries to stop underperforming tests early<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The analytics dashboard shows your current z-score relative to the sequential boundary at each checkpoint, making it clear whether you&#39;ve crossed the threshold for a valid decision.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Key Takeaways<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Sequential testing is not about cutting corners. It&#39;s about using your statistical budget more efficiently. Fixed-horizon tests waste time when the answer is obvious early, and they tempt teams into peeking (which invalidates the results). Sequential testing formalizes early stopping so you can act on strong evidence without compromising validity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For deep link experiments specifically, the combination of variable traffic, time-sensitive campaigns, and multiple concurrent tests makes sequential testing the more practical choice. Start with a group sequential design using O&#39;Brien-Fleming spending. If you find that traffic is too unpredictable to pre-plan sample sizes, move to always-valid inference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The goal is the same as any testing framework: make better decisions, faster, with fewer false starts.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Use sequential testing to get valid A\/B test results faster. 
Monitor experiments continuously without inflating false positive rates.<\/p>\n","protected":false},"author":2,"featured_media":1099,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"Sequential Testing for Deep Link Experiments","rank_math_description":"Use sequential testing to get valid A\/B test results faster. Monitor experiments continuously without inflating false positive rates.","rank_math_focus_keyword":"sequential testing deep links","rank_math_canonical_url":"","rank_math_facebook_title":"","rank_math_facebook_description":"","rank_math_facebook_image":"https:\/\/tolinku.com\/blog\/wp-content\/uploads\/2026\/03\/og-sequential-testing.png","rank_math_facebook_image_id":"","rank_math_twitter_title":"","rank_math_twitter_description":"","rank_math_twitter_image":"https:\/\/tolinku.com\/blog\/wp-content\/uploads\/2026\/03\/og-sequential-testing.png","footnotes":""},"categories":[13],"tags":[60,37,191,20,225,69,256,258],"class_list":["post-1100","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-growth","tag-ab-testing","tag-analytics","tag-conversions","tag-deep-linking","tag-experimentation","tag-mobile-development","tag-optimization","tag-statistics"],"_links":{"self":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts\/1100","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/comments?post=1100"}],"version-history":[{"count":2,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts\/1100\/revisions"}],"predecessor-version":[{"id":2244,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts\/1100\/revisions\/224
4"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/media\/1099"}],"wp:attachment":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/media?parent=1100"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/categories?post=1100"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/tags?post=1100"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}