{"id":1056,"date":"2026-05-12T09:00:00","date_gmt":"2026-05-12T14:00:00","guid":{"rendered":"https:\/\/tolinku.com\/blog\/?p=1056"},"modified":"2026-03-07T03:34:28","modified_gmt":"2026-03-07T08:34:28","slug":"statistical-significance-ab-tests","status":"publish","type":"post","link":"https:\/\/tolinku.com\/blog\/statistical-significance-ab-tests\/","title":{"rendered":"Statistical Significance for A\/B Tests: What It Means"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Statistical significance tells you whether the difference between your A\/B test variants is real or just random noise. Without understanding it, you&#39;ll either call winners too early (wasting effort on changes that don&#39;t actually work) or run tests too long (missing opportunities). This guide explains what statistical significance means in practical terms, how to calculate it, and how to use it correctly for deep link experiments.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For sample size planning, see <a href=\"https:\/\/tolinku.com\/blog\/ab-testing-sample-size\/\">A\/B Testing Sample Size Calculator for Deep Links<\/a>. For reading test results, see <a href=\"https:\/\/tolinku.com\/blog\/ab-test-measurement\/\">Measuring A\/B Test Results for Deep Link Campaigns<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><img decoding=\"async\" src=\"https:\/\/tolinku.com\/blog\/wp-content\/uploads\/2026\/03\/platform-ab-tests.png\" alt=\"Tolinku A\/B testing dashboard for smart banners\">\n<em>The A\/B tests list page showing test names, status, types, and variant counts.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What Statistical Significance Actually Means<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">The Core Question<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When you run an A\/B test and Variant B converts at 5.2% vs. Variant A&#39;s 4.8%, is that 8.3% lift real? 
Or could it be random chance?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Statistical significance answers this: <strong>if there were truly no difference between the variants, how likely would you be to see a result this extreme (or more extreme) by chance?<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If that probability is very low (typically below 5%), we call the result &quot;statistically significant.&quot;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">P-Value Explained<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The p-value is this probability. A p-value of 0.03 means: &quot;If both variants were identical, there&#39;s a 3% chance you&#39;d see a difference at least this large just from random variation.&quot;<\/p>\n\n\n\n<pre><code class=\"language-javascript\">\/\/ Labels give the confidence level at which the result is significant,\n\/\/ not the probability that one variant is better\nfunction interpretPValue(pValue) {\n  if (pValue &lt; 0.01) return &#39;Highly significant (99%+ confidence)&#39;;\n  if (pValue &lt; 0.05) return &#39;Significant (95%+ confidence)&#39;;\n  if (pValue &lt; 0.10) return &#39;Marginally significant (90%+ confidence)&#39;;\n  return &#39;Not significant&#39;;\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Common misconceptions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>p = 0.05 does <strong>not<\/strong> mean &quot;95% chance B is better.&quot; It means &quot;5% chance of seeing a result at least this extreme if they were equal.&quot;<\/li>\n<li>A &quot;not significant&quot; result does <strong>not<\/strong> mean the variants are equal. It means you don&#39;t have enough evidence to conclude they&#39;re different.<\/li>\n<li>A significant result does <strong>not<\/strong> mean the effect is large. Small, meaningless differences can be significant with enough data.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Confidence Level vs. 
Confidence Interval<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Concept<\/th>\n<th>What It Is<\/th>\n<th>Example<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Confidence level<\/td>\n<td>The threshold for declaring significance<\/td>\n<td>95% (alpha = 0.05)<\/td>\n<\/tr>\n<tr>\n<td>Confidence interval<\/td>\n<td>The range of plausible values for the true effect<\/td>\n<td>3.2% to 13.4% lift<\/td>\n<\/tr>\n<tr>\n<td>P-value<\/td>\n<td>The probability of a result at least as extreme as the one observed, assuming no real difference<\/td>\n<td>0.03<\/td>\n<\/tr>\n<\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Confidence intervals are more useful than p-values because they tell you the <strong>magnitude<\/strong> of the effect, not just whether it exists:<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function confidenceInterval(conversionsA, totalA, conversionsB, totalB, confidenceLevel = 0.95) {\n  const pA = conversionsA \/ totalA;\n  const pB = conversionsB \/ totalB;\n  const diff = pB - pA;\n\n  const se = Math.sqrt(pA * (1 - pA) \/ totalA + pB * (1 - pB) \/ totalB);\n\n  const zScore = confidenceLevel === 0.95 ? 1.96 : confidenceLevel === 0.99 ? 
2.576 : 1.645;\n\n  return {\n    pointEstimate: diff,\n    lower: diff - zScore * se,\n    upper: diff + zScore * se,\n    relativeLift: (diff \/ pA * 100).toFixed(1) + &#39;%&#39;,\n  };\n}\n\n\/\/ Example: Banner A = 300\/10000, Banner B = 350\/10000\nconst ci = confidenceInterval(300, 10000, 350, 10000);\n\/\/ Approximately { pointEstimate: 0.005, lower: 0.0001, upper: 0.0099, relativeLift: &#39;16.7%&#39; }\n\/\/ The interval only just excludes 0, so this result is borderline significant at 95%\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Calculating Statistical Significance<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Two-Proportion Z-Test<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The standard test for comparing conversion rates:<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function zTest(conversionsA, totalA, conversionsB, totalB) {\n  const pA = conversionsA \/ totalA;\n  const pB = conversionsB \/ totalB;\n\n  \/\/ Pooled proportion\n  const pPooled = (conversionsA + conversionsB) \/ (totalA + totalB);\n\n  \/\/ Standard error\n  const se = Math.sqrt(pPooled * (1 - pPooled) * (1 \/ totalA + 1 \/ totalB));\n\n  \/\/ Z-score\n  const z = (pB - pA) \/ se;\n\n  \/\/ Two-tailed p-value (approximation)\n  const pValue = 2 * (1 - normalCDF(Math.abs(z)));\n\n  return {\n    rateA: (pA * 100).toFixed(2) + &#39;%&#39;,\n    rateB: (pB * 100).toFixed(2) + &#39;%&#39;,\n    zScore: z.toFixed(3),\n    pValue: pValue.toFixed(4),\n    significant: pValue &lt; 0.05,\n    lift: ((pB - pA) \/ pA * 100).toFixed(1) + &#39;%&#39;,\n  };\n}\n\nfunction normalCDF(x) {\n  \/\/ Approximation of the cumulative normal distribution\n  const t = 1 \/ (1 + 0.2316419 * Math.abs(x));\n  const d = 0.3989422804014327;\n  const p = d * Math.exp(-x * x \/ 2) * t *\n    (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));\n  return x &gt; 0 ? 
1 - p : p;\n}\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Example Calculations<\/h3>\n\n\n\n<pre><code class=\"language-javascript\">\/\/ Scenario 1: Clear winner\nconst test1 = zTest(150, 5000, 210, 5000);\n\/\/ Rate A: 3.00%, Rate B: 4.20%\n\/\/ z = 3.22, p = 0.0013 -&gt; Significant (40% lift)\n\n\/\/ Scenario 2: Not enough data\nconst test2 = zTest(15, 500, 21, 500);\n\/\/ Rate A: 3.00%, Rate B: 4.20%\n\/\/ z = 1.02, p = 0.31 -&gt; Not significant (same rates, less data)\n\n\/\/ Scenario 3: Large sample, small difference\nconst test3 = zTest(500, 10000, 520, 10000);\n\/\/ Rate A: 5.00%, Rate B: 5.20%\n\/\/ z = 0.64, p = 0.52 -&gt; Not significant (4% lift too small to detect)\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">When to Check Results<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">The Peeking Problem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Checking results repeatedly inflates your false positive rate:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Check Frequency<\/th>\n<th>Actual False Positive Rate (vs. intended 5%)<\/th>\n<\/tr>\n<\/thead>\n<tbody><tr>\n<td>Once (at planned end)<\/td>\n<td>5%<\/td>\n<\/tr>\n<tr>\n<td>Daily for 7 days<\/td>\n<td>~15%<\/td>\n<\/tr>\n<tr>\n<td>Daily for 14 days<\/td>\n<td>~20%<\/td>\n<\/tr>\n<tr>\n<td>Daily for 30 days<\/td>\n<td>~25-30%<\/td>\n<\/tr>\n<\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Every time you check, you give randomness another chance to fool you.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Solutions to the Peeking Problem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Option 1: Fixed-horizon testing<\/strong>. Set a sample size. 
Don&#39;t look until you reach it, or limit yourself to a few checkpoints fixed in advance.<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function shouldCheck(experiment) {\n  const currentSample = experiment.totalImpressions;\n  const requiredSample = experiment.requiredSampleSize;\n\n  \/\/ Only check at predetermined points. The 50% and 75% checkpoints are interim looks,\n  \/\/ so decisions made there need a stricter threshold (see Option 2)\n  const checkpoints = [\n    requiredSample * 0.5,\n    requiredSample * 0.75,\n    requiredSample,\n  ];\n\n  \/\/ True during the first day of traffic after each checkpoint is crossed\n  return checkpoints.some(cp =&gt;\n    currentSample &gt;= cp &amp;&amp; currentSample &lt; cp + experiment.dailyTraffic\n  );\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Option 2: Sequential testing<\/strong>. Use a stricter threshold that adjusts for multiple looks:<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function sequentialSignificance(pValue, numLooks, totalPlannedLooks) {\n  \/\/ O&#39;Brien-Fleming-type spending function (alpha spending)\n  \/\/ Gets stricter early, more lenient later\n  \/\/ Simplification: compares the p-value against the cumulative alpha spent so far;\n  \/\/ exact group-sequential boundaries require numerical computation\n  \/\/ Uses normalCDF from the z-test example above\n  const spendingFunction = (fraction) =&gt; {\n    return 2 * (1 - normalCDF(1.96 \/ Math.sqrt(fraction)));\n  };\n\n  \/\/ Fraction of the planned looks completed so far (0 to 1)\n  const fraction = numLooks \/ totalPlannedLooks;\n  const adjustedAlpha = spendingFunction(fraction);\n\n  return pValue &lt; adjustedAlpha;\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Option 3: Always Valid Inference<\/strong>. 
Use confidence sequences that remain valid at any stopping point:<\/p>\n\n\n\n<pre><code class=\"language-javascript\">function alwaysValidConfidence(conversionsA, totalA, conversionsB, totalB) {\n  const pA = conversionsA \/ totalA;\n  const pB = conversionsB \/ totalB;\n  const diff = pB - pA;\n\n  \/\/ Wider interval that accounts for continuous monitoring\n  \/\/ Heuristic widening for illustration, not a formally derived confidence sequence\n  const se = Math.sqrt(pA * (1 - pA) \/ totalA + pB * (1 - pB) \/ totalB);\n  const mixingRate = 1 \/ Math.sqrt(totalA + totalB);\n  const adjustedWidth = 1.96 * se * Math.sqrt(1 + mixingRate * Math.log(1 + (totalA + totalB)));\n\n  return {\n    lower: diff - adjustedWidth,\n    upper: diff + adjustedWidth,\n    significant: diff - adjustedWidth &gt; 0 || diff + adjustedWidth &lt; 0,\n  };\n}\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Common Scenarios<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 1: Significant but Small Effect<\/h3>\n\n\n\n<pre><code>Variant A: 4.00% conversion (6400 \/ 160000)\nVariant B: 4.15% conversion (6640 \/ 160000)\np-value: 0.03 (significant)\nLift: 3.75%\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The result is statistically significant, but a 3.75% relative lift might not be worth the engineering effort to implement permanently. Use <strong>practical significance<\/strong> alongside statistical significance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 2: Large Effect but Not Significant<\/h3>\n\n\n\n<pre><code>Variant A: 3.00% conversion (30 \/ 1000)\nVariant B: 4.50% conversion (45 \/ 1000)\np-value: 0.08 (not significant)\nLift: 50%\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A 50% lift is huge, but with only 1000 samples per variant, you can&#39;t be sure it&#39;s real. 
Options:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Continue running the test to collect more data<\/li>\n<li>Accept a 90% confidence threshold (p = 0.08 clears it), but only if the decision is low-risk and reversible<\/li>\n<li>Use it as evidence to run a larger, properly powered test<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scenario 3: One Variant Wins on Clicks, Other on Conversions<\/h3>\n\n\n\n<pre><code>CTA clicks: A = 5.0%, B = 4.2% (A wins, p = 0.01)\nInstalls:   A = 1.8%, B = 2.3% (B wins, p = 0.04)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Both results are significant but point in different directions. This suggests Variant A generates more clicks but lower-quality traffic. Decide based on which metric matters more to your business (usually the downstream metric, installs in this case).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Choosing Your Confidence Level<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">95% Confidence (Standard)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use for decisions that are hard to reverse:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Changing the default deep link destination for all traffic<\/li>\n<li>Redesigning the smart banner permanently<\/li>\n<li>Modifying the onboarding flow for all new users<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">90% Confidence (Exploratory)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use for low-risk, reversible decisions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Testing CTA text that can be changed back instantly<\/li>\n<li>Initial screening of many variants (to narrow to 2-3 for a 95% test)<\/li>\n<li>Seasonal campaigns with limited time windows<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">99% Confidence (High Stakes)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use when the cost of being wrong is very high:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Changes to payment flows<\/li>\n<li>Pricing experiments<\/li>\n<li>Changes that affect regulatory 
compliance<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Practical Decision Framework<\/h2>\n\n\n\n<pre><code class=\"language-javascript\">function makeDecision(testResults) {\n  \/\/ Assumes numeric inputs: lift is the relative lift in percent (e.g. 8.3) and the\n  \/\/ confidence interval bounds are absolute differences in conversion rate (e.g. 0.005)\n  const { pValue, lift, confidenceInterval, sampleSize, requiredSample } = testResults;\n\n  \/\/ Check if test is complete\n  if (sampleSize &lt; requiredSample) {\n    return { decision: &#39;WAIT&#39;, reason: `Need ${requiredSample - sampleSize} more samples` };\n  }\n\n  \/\/ Check statistical significance\n  if (pValue &gt;= 0.05) {\n    if (pValue &lt; 0.10 &amp;&amp; Math.abs(lift) &gt; 10) {\n      return { decision: &#39;EXTEND&#39;, reason: &#39;Marginally significant with large effect. Run longer.&#39; };\n    }\n    return { decision: &#39;NO_WINNER&#39;, reason: &#39;No significant difference detected.&#39; };\n  }\n\n  \/\/ Check practical significance\n  if (Math.abs(lift) &lt; 3) {\n    return { decision: &#39;NO_WINNER&#39;, reason: &#39;Statistically significant but too small to matter.&#39; };\n  }\n\n  \/\/ Check confidence interval width\n  const ciWidth = confidenceInterval.upper - confidenceInterval.lower;\n  if (ciWidth &gt; 0.1) {\n    return { decision: &#39;EXTEND&#39;, reason: &#39;Significant but confidence interval too wide.&#39; };\n  }\n\n  \/\/ Winner\n  return {\n    decision: &#39;WINNER&#39;,\n    winner: lift &gt; 0 ? &#39;B&#39; : &#39;A&#39;,\n    lift: lift,\n    confidence: pValue &lt; 0.01 ? 
&#39;99%&#39; : &#39;95%&#39;,\n  };\n}\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Best Practices<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Set your significance threshold before the test starts<\/strong>: Don&#39;t move the goalposts after seeing results.<\/li>\n<li><strong>Report confidence intervals, not just p-values<\/strong>: The range of plausible effects is more useful than a binary yes\/no.<\/li>\n<li><strong>Use one-tailed tests only when justified<\/strong>: If you truly don&#39;t care whether B is worse than A, a one-tailed test gives more power. But be honest about your intention.<\/li>\n<li><strong>Account for multiple comparisons<\/strong>: Testing 4 variants means 6 pairwise comparisons. Use Bonferroni correction (divide alpha by the number of comparisons, so 0.05 \/ 6 is about 0.0083) or control the false discovery rate.<\/li>\n<li><strong>Distinguish statistical from practical significance<\/strong>: A 0.5% lift that&#39;s statistically significant probably isn&#39;t worth implementing.<\/li>\n<li><strong>Don&#39;t stop tests mid-week<\/strong>: Traffic patterns differ by day of the week. Always run complete weeks so every weekday is equally represented.<\/li>\n<li><strong>Document your analysis plan<\/strong>: Before starting, write down what you&#39;ll measure, what threshold you&#39;ll use, and when you&#39;ll check.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">For A\/B testing features, see <a href=\"https:\/\/tolinku.com\/features\/ab-testing\">Tolinku A\/B testing<\/a>. For interpreting results, see the <a href=\"https:\/\/tolinku.com\/docs\/user-guide\/ab-testing\/results\/\">A\/B testing results docs<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understand statistical significance in A\/B testing. 
Learn when results are reliable, how to interpret p-values, confidence intervals, and when to stop a test.<\/p>\n","protected":false},"author":2,"featured_media":1055,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"rank_math_title":"Statistical Significance for A\/B Tests: What It Means","rank_math_description":"Understand statistical significance in A\/B testing. Learn when results are reliable, how to interpret p-values, and when to stop a test.","rank_math_focus_keyword":"statistical significance A\/B testing","rank_math_canonical_url":"","rank_math_facebook_title":"","rank_math_facebook_description":"","rank_math_facebook_image":"https:\/\/tolinku.com\/blog\/wp-content\/uploads\/2026\/03\/og-statistical-significance-ab-tests.png","rank_math_facebook_image_id":"","rank_math_twitter_title":"","rank_math_twitter_description":"","rank_math_twitter_image":"https:\/\/tolinku.com\/blog\/wp-content\/uploads\/2026\/03\/og-statistical-significance-ab-tests.png","footnotes":""},"categories":[13],"tags":[60,37,191,20,225,69,256,258],"class_list":["post-1056","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-growth","tag-ab-testing","tag-analytics","tag-conversions","tag-deep-linking","tag-experimentation","tag-mobile-development","tag-optimization","tag-statistics"],"_links":{"self":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts\/1056","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/comments?post=1056"}],"version-history":[{"count":2,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts\/1056\/revisions"}],"predecessor-version":[{"id":2
230,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/posts\/1056\/revisions\/2230"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/media\/1055"}],"wp:attachment":[{"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/media?parent=1056"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/categories?post=1056"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tolinku.com\/blog\/wp-json\/wp\/v2\/tags?post=1056"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}