Statistical Significance in A/B Testing: What It Really Means and How to Use It
Statistical significance is the most misunderstood concept in A/B testing. Most marketers use it wrong, interpret it wrong, and make decisions based on false confidence. This guide explains what it actually means — and how to use it correctly.
What Statistical Significance Actually Means
Note: Statistical significance does NOT mean “we’re confident the variation is better.” It means: “If there were truly no difference between A and B, the probability of seeing a result this extreme (or more extreme) by random chance is below our threshold (typically 5%).” This is a subtle but critical distinction that changes how you should make decisions.
The p-Value: Your Significance Indicator
The p-value is the probability of observing your test results (or more extreme results) IF there is actually no difference between variations.
- p = 0.05 (5%): If there’s no real difference, there’s a 5% chance you’d see results this extreme by chance
- p = 0.01 (1%): Only 1% chance of seeing these results by random chance
- p = 0.50 (50%): If there’s no real difference, you’d see results this extreme half the time; the observed lift is well within normal noise
What p-values are NOT:
- The probability that B is better than A
- The probability that the result is “real”
- The probability that you’ll see the same lift in production
- A measure of effect size or business impact
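To make the definition concrete, here is a minimal sketch of how a p-value is computed for a simple two-variant conversion test. It assumes Python with the statsmodels package, and the visitor and conversion counts are made up for illustration.

```python
# Minimal p-value calculation for a two-variant conversion test.
# Assumes `statsmodels`; the counts below are illustrative, not real data.
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]        # conversions observed in A and B
visitors = [10_000, 10_000]     # visitors exposed to A and B

# Two-sided z-test for a difference in conversion rates.
z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors,
                                    alternative="two-sided")
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")

# p_value is the probability of a gap at least this large IF A and B truly
# convert at the same rate, not the probability that B is better.
```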
Confidence Levels: 90% vs 95% vs 99%
| Confidence Level | Alpha (Significance) | False Positive Rate (when there is no real effect) | When to Use |
|---|---|---|---|
| 90% | 0.10 | 1 in 10 tests | Low-risk changes, exploratory tests |
| 95% | 0.05 | 1 in 20 tests | Standard for most CRO programs |
| 99% | 0.01 | 1 in 100 tests | High-stakes changes (pricing, checkout) |
How to choose:
- 95% is the industry standard and appropriate for most tests
- Use 90% when: the cost of being wrong is low, or you want to run tests faster
- Use 99% when: the change is hard to reverse, affects revenue directly, or has high implementation cost
Statistical Power: The Other Half of the Equation
While significance (alpha) controls false positives, statistical power (1-beta) controls false negatives.
Power = the probability of detecting a real effect when one exists.
| Power | Miss Rate (beta) | Meaning |
|---|---|---|
| 80% | 20% | You’ll miss 1 in 5 real winners (industry standard) |
| 90% | 10% | You’ll miss 1 in 10 real winners (more conservative) |
| 50% | 50% | Coin flip — you’ll miss half of all real effects |
In practice, many A/B tests run at only 30-50% power, meaning they miss the majority of real effects. This is why sample size matters so much.
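For sample-size planning, the sketch below shows how the required traffic per variant grows as you tighten alpha or raise power. It assumes Python with statsmodels, a 5% baseline conversion rate, and a 10% relative lift as the smallest effect worth detecting; swap in your own numbers.

```python
# Rough sample-size planning: visitors needed per variant to detect a given
# lift at a given confidence level (alpha) and power.
# Assumes `statsmodels`; baseline rate and target lift are assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                  # assumed 5% control conversion rate
target = baseline * 1.10         # smallest lift worth detecting: +10% relative

effect_size = proportion_effectsize(target, baseline)  # Cohen's h
analysis = NormalIndPower()

for alpha in (0.10, 0.05, 0.01):        # 90% / 95% / 99% confidence
    for power in (0.80, 0.90):          # 20% / 10% miss rate
        n = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                 power=power, ratio=1.0, alternative="two-sided")
        print(f"alpha={alpha:.2f}, power={power:.0%}: "
              f"~{int(round(n)):,} visitors per variant")
```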
The Four Possible Outcomes of Any A/B Test
| Test Result | Reality: No Difference | Reality: B Is Better |
|---|---|---|
| Test says: No Difference | Correct (True Negative) | Missed Win (False Negative / Type II Error) |
| Test says: B Is Better | False Win (False Positive / Type I Error) | Correct (True Positive) |
- Alpha controls the false positive rate (bottom-left cell)
- Power controls the true positive rate (bottom-right cell)
- Most teams obsess over alpha but ignore power, which means they constantly miss real winners (the simulation sketch below shows both rates in action)
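A quick way to internalize this table is to simulate tests where the truth is known. The sketch below assumes Python with numpy and statsmodels; all rates, sample sizes, and run counts are illustrative. It counts how often a winner is declared when there is no difference versus when B truly converts better.

```python
# Monte Carlo sketch of the four outcomes: simulate many A/B tests where the
# truth is known, then count how often the test declares a winner.
# Assumes numpy + statsmodels; rates, sample size, and run count are illustrative.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
ALPHA, N, RUNS = 0.05, 20_000, 2_000   # significance level, visitors per variant, simulated tests

def winner_rate(rate_a: float, rate_b: float) -> float:
    """Fraction of simulated tests that declare a significant difference."""
    wins = 0
    for _ in range(RUNS):
        conv_a = rng.binomial(N, rate_a)
        conv_b = rng.binomial(N, rate_b)
        _, p = proportions_ztest([conv_a, conv_b], [N, N])
        wins += p < ALPHA
    return wins / RUNS

# Reality: no difference -> false positive rate should hover near alpha (~5%).
print("False positive rate:", winner_rate(0.050, 0.050))
# Reality: B is truly better -> true positive rate is the test's power.
print("True positive rate (power):", winner_rate(0.050, 0.055))
```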
Common Significance Mistakes
1. Declaring significance too early
Significance fluctuates dramatically in the early days of a test. A p-value of 0.03 on day 3 might be 0.15 on day 7 and 0.02 on day 21. Never declare a winner based on early p-values.
2. Confusing significance with importance
A test can be statistically significant but practically meaningless. A 0.1% conversion rate improvement might be significant with enough data, but it’s not worth implementing. Always pair significance with effect size.
3. Ignoring multiple comparison corrections
If you test 5 metrics simultaneously at 95% confidence, your chance of at least one false positive is ~23%, not 5%. Designate ONE primary metric or adjust your significance threshold (see the arithmetic sketch after this list).
4. P-hacking (unintentional)
Checking results daily and stopping when you see significance, adding more data when results aren’t significant, or slicing data until you find a “significant” segment — all inflate false positive rates.
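The multiple-comparisons trap from mistake 3 is easy to see with a little arithmetic. This sketch computes the family-wise false positive probability for different numbers of metrics, plus the Bonferroni-adjusted per-metric threshold that keeps the overall rate near 5%; plain Python, no data required.

```python
# Mistake 3 in numbers: the chance of at least one false positive grows quickly
# as you track more metrics at the same alpha.
ALPHA = 0.05

for n_metrics in (1, 3, 5, 10):
    family_wise = 1 - (1 - ALPHA) ** n_metrics   # P(at least one false positive)
    bonferroni = ALPHA / n_metrics               # per-metric threshold keeping ~5% overall
    print(f"{n_metrics:>2} metrics: P(>=1 false positive) = {family_wise:.0%}, "
          f"Bonferroni alpha per metric = {bonferroni:.4f}")
```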
Practical Significance vs Statistical Significance
Note: Statistical significance tells you whether the observed effect is unlikely to be random noise. Practical significance tells you whether it matters. A test should only be “called” when it passes BOTH thresholds: (1) Statistically significant: p < 0.05 (or your chosen threshold), and (2) Practically significant: the effect size is large enough to matter to your business. A minimal decision check is sketched after the thresholds below.
Setting practical significance thresholds:
- eCommerce: Minimum 5-10% relative conversion rate improvement
- SaaS: Minimum 3-5% improvement in trial starts or signups
- Lead gen: Minimum 10-15% improvement in form submissions
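As a minimal sketch of the “call it only when both bars are cleared” rule, the hypothetical helper below combines a p-value check with a minimum relative lift. The default 5% threshold mirrors the eCommerce guideline above; both parameters are assumptions you should tune.

```python
# Hypothetical decision check: a result is only a "winner" when it is both
# statistically and practically significant. Thresholds are assumptions.
def should_call_winner(p_value: float, relative_lift: float,
                       alpha: float = 0.05, min_lift: float = 0.05) -> bool:
    """True only if the result clears both the statistical and practical bars."""
    statistically_significant = p_value < alpha
    practically_significant = relative_lift >= min_lift
    return statistically_significant and practically_significant

# A 0.8% lift at p = 0.01 is significant but not worth shipping:
print(should_call_winner(p_value=0.01, relative_lift=0.008))   # False
# A 12% lift at p = 0.03 clears both bars:
print(should_call_winner(p_value=0.03, relative_lift=0.12))    # True
```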
One-Tailed vs Two-Tailed Tests
| Aspect | One-Tailed | Two-Tailed |
|---|---|---|
| Tests for | B is better than A (one direction) | B is different from A (either direction) |
| Sample needed | ~20% less | Standard |
| Detects harm? | No — misses negative effects | Yes — catches both improvements and degradations |
| Recommendation | Rarely appropriate | Use this (default) |
Always use two-tailed tests unless you have a specific, justified reason not to. Missing a harmful effect is worse than requiring slightly more data.
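The sketch below runs the same made-up data through both versions of the test, assuming Python with statsmodels: the one-tailed p-value is roughly half the two-tailed one, which is exactly why it looks tempting and exactly why it cannot flag a harmful change.

```python
# Same data, two questions: "is B different from A?" (two-tailed) vs
# "is B better than A?" (one-tailed). Assumes `statsmodels`; counts are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 480]        # B first, then A
visitors = [10_000, 10_000]

_, p_two_tailed = proportions_ztest(conversions, visitors, alternative="two-sided")
_, p_one_tailed = proportions_ztest(conversions, visitors, alternative="larger")

print(f"two-tailed p = {p_two_tailed:.3f}")   # detects improvement OR harm
print(f"one-tailed p = {p_one_tailed:.3f}")   # roughly half, but blind to harm
```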
Confidence Intervals > p-Values
Confidence intervals give you more information than p-values:
Example: “The conversion rate lift is 12% +/- 8% (95% CI: 4% to 20%)”
This tells you:
- The best estimate of the effect is 12%
- We’re 95% confident the true effect is between 4% and 20%
- The effect is statistically significant (CI doesn’t include 0%)
- Even the worst case (4%) is still a meaningful improvement
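Here is a minimal sketch of how such an interval can be computed, using a normal-approximation (Wald) interval for the difference in conversion rates. It assumes Python with scipy; the counts are made up to roughly reproduce the 12% example above, and more precise interval methods exist for small samples.

```python
# Normal-approximation (Wald) 95% CI for the absolute difference in conversion
# rates, plus the implied relative lift. Counts are illustrative and chosen to
# roughly match the 12% (4% to 20%) example above. Assumes `scipy`.
from scipy.stats import norm

conv_a, n_a = 1_200, 24_000     # control: 5.0% conversion rate
conv_b, n_b = 1_344, 24_000     # variation: 5.6% conversion rate

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
z = norm.ppf(0.975)             # ~1.96 for a 95% interval

lo, hi = diff - z * se, diff + z * se
print(f"absolute lift: {diff:.2%} (95% CI: {lo:.2%} to {hi:.2%})")
print(f"relative lift: {diff / p_a:.1%} (95% CI: {lo / p_a:.1%} to {hi / p_a:.1%})")

# If the interval excludes 0, the result is significant at the 5% level; if even
# the lower bound clears your practical threshold, the change is worth shipping.
```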
Frequently Asked Questions
Is 90% confidence good enough?
For many CRO tests, yes. The cost of implementing most website changes is low, and the changes are easily reversible. At 90% confidence, a change that does nothing has a 1 in 10 chance of being falsely declared a winner; for cheap, reversible changes that risk is usually acceptable.
What if my test never reaches significance?
Inconclusive results are still informative. They tell you the effect is probably small (smaller than your minimum detectable effect, or MDE). You can: treat the result as no meaningful difference and move on, run a larger test to detect smaller effects, or test a bigger change.
How long should I wait for significance?
Pre-calculate your required sample size before the test starts. Run for at least 14 days regardless of when you hit significance, so weekly traffic and behavior cycles are covered. If you haven’t reached significance after 2x your planned sample, the effect is likely too small to detect.
Note: Get statistically rigorous test recommendations. Our AI audit engine generates test hypotheses with predicted effect sizes and required sample sizes — so you know which tests are worth running before you start.