Statistical Significance in A/B Testing Explained (Plain English)
Key Takeaways
- Statistical significance means your result is unlikely to be caused by random chance — nothing more, nothing less.
- The 95% confidence threshold is a convention, not a law of nature. It balances false positives against test velocity.
- A p-value is the probability of seeing your observed lift (or a larger one) if the variant actually performed the same as control.
- Peeking at results before you hit the planned sample size inflates your false positive rate dramatically — often from 5% to 25% or more.
- Statistical significance tells you a difference is real. It does not tell you the difference is large enough to matter commercially.
Statistical significance in A/B testing tells you how unlikely it is that the difference you measured between your variant and control came from random chance alone. That is it. Every stats textbook complicates this definition, but the plain-English version is enough to run tests responsibly. When a tool says your variant is significant at 95%, it means that if the variant were actually identical to control in the real world, you would only see a result this extreme about 5% of the time by accident. This guide walks through what significance really means, why 95% became the default, how p-values work, what Type I and Type II errors are, and why peeking early ruins more tests than bad copy does. No stats degree required — just clear thinking about what your numbers actually say.
- What statistical significance actually means
- The 95% confidence threshold (and why it exists)
- Understanding p-values without a stats degree
- Type I and Type II errors in plain English
- Confidence intervals vs. significance
- Why the peeking problem ruins tests
- Frequentist vs. Bayesian — a brief orientation
What statistical significance actually means
Statistical significance is a statement about probability, not truth. When a test reaches 95% significance, it does not mean your variant is 95% likely to be better. It means that assuming the variant and control performed identically, there is only a 5% chance you would see a difference this large (or larger) purely from random visitor behavior. The distinction matters because people routinely read "95% significant" as "95% chance this variant wins" — and that is mathematically wrong.
Think of it this way. If you flip a fair coin ten times, you might get seven heads. That does not mean the coin is rigged. It means random variation is real and noisy over small samples. Statistical significance is the formal framework for deciding when the seven-heads pattern is extreme enough that "the coin is rigged" becomes the more reasonable explanation. In A/B testing terms, your control is the fair coin. Your variant is the possibly-rigged coin. Significance is the threshold where you stop attributing the difference to luck and start attributing it to the change you made.
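If you want to see that intuition as arithmetic, here is a minimal Python sketch of the coin example. The seven-heads numbers come straight from the paragraph above:

```python
from math import comb

# Probability of seeing 7 or more heads in 10 flips of a FAIR coin.
# This is the same logic a significance test uses: how surprising is
# the observed result if nothing unusual is actually going on?
n_flips = 10

prob = sum(comb(n_flips, k) for k in range(7, n_flips + 1)) / 2**n_flips
print(f"P(7+ heads from a fair coin): {prob:.3f}")  # ~0.172
```

A fair coin produces seven or more heads about 17% of the time, which is exactly why a small sample cannot be trusted on its own.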
The 95% confidence threshold (and why it exists)
The 95% threshold is not a mathematical law — it is a convention that hardened into standard practice because statistician Ronald Fisher casually suggested it in the 1920s. He never claimed 95% was optimal. He just thought a 1-in-20 false positive rate was a reasonable tradeoff for agricultural experiments. A century later, nearly every A/B testing platform, including Copysplit, defaults to it because the industry standardized around what Fisher happened to say.
That does not mean 95% is wrong — it means it is a choice. If you are testing a change that could tank revenue (like a checkout flow rewrite), you might want 99% confidence. If you are testing a headline on a low-stakes blog post, 90% might be fine. The threshold is a dial, not a verdict. What matters is that you set the threshold before the test starts and stick to it. Moving your threshold after seeing results is called p-hacking, and it invalidates everything. In our experience, teams who pre-commit to 95% and stay disciplined about it catch far fewer false winners than teams who treat the threshold as a suggestion.
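To make the dial concrete, here is a small sketch (using SciPy, a common choice for this kind of calculation) of the evidence bar each threshold sets in a standard two-sided test:

```python
from scipy.stats import norm

# How much evidence each confidence threshold demands (two-sided test).
# Higher confidence = a larger z-score your lift must clear = slower tests.
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    z_critical = norm.ppf(1 - alpha / 2)
    print(f"{confidence:.0%} confidence -> |z| must exceed {z_critical:.2f}")
# 90% -> 1.64, 95% -> 1.96, 99% -> 2.58
```

The jump from 1.96 to 2.58 is the price of the checkout-flow caution described above: more evidence, and therefore more traffic, before you can call anything.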
Understanding p-values without a stats degree
A p-value is the single most misunderstood number in A/B testing. Here is the plain-English version: the p-value is the probability of seeing your observed result (or something more extreme) if the variant and control were actually performing the same. If your p-value is 0.03, that means there is a 3% chance you would see this much lift (or more) even if the variant had zero real effect. Because 3% is below the 5% threshold (1 minus 0.95), you call it significant.
The critical thing the p-value does not tell you: how likely it is that your variant is actually better. That is a different question entirely, and answering it requires Bayesian reasoning. A small p-value does not mean a big effect, either — with enough traffic, even a 0.1% lift can reach significance. So p-values answer one specific question: is this result surprising enough that chance is a poor explanation? That is useful. But it is narrow. Treat the p-value as a filter against randomness, not as a measure of business impact or probability of success.
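If you want to see where a number like p = 0.03 comes from, here is a minimal sketch of a two-proportion z-test, the standard frequentist calculation behind most A/B significance tools. The conversion counts are hypothetical, invented purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical test data (made-up numbers for illustration).
control_conv, control_n = 200, 5000   # 4.0% conversion rate
variant_conv, variant_n = 245, 5000   # 4.9% conversion rate

p_control = control_conv / control_n
p_variant = variant_conv / variant_n

# Pooled rate under the null hypothesis "variant == control".
p_pool = (control_conv + variant_conv) / (control_n + variant_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))

z = (p_variant - p_control) / se
p_value = 2 * norm.sf(abs(z))  # two-sided: extreme in either direction

print(f"lift: {p_variant / p_control - 1:+.1%}, z = {z:.2f}, p = {p_value:.3f}")
# -> lift: +22.5%, z = 2.18, p = 0.029 -- below 0.05, so "significant"
```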
Copysplit uses frequentist statistics with a 95% confidence threshold by default — the most widely accepted standard for copy testing.
Start your 14-day free trial →
Type I and Type II errors in plain English
Every A/B test can fail in two ways. A Type I error is a false positive — you declare a winner when the variant actually performs the same as control. A Type II error is a false negative — a real winner exists, but you miss it because the test did not have enough statistical power to detect the effect. The 95% confidence threshold controls your Type I error rate (5%). Your sample size and effect size control your Type II error rate.
Most teams obsess over Type I errors and ignore Type II. That is a mistake. In our experience, for every false winner teams accidentally ship, they kill two or three real winners by ending underpowered tests too early and declaring "no significant difference." A test that concludes "we did not find a winner" at 2,000 visitors per variant often would have found one at 8,000. Power matters. Before you run a test, estimate the minimum detectable effect your sample size can actually catch. If you can only detect 15%+ lifts and most real winners are 5-8%, you are setting yourself up for Type II errors on every test you run.
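Here is a rough planning sketch, using the standard two-proportion approximation, of the smallest lift different sample sizes can reliably catch. The 5% baseline conversion rate and the sample sizes are assumptions for illustration:

```python
from math import sqrt
from scipy.stats import norm

def minimum_detectable_effect(baseline_rate, n_per_arm, alpha=0.05, power=0.80):
    """Smallest absolute lift a two-sided test can reliably detect.

    Standard approximation for comparing two proportions with equal
    sample sizes; treat the output as a planning estimate, not a guarantee.
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # evidence bar (Type I control)
    z_power = norm.ppf(power)           # sensitivity bar (Type II control)
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * se

baseline = 0.05  # hypothetical 5% control conversion rate
for n in (2_000, 8_000):
    mde = minimum_detectable_effect(baseline, n)
    print(f"n={n:,}/arm: detectable lift >= {mde:.4f} abs ({mde / baseline:.0%} relative)")
# Quadrupling traffic halves the detectable effect -- sensitivity is expensive.
```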
Confidence intervals vs. significance
Significance is binary — yes or no, above or below the threshold. Confidence intervals are continuous — they tell you the range of plausible true effects. A test might report "variant lifted conversions by 12%, 95% confidence interval 3% to 21%." That means you are 95% confident the true lift is somewhere between 3% and 21%. Significance answers: is there an effect? Confidence intervals answer: how big is the effect, realistically?
We recommend looking at confidence intervals for every test, not just the significance verdict. A test can be significant with a huge interval — say, 0.5% to 25%. That is technically a winner, but the true effect could be anywhere from trivial to massive. You do not know which. Narrow intervals mean you have a precise estimate. Wide intervals mean you should probably keep running the test to tighten them up, even if you have already crossed the significance threshold. Confidence intervals are how you separate "statistically real" from "actually useful."
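Reusing the same hypothetical counts from the p-value example, here is a sketch of how a 95% interval for the lift is computed (a simple Wald interval, which is reasonable at typical A/B-test sample sizes):

```python
from math import sqrt
from scipy.stats import norm

# 95% confidence interval for the difference in conversion rates.
control_conv, control_n = 200, 5000   # hypothetical counts, as before
variant_conv, variant_n = 245, 5000

p_c = control_conv / control_n
p_v = variant_conv / variant_n
diff = p_v - p_c

se = sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
z = norm.ppf(0.975)  # 1.96 for a 95% interval

lo, hi = diff - z * se, diff + z * se
print(f"absolute lift: {diff:+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")
# Rough relative view: divide the endpoints by the control rate.
print(f"relative lift: {diff / p_c:+.0%}, roughly [{lo / p_c:+.0%}, {hi / p_c:+.0%}]")
```

Note how wide the relative interval comes out (roughly +2% to +43%) even though the test is significant: exactly the "technically a winner, but how big?" situation described above.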
Want the practical side — when to actually call a winner and ship the change?
Read when to call a winner →
Why the peeking problem ruins tests
Peeking is the single most destructive habit in A/B testing. It works like this: you start a test, check it after two days, see the variant is 95% significant, and call the winner. What you do not realize is that by checking early, you have just inflated your false positive rate from 5% to somewhere between 20% and 30%. Every time you peek and stop, you are giving the test another chance to cross the threshold by random fluctuation. Run enough peeks and you will eventually cross it by pure luck — even if the variant is identical to control.
A team we worked with last year ran a hero headline test, peeked at day three, saw 97% significance, and shipped the variant. Two weeks later they checked historical data and the winning variant was actually underperforming. They had caught a random spike early and locked it in. Fixed-horizon frequentist tests assume you check the result once, at the pre-planned sample size. If you check more often, your math breaks. The fix is simple: decide your sample size before the test starts, and do not look at results until you hit it. Sequential testing methods exist that allow safe peeking, but most standard A/B tools (including Copysplit by default) assume you are running fixed-horizon tests. Honor the assumption or the stats lie to you.
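You do not have to take the inflation numbers on faith. Here is a small simulation sketch of an A/A test, where the variant is identical to control by construction, comparing one honest look at the planned sample size against stopping at the first "significant" peek. The traffic numbers and peek schedule are assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def z_test_p(conv_a, conv_b, n):
    """Two-sided p-value for a two-proportion z-test, equal n per arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    if p_pool in (0, 1):
        return 1.0
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 2 * norm.sf(abs((conv_b - conv_a) / n) / se)

true_rate = 0.05                             # A/A test: "variant" IS control
n_final = 10_000                             # planned sample size per arm
peeks = list(range(500, n_final + 1, 500))   # "check every morning"
runs = 2_000

fp_once = fp_peek = 0
for _ in range(runs):
    a = np.cumsum(rng.random(n_final) < true_rate)  # running conversion counts
    b = np.cumsum(rng.random(n_final) < true_rate)
    fp_once += z_test_p(a[-1], b[-1], n_final) < 0.05
    fp_peek += any(z_test_p(a[n - 1], b[n - 1], n) < 0.05 for n in peeks)

print(f"single look at planned n:          {fp_once / runs:.1%}")  # ~5%, as designed
print(f"stop at first 'significant' peek:  {fp_peek / runs:.1%}")   # far above 5%
```

Even though both arms are identical, the peeker calls a false winner several times more often than the nominal 5% rate; the more often you look, the worse it gets.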
Frequentist vs. Bayesian — a brief orientation
Everything we have covered so far — p-values, confidence intervals, the 95% threshold — belongs to the frequentist school of statistics. There is a competing approach called Bayesian statistics that answers different questions in different ways. Bayesian tests report things like "there is a 92% probability that the variant is better than control," which is a much more intuitive statement than "we reject the null hypothesis at p less than 0.05." Bayesian methods also handle peeking more gracefully, because they update continuously rather than waiting for a fixed sample size.
Both approaches are mathematically valid. Frequentist is the industry default because it is simpler to implement, easier to audit, and what most stakeholders are familiar with. Copysplit uses frequentist statistics at a 95% confidence threshold — the standard most teams expect. Honest limitation: frequentist stats struggle when you have very low traffic or when you genuinely need to make decisions with incomplete data. If that is your situation, Bayesian may fit better. We wrote a full comparison of the two approaches if you want to go deeper.
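As a sketch of how a Bayesian tool arrives at a statement like "there is a 92% probability the variant is better": with a flat Beta(1, 1) prior, each arm's conversion rate has a Beta posterior, and the probability is a direct Monte Carlo comparison of the two. The counts reuse the hypothetical example from earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts, same as the earlier frequentist examples.
control_conv, control_n = 200, 5000
variant_conv, variant_n = 245, 5000

# Beta(1 + conversions, 1 + non-conversions) posterior for each arm.
post_control = rng.beta(1 + control_conv, 1 + control_n - control_conv, 200_000)
post_variant = rng.beta(1 + variant_conv, 1 + variant_n - variant_conv, 200_000)

# "Probability the variant beats control" -- the intuitive statement.
p_better = (post_variant > post_control).mean()
print(f"P(variant beats control) = {p_better:.1%}")
```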
For a full breakdown of both approaches and when each makes sense:
Read Bayesian vs. frequentist A/B testing →
Frequently asked questions
Does 95% significance mean my variant has a 95% chance of winning?
No. It means that if the variant and control were actually identical, you would only see a difference this large about 5% of the time by chance. "Probability the variant wins" is a different question, and one frequentist statistics does not answer.
Can I just run my test until it hits significance and then stop?
No. That is peeking, and it can inflate your false positive rate from 5% to 20-30% or more. Decide your sample size before the test starts and check the result once, when you hit it.
Is statistical significance the same as practical significance?
No. Statistical significance means the result is unlikely to be random. Practical significance means the effect is big enough to matter commercially. With enough traffic, even a 0.1% lift can reach significance.
Why is 95% the standard and not 90% or 99%?
Convention. Ronald Fisher suggested a 1-in-20 false positive rate in the 1920s and the industry standardized around it. Use 99% for high-stakes changes and 90% for low-stakes ones, but pick the threshold before the test starts.
What is a p-value in one sentence?
The probability of seeing your observed result (or something more extreme) if the variant and control were actually performing the same.
Does Copysplit use frequentist or Bayesian statistics?
Frequentist, with a 95% confidence threshold by default (the standard most teams expect).
Want to avoid the most common pitfalls that turn real winners into false positives?
Read common A/B testing mistakes →
Statistical significance is a tool, not an oracle. It tells you whether a result is unlikely to be random — nothing more. It does not tell you how big the effect is, whether it is worth shipping, or whether it will hold up next quarter. Treat it as one input among several: pair the significance verdict with confidence intervals, practical business context, and a pre-committed sample size. Resist the urge to peek, resist the urge to move thresholds after the fact, and resist the urge to read "95% significant" as "95% likely to win." If you internalize just those three disciplines, you will run more honest tests than 90% of teams — and you will ship fewer false winners that quietly drag down conversions months later.
Ready to test your copy?
Stop guessing which headlines convert. Start testing with Copysplit today.
Start Free Trial →