How Long Should You Run an A/B Test? (Sample Size Math)
Key Takeaways
- There is no universal "run it for two weeks" rule — duration depends entirely on your traffic, baseline conversion rate, and the minimum detectable effect (MDE) you care about.
- At 95% confidence and 80% power, a page with a 3% baseline conversion rate needs roughly 14,000 visitors per variant — about 28,000 in total — to reliably detect a 20% relative lift.
- Always run tests in full 7-day increments so you capture every day-of-week behavior, even if statistical significance arrives mid-week.
- Peeking at results and stopping the moment p dips below 0.05 is the single most common cause of false-positive wins — fix the sample size before you start.
- If your traffic is too low to hit sample size in a reasonable time, either widen the MDE, test higher-impact elements, or switch to qualitative research instead.
The honest answer to "how long should I run an A/B test?" is: run it until you hit the sample size your effect size requires, rounded up to a full number of weeks — not a day less, not a day more. For most copy tests on a reasonably trafficked page converting at 2-4%, that lands somewhere between two and six weeks. But the number that matters is not calendar time, it is visitors per variant, and that number flows directly from three inputs: your baseline conversion rate, the minimum detectable effect you want to catch, and your weekly traffic. In this guide I will walk through the math behind each of those inputs, give you a practical lookup table for common traffic levels, and explain why peeking at your dashboard every morning is the fastest way to fool yourself.
- Why two weeks is the wrong answer
- The three variables that determine test duration
- A practical rule of thumb (with traffic table)
- The full-week business cycle rule
- What to do with low-traffic pages
- Why you should not peek at results
- How Copysplit handles duration automatically
Why two weeks is the wrong answer
The "always run tests for two weeks" rule of thumb became popular because it sounds authoritative and is easy to remember. The problem is that it is wrong in both directions. A high-traffic ecommerce homepage on a site with 500,000 monthly visitors will blow past statistical significance on a headline test in 3-4 days — running it for 14 is just delaying your next experiment. Meanwhile, a SaaS pricing page with 8,000 monthly visitors testing a subhead tweak worth a 10% relative lift might need six months or more to reach significance, and calling it after two weeks would give you a meaningless result.
Test duration is not a number you choose. It is a number you calculate. The mistake most teams make is reversing that relationship — picking a time window first, then hoping the data cooperates. In our experience auditing experiment programs, more than 60% of winning tests that fail to replicate in production were stopped early based on a fixed calendar window rather than a fixed sample size. The fix is to flip the process: calculate your required sample size before the test starts, then let traffic determine how long you run.
The three variables that determine test duration
Three inputs determine how many visitors you need per variant, and therefore how many days your test will run. The first is your baseline conversion rate — the current rate for the control version. Lower baselines require much larger samples because conversions are rarer and noisier. For a fixed relative lift, required sample size scales roughly with (1 − p)/p, so a page converting at 1% needs about eleven times the sample of a page converting at 10% to detect the same relative lift.
The second is the minimum detectable effect (MDE), usually expressed as a relative percentage. Sample size scales with the inverse square of the MDE, so detecting a 5% lift needs roughly 16 times the sample of detecting a 20% lift. Most copy tests realistically produce lifts in the 8-25% range, so setting MDE below that is mathematically possible but rarely practical — you will run tests for months. The third is your weekly traffic to the tested page, which determines how quickly you accumulate the sample you need. Multiply the required per-variant sample size by two (you split traffic between control and variant), divide by weekly traffic, and round up to the nearest full week.
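To make that concrete, here is a minimal Python sketch of the whole recipe, using the standard normal-approximation formula for comparing two proportions at 95% confidence and 80% power. The function names and example inputs are illustrative, not from any particular tool:

```python
# Minimal sketch: required sample size and test duration for a two-variant
# test, using the standard normal-approximation formula for comparing two
# proportions. Function names and inputs are illustrative.
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in EACH variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate if the variant wins
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

def weeks_to_run(baseline: float, relative_mde: float,
                 weekly_traffic: int) -> int:
    """Duration in full weeks for a 50/50 split across two variants."""
    total_visitors = 2 * sample_size_per_variant(baseline, relative_mde)
    return math.ceil(total_visitors / weekly_traffic)  # round UP to full weeks

# Example: 3% baseline, 20% relative MDE, ~2,300 visitors/week
print(sample_size_per_variant(0.03, 0.20))  # ~13,911 per variant
print(weeks_to_run(0.03, 0.20, 2_300))      # 13 (12.1 weeks, rounded up)
```

Plug in your own baseline, MDE, and weekly traffic, and write down the output before you launch. That number is your finish line.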
A practical rule of thumb
Here is a reference table for common traffic levels, all assuming 95% confidence, 80% power, and a 50/50 traffic split. These are approximate — your exact numbers will shift with your baseline — but they are close enough for planning purposes.
- 5,000 visitors/month, 2% baseline, 20% MDE: roughly 8-9 months per test
- 10,000 visitors/month, 3% baseline, 20% MDE: roughly 13 weeks (about 3 months) per test
- 25,000 visitors/month, 3% baseline, 20% MDE: roughly 5 weeks per test
- 50,000 visitors/month, 4% baseline, 20% MDE: roughly 2 weeks per test
- 100,000 visitors/month, 4% baseline, 15% MDE: roughly 2 weeks per test
- 250,000 visitors/month, 5% baseline, 10% MDE: roughly 2 weeks per test
- 500,000+ visitors/month, 5% baseline, 10% MDE: roughly 1 week per test
Use this as a starting point, not gospel. If your baseline is higher than the row suggests, you will finish faster; if lower, slower. And notice the pattern: sites below 10,000 monthly visitors on the tested page really do need to plan in months, not weeks — there is no shortcut around that without accepting a larger MDE.
Want to understand the statistical side of calling a winner? Read our deep dive on when a test is actually significant.
Read the significance guide →
The full-week business cycle rule
Even after you hit your required sample size, there is one more rule: always run in full 7-day increments. Never stop a test on a Thursday because the math says you are done. The reason is that conversion behavior varies dramatically by day of week — B2B SaaS signups cluster Tuesday through Thursday, consumer ecommerce peaks on weekends, and mobile traffic often behaves differently on weekdays versus weekends. If you run for 9 days starting on a Monday, you have two Mondays and two Tuesdays but only one of every other day, including just one Saturday and one Sunday. That imbalance can easily create or erase a 5-10% apparent lift.
The rule: calculate your sample size, divide by weekly traffic, and round up to the next full week. A test that mathematically needs 11 days should run for 14. A test that needs 17 days should run for 21. This single habit eliminates an entire class of false positives from your program and costs you only a few extra days per experiment.
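In code, the rounding rule is one line. A sketch, assuming you already have the raw day count from the sample size math:

```python
import math

def full_week_duration(days_needed: int) -> int:
    """Round a raw test duration up to the next multiple of 7 days."""
    return math.ceil(days_needed / 7) * 7

print(full_week_duration(11))  # 14
print(full_week_duration(17))  # 21
```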
What to do with low-traffic pages
If the math says your test will take ten weeks to reach significance, you have three honest options. First, widen the MDE. Instead of testing to detect a 10% lift, design bolder variants — completely different value propositions, different offers, different hero structures — and test for a 25% lift. Bigger swings need smaller samples. Second, test higher-traffic pages. If your pricing page gets 3,000 visits per month but your homepage gets 30,000, run your first experiments on the homepage where you will actually learn something in a reasonable timeframe.
Third, accept that A/B testing may not be the right tool for this page yet. Qualitative research — user interviews, session recordings, five-second tests, heuristic reviews — will teach you more in a week than a two-month underpowered A/B test ever will. We worked with an early-stage SaaS client who insisted on A/B testing their pricing page with 1,200 monthly visitors. After three months of inconclusive tests, they switched to structured customer interviews, identified the real objection in ten conversations, and rewrote the page based on that. The rewrite lifted conversions 34% on its own — no test needed.
Avoid the other common traps that kill experiment programs.
Read the mistakes guide →
Why you should not peek at results
Peeking — checking your test dashboard daily and stopping the moment p dips below 0.05 — is the single most expensive mistake in A/B testing. The math behind frequentist significance assumes you look at the data exactly once, when the predetermined sample size is reached. Every additional peek inflates your actual false positive rate. If you check significance 10 times during a test and stop at the first green light, your effective false positive rate is not 5% — it is closer to 20-30%. That is why so many winning tests fail to replicate when shipped.
The honest limitation: peeking is tempting because early results look dramatic. A variant can sit at +40% lift with p = 0.03 after 300 conversions and then regress to +3% with p = 0.4 after 2,000 conversions. Small samples are noisy. The only way to avoid being fooled is to commit to the sample size calculation up front, write it down, and refuse to act on interim results. If you genuinely cannot wait — which is rare — use a sequential testing method or a Bayesian framework that accounts for continuous monitoring. But do not peek on a frequentist test and pretend the math still works.
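If you would rather see the inflation than take it on faith, here is a small Monte Carlo sketch. It simulates A/A tests (identical variants, so every "significant" result is a false positive by construction), peeks ten times per test, and counts how often any peek crosses p < 0.05. All parameters are illustrative:

```python
# Monte Carlo sketch: how much does peeking inflate false positives?
# We simulate A/A tests (no true difference), check significance after
# every batch of traffic, and stop at the first p < 0.05.
import random
from math import sqrt
from statistics import NormalDist

def two_proportion_p(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

def peeking_aa_test(rate: float = 0.03, batch: int = 1_000,
                    peeks: int = 10) -> bool:
    """True if any interim peek (falsely) declares a winner."""
    conv_a = conv_b = n = 0
    for _ in range(peeks):
        n += batch
        conv_a += sum(random.random() < rate for _ in range(batch))
        conv_b += sum(random.random() < rate for _ in range(batch))
        if two_proportion_p(conv_a, n, conv_b, n) < 0.05:
            return True  # a "winner" that is pure noise
    return False

random.seed(7)
trials = 2_000
false_wins = sum(peeking_aa_test() for _ in range(trials))
print(f"False positive rate with 10 peeks: {false_wins / trials:.1%}")
# Prints a rate far above the nominal 5% (roughly 15-25% is typical).
```

Checking only once, at the final sample size, brings that rate back down to the 5% the test was designed for.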
How Copysplit handles duration automatically
Copysplit calculates your required sample size before the test starts, based on your actual baseline conversion rate and the MDE you select. We use frequentist statistics with 95% confidence and 80% power — the same standards used by every serious experimentation platform — and we explicitly prevent the dashboard from showing winner language until sample size is met. You still see live progress, but the call-to-ship is locked until the math supports it. That single guardrail prevents most peeking-driven false positives without requiring you to remember the rules.
We also round up to full weeks automatically, flag low-traffic pages where the test will take more than 60 days (with a suggestion to widen the MDE), and surface the required sample size in plain English before you launch. If you are new to the discipline, Copysplit's AI copy generation is designed to produce variants bold enough to detect realistic lifts — we bias toward meaningful differences rather than tiny word swaps, because small edits need enormous samples.
Ready to run copy tests with the math handled for you? Start a free Copysplit trial.
Start free trial →
Frequently asked questions
- What is the minimum length for an A/B test?
- Can I stop a test early if significance hits quickly?
- What if my test never reaches significance?
- Do I need more traffic to test copy than to test design?
- How does MDE affect my required sample size?
- Should I exclude returning visitors from my sample size math?
Test duration is not a question of patience or gut feel — it is a question of arithmetic. Once you accept that the three inputs (baseline rate, MDE, and traffic) fully determine how long you need to run, the anxiety around "is it done yet?" mostly disappears. You calculate the sample size, you round up to the nearest full week, you ignore the dashboard until you hit that number, and you trust the result. The teams that ship the most winning changes year over year are not the ones running the most tests; they are the ones running tests long enough to trust. Set the sample size before you launch, honor it, and your win rate — and your credibility with the rest of the business — will climb together.
Ready to test your copy?
Stop guessing which headlines convert. Start testing with Copysplit today.
Start Free Trial →