
How A/B Test Scores Are Calculated

The main challenge of A/B Testing is ensuring that tests are reliable and that their results are significant for your business.

A/B Testing is grounded in statistical analysis and relies on a sound methodology that is trustworthy and verifiable. We use proven mathematics to ensure that the underlying statistical calculations represent (or reliably predict) real differences between index A and index B.

The method and math at a glance

  • Randomness: The assignment of any one user to scenario A or B is purely random.

  • Statistical significance: Our statistical significance threshold is set at 95%. A test is conclusive (statistically significant) when the confidence score, roughly 1 minus the p-value, is >= 95%, which corresponds to a p-value of <= 0.05.

  • Mathematical formula: The statistical test used to calculate the scores is two-tailed. This means we don't assume that B is better than A: we calculate the probability of a difference in both directions, whether A is better than B or B is better than A (see the sketch after this list).

  • Relevance improvement: That said, the concern with A/B Testing is usually improvement: every change in B is intended to be better than the current main index (A). In other words, A/B Testing helps you find a better index configuration.
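To make this concrete, here is a minimal sketch of a two-tailed two-proportion z-test, a standard way to turn raw conversion counts for A and B into a confidence score of 1 minus the p-value. The function name, variable names, and example numbers are illustrative assumptions, not the engine's exact implementation.

```python
# A minimal sketch (assumed names and numbers) of a two-tailed
# two-proportion z-test: a classic way to turn conversion counts
# for A and B into a confidence score of 1 minus the p-value.
from math import sqrt, erf

def confidence_score(conversions_a: int, users_a: int,
                     conversions_b: int, users_b: int) -> float:
    """Return 1 minus the two-tailed p-value, as a value in [0, 1]."""
    rate_a = conversions_a / users_a            # observed conversion rate of A
    rate_b = conversions_b / users_b            # observed conversion rate of B
    # Pooled rate under the null hypothesis "A and B perform the same".
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err             # distance in standard errors
    # Two-tailed: count extreme outcomes in both directions
    # (A better than B, or B better than A).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return 1 - p_value

# Example: 10,000 users per variant, 10.0% vs 11.0% conversion.
score = confidence_score(1_000, 10_000, 1_100, 10_000)
print(f"confidence: {score:.1%}, conclusive: {score >= 0.95}")
```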

Statistical Significance or Chance

When you run your tests, you may get results that show a 4% increase in Conversion or CTR. Statistical significance is concerned with whether that 4% increase is real or just chance. The underlying question is whether your sample group truly represents the larger population: does the 4% hold only for that sample group, or does it reasonably predict the behavior of the larger population?

If the sample doesn't represent the larger population, then your results are due to chance (or luck) and are insignificant. Chance is not good for decision-making. Statistical significance (the confidence score) distinguishes chance from a real effect. When you reach 95% confidence, the difference between A and B is no longer a chance happening but something you can confidently expect (or predict) to happen again in the future.
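As a hypothetical illustration, the snippet below applies the same two-tailed test (repeated inline so it runs on its own) to an identical 4% relative lift observed on a small and on a large sample. The 10% baseline conversion rate and the user counts are assumptions chosen for the example.

```python
# Hypothetical numbers: a 10% baseline conversion rate with a 4% relative
# lift (10.0% -> 10.4%), observed on a small and on a large sample.
from math import sqrt, erf

def confidence(conv_a, users_a, conv_b, users_b):
    # Same two-tailed two-proportion z-test as in the earlier sketch.
    pooled = (conv_a + conv_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (conv_b / users_b - conv_a / users_a) / std_err
    return 1 - 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 1 - p-value

for users in (5_000, 200_000):
    score = confidence(round(0.100 * users), users, round(0.104 * users), users)
    print(f"{users:>7} users per variant -> confidence {score:.1%}")
# The small sample stays far below the 95% threshold (chance);
# the large one clears it (a real, predictable difference).
```

The observed lift is the same in both runs; only the amount of evidence behind it changes, and with it the confidence score.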

Large, distributed samples

Large samples of data are necessary for confidence. If we flip a coin 1,000 times, we should expect a close to 50/50 heads/tails split. If we flip it just a few times, the split can be heavily skewed: it's entirely possible to flip heads three times in a row, but extremely unlikely to do so 1,000 times in a row.

Increasing sample size stabilizes results, allowing increased confidence. Each new user event clarifies the underlying pattern and draws us, generally, towards an “absolute truth”.
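The coin-flip intuition is easy to check with a short simulation; the flip counts and the fixed seed below are arbitrary choices for illustration.

```python
# A small simulation of the coin-flip intuition: with few flips the observed
# share of heads swings widely; with many flips it settles near the true 50%.
import random

random.seed(7)  # arbitrary fixed seed so the run is reproducible

for flips in (10, 100, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    print(f"{flips:>7} flips -> {heads / flips:6.1%} heads")
```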

Sample diversity

Be careful about when you test. Testing during a sales campaign, a major holiday, or some other exceptional event can undermine the reliability of your results, because the traffic during that period may not represent your typical users.
