Saturday 14 April 2018

Simpson's Paradox and other common data fallacies

Data-driven decision making is helping companies flourish, but with so much data readily available, it is easy to overlook some of the common fallacies you can fall into while interpreting it. Here are four common data fallacies that can make you misinterpret your data.

Simpson's Paradox


This is a phenomenon in which a trend observed in different groups of data reverses or disappears when you look at the aggregated data. Let's consider an example of an e-commerce website that has a section for baby items and another for electronics, and we want to analyze which traffic segment (male or female) makes more transactions. Tabulated, this is how the data could look.




In the example above, male traffic is converting better in both categories, but if we look at the overall numbers, female traffic seems to be doing better by almost 7%!

This is a classic example of Simpson's paradox, where the trend either reverses or disappears in the aggregated data. It happens because the proportions in each subsection are different, so if sections of your site are segmented very differently, it is worthwhile to look at the trends in the granular data.
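To make the reversal concrete, here is a minimal Python sketch with purely hypothetical numbers (not the figures from the table above) that reproduces the pattern: male traffic converts better in each category, yet female traffic converts better overall because the two segments are distributed very differently across the categories.

    # Hypothetical (transactions, visitors) counts, for illustration only.
    data = {
        "baby items":  {"male": (90, 900), "female": (8, 100)},
        "electronics": {"male": (30, 100), "female": (252, 900)},
    }

    # Per-category conversion rates: male converts better in both categories.
    for category, segments in data.items():
        for segment, (transactions, visitors) in segments.items():
            print(f"{category:11s} {segment:6s}: {transactions / visitors:.1%}")

    # Aggregated over both categories, the trend reverses and female traffic wins.
    for segment in ("male", "female"):
        transactions = sum(data[c][segment][0] for c in data)
        visitors = sum(data[c][segment][1] for c in data)
        print(f"overall     {segment:6s}: {transactions / visitors:.1%}")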

Causal Inference from Correlation


This is one of the most common mistakes people make when interpreting data. If two things happen together, it does not necessarily mean that one caused the other. There are countless examples of things that are correlated but where a causal relationship makes no sense. One of my favorites is the correlation of suicides with US spending on science, space and technology.



If you want to look at other correlations that do not make any sense, have a look at Spurious Correlations (which is where the chart is taken from). Though correlations are extremely useful, always be critical of them and ask whether there are any underlying factors that may be producing the results you see.
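To see how easily unrelated quantities can line up, here is a toy simulation (my own illustration, not the chart above): two independent random walks that have nothing to do with each other will frequently show a sizeable correlation coefficient simply because both happen to trend.

    import numpy as np

    # Two independent random walks: neither causes the other, yet trending
    # series frequently produce large correlation coefficients by accident.
    rng = np.random.default_rng(0)
    a = np.cumsum(rng.normal(size=500))
    b = np.cumsum(rng.normal(size=500))

    print(f"correlation between two unrelated series: {np.corrcoef(a, b)[0, 1]:.2f}")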


Gambler's Fallacy



Suppose you tossed a coin and got a head. What are your chances of getting a tail on the next toss? If your answer is more than 50%, then you are a victim of the Gambler's Fallacy. People tend to look at a series of unlikely events and, based on the history, are prejudiced to think that the next outcome will be different. In probability, these are examples of memoryless distributions, where future outcomes have no dependency on, or 'memory' of, the history. In the case of the coin toss example,

Probability of getting a head: P(H)=0.5
Probability of getting a tail:  P(T)=0.5
Probability of getting a tail given that you have already got a head (the tosses are independent, so P(T and H) = P(H) * P(T)):

P(T|H) = P(T and H)/P(H) = (P(H) * P(T))/P(H) = P(T) = 0.5

It doesn't matter that you got a head already; the probability of getting a tail has not changed. This extends to any number of coin tosses. When looking at your data, always be aware of these biases before making any decision.
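A quick simulation (a sketch using Python's standard library) makes the memorylessness tangible: among all tosses that immediately follow a head, the fraction of tails still hovers around 50%.

    import random

    # Simulate fair coin tosses and estimate P(tail | previous toss was head).
    random.seed(42)
    tosses = [random.choice("HT") for _ in range(100_000)]

    after_head = [curr for prev, curr in zip(tosses, tosses[1:]) if prev == "H"]
    tails_after_head = sum(1 for t in after_head if t == "T")

    print(f"P(tail | previous head) ~ {tails_after_head / len(after_head):.3f}")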

Small Sample Size


Suppose you make a new landing page to advertise a new product on your website and see that 30% of all traffic converted within the first week, on a total sample size of 50.

Credits: xkcd comics


Is this cause for celebration? Perhaps, but remember that the sample is too small to make any generalizations about future behavior. Percentages swing wildly with small numbers, so it is important to have representative samples so that any conclusions you draw are useful and meaningful.
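A rough way to see how shaky that 30% is: the normal-approximation 95% confidence interval for a proportion (a sketch; for samples this small an exact or Wilson interval would be more appropriate) spans roughly 17% to 43% at n = 50 and tightens as the sample grows.

    import math

    def approx_ci95(p, n):
        # Normal-approximation 95% confidence interval for a proportion.
        se = math.sqrt(p * (1 - p) / n)
        return p - 1.96 * se, p + 1.96 * se

    for n in (50, 500, 5000):
        low, high = approx_ci95(0.30, n)
        print(f"n={n:5d}: 30% conversion, 95% CI ~ [{low:.1%}, {high:.1%}]")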

The next time you are looking at data, keep these common fallacies in mind. Always be critical of your dataset and be aware of any underlying biases to make the most out of your data!

Sunday 8 April 2018

How (not) to do an A/B test? 5 common mistakes that ruin your experiment

In the previous post, we looked at the statistics behind A/B testing. Now that we know what the numbers mean, let's look at some of the common mistakes in A/B testing that ruin results and waste valuable time and resources, and how we can avoid them.

Stopping the test too early 


One major assumption in calculating the significance level is that the sample size is fixed and the experiment will be stopped once that sample size is reached. In the physical and natural sciences this is usually not violated, but in digital analytics, since the data is so easily accessible, people have a tendency to 'peek' at the results and stop the test as soon as significance is reached. This is a major mistake and should always be avoided.

Let's look at our coin toss example again and see if the number of heads is significantly different from a fair coin. This time we are going to peek in the middle of the experiment (500 coin tosses) and measure significance. There are four possible outcomes and conclusions. In Scenario I, even though we look at the results in the middle, we stop the experiment only after 1000 coin tosses.

Scenario I
In the second scenario, we look at the results in between and stop the test once significance is reached. In this case, our chances of detecting a false positive have increased simply because we stopped the test as soon as we reached significance!

Scenario II

The probability of getting a false positive increases dramatically with how frequently you 'peek' at the data. If this were an actual A/B test, we would have declared the variant successful after five hundred visitors and stopped the test even though there was no change at all! It is okay to monitor the results, but it is equally important to resist the temptation of stopping the test before the planned sample size is reached!
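The inflation is easy to reproduce in a simulation. The sketch below is my own illustration, assuming an A/A test (two identical fair 'coins', so any significant result is a false positive) evaluated with a two-sided two-proportion z-test at the 5% level: testing only once at the planned sample size keeps the false positive rate near 5%, while stopping at the first significant peek pushes it far higher.

    import numpy as np

    rng = np.random.default_rng(1)

    def significant(success_a, success_b, n):
        # Two-sided z-test for two proportions at the 5% level.
        p_a, p_b = success_a / n, success_b / n
        p_pool = (success_a + success_b) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        return se > 0 and abs(p_a - p_b) / se > 1.96

    n_total, checkpoints, experiments = 1000, range(100, 1001, 100), 2000
    fp_end_only = fp_peeking = 0

    for _ in range(experiments):
        a = rng.random(n_total) < 0.5   # variant A: no real difference
        b = rng.random(n_total) < 0.5   # variant B: identical to A
        # Scenario I: look only once, at the planned sample size.
        fp_end_only += significant(a.sum(), b.sum(), n_total)
        # Scenario II: peek at every checkpoint, stop at the first significant result.
        fp_peeking += any(significant(a[:n].sum(), b[:n].sum(), n) for n in checkpoints)

    print(f"false positive rate, fixed sample size:      {fp_end_only / experiments:.1%}")
    print(f"false positive rate, stop on first 'winner': {fp_peeking / experiments:.1%}")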

Confirmation Bias and pitfalls of testing without a solid hypothesis


'To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of' - Ronald Fisher

While it is a good idea to be thorough and look at the results holistically, oftentimes confirmation bias kicks in and people expect to see results where there are none. That is mostly the case when A/B tests do not have well-defined hypotheses. At the end of the test, people usually end up looking at multiple metrics and decide to go ahead if even one of them shows a positive uplift. The problem with this approach is that the more metrics you look at, the greater the chance of a false positive result.

Let's take an extreme example where we look at three metrics. If we consider the results at the 95% confidence level, then the probability of detecting a false positive on any single metric is 5%. With just three metrics (assuming they are independent), the probability of getting a false positive in at least one of them increases to 1 - 0.95^3, or about 14%! So the more metrics you look at, the higher the chance that one of them will show a significant change, and you might be tempted to go ahead with launching it to your full traffic.
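The 14% figure comes from the complement rule: with k independent metrics each tested at the 5% level, the chance that at least one shows a false positive is 1 - 0.95^k. A quick sketch:

    # Chance of at least one false positive across k independent metrics,
    # each tested at the 5% significance level.
    for k in (1, 3, 5, 10):
        print(f"{k:2d} metrics: {1 - 0.95 ** k:.0%} chance of a spurious 'winner'")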

Even though it is good to keep track of all metrics, it is important to have a clear hypothesis and a goal in mind that you can validate your test against, rather than looking at the numbers and then arriving at a hypothesis!

Novelty Effect 




Introducing a prominent change that users are not accustomed to can create extra interest and drive up conversions, but the question to ask is whether the uplift comes from the change you made or from the novelty of the feature itself. If you have small green buttons all over your website and suddenly change them to big red ones, it is very likely to intrigue your returning users, who are not used to seeing big buttons. Testing the change against new traffic will tell you whether it produces any genuine uplift. With feature changes like these, it is always a good idea to test against new users to rule out the novelty effect.


Testing over incomplete cycles


Most business data has a seasonal component. Traffic and customer behavior usually differ by day of the week, with engagement varying from day to day. Depending on how strong the seasonality is, this is what typical behavior might look like.


Let's say that you run the test for three days and stop once you have the results. Since you have not tested on all days, it is very likely the results will not be reproducible once you roll the change out fully. When testing, always run the test over a full cycle. Typically that's a week, though sometimes it is necessary to test over two cycles if the sample size is not sufficient.

Not Testing with all Customer Segments

Not all Customer Segments are the Same

The traffic on your site is not uniform. Some of it comes from search engines, some is direct, some users are repeat and loyal customers, while others are interacting with your website for the first time. What you are testing may behave differently across customer segments, so it is important to keep this context in mind. A prime example is pricing: if you are A/B testing the price of a product, it is highly likely that repeat customers will behave differently from new users, so in this case it is important to A/B test separately against these two segments.


If you are still unsure about what confidence intervals and significance levels mean and need a refresher on A/B testing statistics, have a look at an earlier blogpost. Comment below about your experience with A/B testing. Happy experimenting!







Saturday 7 April 2018

Statistics of A/B testing: How to Improve Conversions on your Website



What is A/B Testing?

A/B testing is a popular method used by websites to drive up conversions, engagement or revenue. The idea is pretty simple: show users two versions of a page on your website, one without the change and one with the change you want to test. Each variation is shown to a different set of users and its performance is measured. The winning variation is rolled out to all traffic. Test, roll out, repeat, and you have a winning strategy for optimizing conversions on your website.
Even though the idea behind A/B testing is simple, and there are now a lot of online tools that let you run experiments on the fly, it is important to understand the meaning and limitations of the results before you make any decisions. In this post, I give an overview of the statistics behind A/B testing and what the associated jargon means.

Hypothesis-driven testing


While there are endless things you can test on your website, every A/B test should start out with a well-defined hypothesis. A simple example of a hypothesis: changing the color of a button to red will prompt more users to click because it is more visible, based on similar results you have seen on other pages of your website. Every A/B test is a validation or invalidation of such a hypothesis, and at the end of the test you want to know whether it makes sense to go ahead with your change. If you do not start out with a well-defined hypothesis and a target metric, chances are that you are going to look for some metric that happens to be affected positively and base your decision on it, which, as I explain in the next post, is a poor approach to take.

Statistics behind A/B Testing


Conversions as click-through probabilities

Most A/B tests are about getting users to click on something. The click could be on the button of a checkout page, a link to a blog post or even the title of an up-selling email. In all of these, you want users to convert to the next stage of the journey, like making a payment, reading the blog or opening the email. The probability of a user making the click is termed the click-through probability and can be calculated simply as the ratio of users who clicked to the total users who were exposed to the page. All of these actions follow what is known as the binomial distribution.

Any experiment that follows the binomial distribution can generally be characterized by the following properties:
  1. There are two mutually exclusive outcomes, often referred to as success and failure.
  2. Independent events. The outcome of one event does not have any effect on the outcome of another.
  3. Identical distribution. The probability of success remains the same for all events.
Now let's look at some examples of different events and see if they can be modeled using the binomial distribution.

Users completing the checkout page on an e-commerce website: There are two mutually exclusive outcomes here; either the user completes the checkout page and moves to the payments page, or they don't. Also, it is safe to assume that one user completing the checkout page has no impact on another user completing it. Of course, there are exceptions, for example family members exploring items independently but then completing the process through one account, in which case the outcomes are not independent. In most cases, though, it is safe to assume that checkout completions are independent.

Items purchased on an e-commerce site: Using the same example as above, but now looking at purchases of individual items. Once again, the outcomes are mutually exclusive: either an individual item is purchased or it is not. However, the events are not independent, since a user can add multiple items to the shopping cart and then buy them together, in which case the purchases are highly dependent and the binomial distribution would not be the right choice.
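For an outcome like checkout completion, where the assumptions do hold, the binomial model is easy to simulate. Here is a minimal sketch with hypothetical numbers (the conversion rate and traffic below are made up for illustration):

    import numpy as np

    # Each visitor independently completes checkout with probability p.
    rng = np.random.default_rng(7)
    p, visitors = 0.12, 2000            # hypothetical conversion rate and daily traffic

    completions = rng.binomial(n=visitors, p=p, size=10_000)   # 10,000 simulated days
    rates = completions / visitors

    print(f"mean conversion rate: {rates.mean():.3f}")
    print(f"spread of the rate:   {rates.std():.4f} "
          f"(theoretical standard error: {np.sqrt(p * (1 - p) / visitors):.4f})")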

Confidence Intervals and Significance Levels

One easy way to think about conversions is in terms of coin tosses, since the underlying distribution for both is binomial. Now let's run an A/B test where we do a hundred coin tosses wearing a red shirt and a hundred wearing a blue shirt, and see how many times we get a head.

Our hypothesis here is that wearing a red shirt improves conversion.

Red shirt: 55 heads                              Blue shirt: 44 heads

From the results, we see that we get 25% more heads (55 vs. 44) when wearing a red shirt. Should we always wear a red shirt to all coin tosses? What happens if we repeat the experiment?

One of the most important aspects of A/B tests is that the results be reproducible. If you can't replicate the success of an experiment, there is no use in getting improvements in the sample. But how do we define success? The answers to all of the above lie in confidence intervals. To build basic intuition, remember that all A/B tests are performed on samples of users and different samples will have different means. Confidence intervals allow us to capture that variation and draw reasonable conclusions.

A/B test results are often quoted at a significance level. A low significance level of 5% or 1% means that it is unlikely that the difference we are seeing occurred just by chance. Or conversely, we say that the difference is 'statistically significant' at a confidence level of 95% or 99% (the complement of 5% and 1% respectively).

Confidence intervals are ranges around the mean that are determined using the standard error. In the case of the binomial distribution, the standard error is given by SE = sqrt(p(1-p)/N), where p is the probability of success and N is the sample size.


The 95% confidence intervals are typically constructed using the normal distribution. We can use the Central Limit Theorem to approximate the outcomes of a binomial distribution with the normal distribution, or bell curve (a common rule of thumb used to justify this is to verify that Np and N(1-p) are both greater than 5). Let's look at the coin toss example again and see what the distribution looks like. It is just a normal distribution centered at 0.5 (a fair coin) with the standard error calculated above, 0.05. In a normal distribution, 95% of the values lie within about 1.96 standard errors of the mean. Graphically, the distribution for many experiments of 100 coin tosses would look as follows, where 95% of the observations lie between ~0.40 and ~0.60.


Looking back at our coin toss results, we see that getting 55 heads or 44 heads is reasonable and lies well within the 95% confidence interval, so the results actually aren't surprising at all!
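The same check takes only a few lines (a sketch using the normal approximation described above):

    import math

    p, n = 0.5, 100                      # a fair coin, 100 tosses
    se = math.sqrt(p * (1 - p) / n)      # standard error = 0.05
    low, high = p - 1.96 * se, p + 1.96 * se

    print(f"95% confidence interval: [{low:.2f}, {high:.2f}]")   # roughly [0.40, 0.60]
    for heads in (55, 44):
        inside = low <= heads / n <= high
        print(f"{heads} heads out of 100 -> {heads / n:.2f}, inside the interval: {inside}")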

In summary, a 95% confidence interval means that if we were to repeat the experiment many times, the intervals we construct would cover the true population mean 95% of the time. Conversely, around 5% of the time our confidence interval will not contain the true population mean.

Type I and Type II errors


While A/B testing gives a solid launchpad for optimizing product features and conversions, at times you can see improvements where there actually are none, or the experiment can show no result despite an actual improvement. If we categorize the results from A/B tests, there are four possible outcomes.


The two red boxes are the cases where we draw the wrong inference from our experimental data, and these are the ones we need to be careful about when designing the experiment. Graphically, these are the overlap regions shown below. As we discussed above, statistical significance deals with the Type I error, where we declare a 'winner' when there is none. Type II errors, on the other hand, hide winning variations, and since running an experiment takes resources and time, you want to take this into account when designing the test.



Getting quick and reliable results from A/B testing

So we want to get reliable results from our A/B tests while minimizing the Type I and Type II errors. Type II errors can be minimized if we minimize the overlap in the above graph. Remember that the spread of the curve is determined by the standard error, which is inversely proportional to the square root of the sample size.

Having a large enough sample is important to minimize the Type II error. Here is the graph of the distribution for 1000 coin tosses. Increasing the sample size reduces the spread and hence minimizes the type II error.


Larger differences are easier to detect, with a lower chance of a Type II error. Looking again at the figure for the overlaps, it is easy to see that the further apart the two distributions are centered, the smaller the overlap and hence the lower the chance of making a Type II error.

So make bold changes to your website that would have a big impact, run the test for a sufficient time, and you should be good to go. Type I errors can be reduced by demanding a higher confidence level (i.e. a lower significance level), though that also increases the required sample size. Hence, it is always a trade-off between the accuracy of your results and the time you want to invest. If you are making big changes with a significant impact on the business, it is always a good idea to have enough users in your test while keeping a high confidence level.
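As a rough planning aid, the required traffic per variant can be estimated with the standard normal-approximation formula for comparing two proportions. The sketch below is an illustration with hypothetical numbers (a 10% baseline conversion rate, a lift to 12%, 95% confidence and 80% power), not a substitute for a proper sample size calculator.

    from scipy.stats import norm

    def sample_size_per_variant(p_baseline, p_variant, alpha=0.05, power=0.80):
        # Approximate visitors needed per variant for a two-sided two-proportion z-test.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        p_bar = (p_baseline + p_variant) / 2
        n = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
             + z_beta * (p_baseline * (1 - p_baseline)
                         + p_variant * (1 - p_variant)) ** 0.5) ** 2
        return n / (p_variant - p_baseline) ** 2

    # Hypothetical example: detect a lift from 10% to 12% conversion.
    print(f"~{sample_size_per_variant(0.10, 0.12):,.0f} visitors per variant")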

I hope this post was helpful in providing an overview of what's happening when you run your A/B tests. Let me know in the comments about any interesting A/B tests that you have run. Don't forget to read 'How (not) to do A/B testing' in the next post to avoid some common mistakes that most people make. Happy testing!