Saturday 14 April 2018

Simpson's Paradox and other common data fallacies

Data-driven decision making is helping companies flourish, but with so much data readily available, it is easy to overlook some common fallacies when interpreting it. Here are four common data fallacies that can lead you to misinterpret your data.

Simpson's Paradox


This is a phenomenon in which a trend is observed in different groups of data but reverses or disappears when you look at the aggregated data. Let's consider an example of an e-commerce website that has one section for baby items and another for electronics, where we want to analyze which traffic segment (male or female) makes more transactions. Tabulated, this is what the data could look like.




In the example above, male traffic converts better in both categories, but if we look at the overall numbers, female traffic seems to be doing better by almost 7%!

This is a classic example of Simpson's paradox, where the trend in the aggregated data either disappears or reverses. This happens because each segment's traffic is split across the sections in different proportions, so the aggregate weights the sections differently for each segment. If sections of your site have very different traffic mixes, it is worthwhile to look at the trends in the granular data.
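To see how the reversal can come about, here is a minimal sketch in Python. The visit and transaction counts are hypothetical, picked only to reproduce the effect; they are not the figures from the table in this post.

```python
# Hypothetical visit and transaction counts, chosen only to illustrate the
# effect; they are not the figures from the table in this post.
data = {
    # (section, segment): (visits, transactions)
    ("baby items", "male"): (100, 30),
    ("baby items", "female"): (1000, 250),
    ("electronics", "male"): (1000, 50),
    ("electronics", "female"): (100, 4),
}


def conversion_rate(visits, transactions):
    """Return transactions as a fraction of visits."""
    return transactions / visits


# Per-section conversion rate for each segment.
per_section = {key: conversion_rate(*counts) for key, counts in data.items()}

# Aggregate conversion rate for each segment across both sections.
overall = {}
for segment in ("male", "female"):
    visits = sum(v for (_, seg), (v, _) in data.items() if seg == segment)
    transactions = sum(t for (_, seg), (_, t) in data.items() if seg == segment)
    overall[segment] = conversion_rate(visits, transactions)

for (section, segment), rate in per_section.items():
    print(f"{section:12s} {segment:7s} {rate:.1%}")
for segment, rate in overall.items():
    print(f"{'overall':12s} {segment:7s} {rate:.1%}")
```

In this made-up data, male traffic has the higher rate in each section, yet most male visits land in the low-converting electronics section while most female visits land in the high-converting baby items section, so the overall female rate comes out higher.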

Causal Inference from Correlation


This is one of the most common mistakes that people make when interpreting data. If two things happen together, it does not necessarily mean that one caused the other. There are countless examples of things that are correlated but where a causal relationship makes no sense. One of my favorites is the correlation of suicides with US spending on science, space and technology.



If you want to look at other correlations that do not make any sense, have a look at Spurious Correlations (which is where the chart is taken from). Correlations are extremely useful, but always be critical of them and check whether other underlying factors may be producing the results that you see.
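One common underlying factor is simply time: two series that both grow over the years will look strongly correlated even if they have nothing to do with each other. The sketch below uses two made-up series (not the data behind the chart) and a hand-written `pearson` helper, and compares the correlation of the raw values with the correlation of their year-over-year changes.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
years = range(50)

# Two made-up series that never influence each other; both simply grow over time.
series_a = [1.0 * t + rng.gauss(0, 1) for t in years]
series_b = [2.0 * t + rng.gauss(0, 1) for t in years]

raw_r = pearson(series_a, series_b)

# Year-over-year changes remove the shared upward trend.
diff_a = [b - a for a, b in zip(series_a, series_a[1:])]
diff_b = [b - a for a, b in zip(series_b, series_b[1:])]
diff_r = pearson(diff_a, diff_b)

print(f"correlation of raw series:      {raw_r:.3f}")
print(f"correlation of yearly changes:  {diff_r:.3f}")
```

The raw series correlate almost perfectly because of the shared trend. Once the trend is removed, the correlation is typically much weaker, which hints that time, and not one series driving the other, explains the relationship.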


Gambler's Fallacy



Suppose you tossed a fair coin and got a head. What are your chances of getting a tail on the next toss? If your answer is more than 50%, then you are a victim of the Gambler's Fallacy. People tend to look at a run of unlikely outcomes and, based on that history, are prejudiced to think that the next outcome will be different. In probability terms, these are independent events: the process is memoryless, so future outcomes have no dependency on, or 'memory' of, the history. In the case of the coin toss example,

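Probability of getting a head: P(H) = 0.5
Probability of getting a tail: P(T) = 0.5
Probability of getting a tail given that you have already got a head, P(T|H):

 P(T|H) = P(H and T) / P(H) = (P(H) * P(T)) / P(H) = P(T) = 0.5

The second step uses the fact that the tosses are independent, so P(H and T) = P(H) * P(T).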

It doesn't matter that you already got a head; the probability of getting a tail has not changed. This extends to any number of coin tosses. When looking at your data, always be aware of these biases before making any decision.
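The same result can be checked empirically. The following is a small simulation sketch, assuming a fair coin modelled with Python's `random` module: it estimates the probability of a tail on tosses that directly follow a head and compares it with the overall share of tails.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
tosses = [rng.choice("HT") for _ in range(100_000)]

# Look at every toss that directly follows a head.
after_head = [nxt for prev, nxt in zip(tosses, tosses[1:]) if prev == "H"]
p_tail_after_head = after_head.count("T") / len(after_head)

# Overall share of tails, for comparison.
p_tail = tosses.count("T") / len(tosses)

print(f"P(T)     ~ {p_tail:.3f}")
print(f"P(T | H) ~ {p_tail_after_head:.3f}")
```

Both estimates should land close to 0.5: knowing that the previous toss was a head tells you nothing about the next one.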

Small Sample Size


Suppose you make a new landing page to advertise a new product on your website and see that 30% of all traffic converted within the first week. The total sample size, however, is only 50.

Credits: xkcd comics


Is this a cause to be happy? Perhaps, but remember that the sample size is too small to make any generalizations about future behavior. Percentages swing wildly with small numbers, so it is important to have large enough, representative samples so that any conclusions you draw are useful and meaningful.
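One way to put a number on that uncertainty is a confidence interval for the conversion rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); this is one reasonable choice among several, not the only way to do it. The 5,000-visitor case is a hypothetical comparison added for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 30% of 50 visitors is 15 conversions (the landing page example).
small_low, small_high = wilson_interval(15, 50)

# Hypothetical comparison: the same 30% rate observed on 5,000 visitors.
large_low, large_high = wilson_interval(1500, 5000)

print(f"n=50:   {small_low:.1%} to {small_high:.1%}")
print(f"n=5000: {large_low:.1%} to {large_high:.1%}")
```

With 50 visitors the interval runs from roughly 19% to 44%, so the true conversion rate could plausibly be far below or well above 30%. With 5,000 visitors the same observed rate gives a much tighter range.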

The next time you are looking at data, keep these common fallacies in mind. Always be critical of your dataset and be aware of any underlying biases to make the most out of your data!
