Friday 4 May 2018

Using Visualizations to (un)cover Insights

If data is a desert, then insights are the oasis and visualization is the vehicle that helps you navigate. Visualizations simplify the data and make it easier to generate insights. In this blog, I look at four different examples where visualizations can either mislead your audience or help uncover important insights.

Cropped Y axis:


This is one of the most common ways in which a visualization can be very misleading. Consider the example where you have run an A/B test on your site and are now presenting the results to the team.
What message do you think the team gets from the first visualization?


From the first graph, it seems that we have made a big improvement and are now converting more users on our website. However, if you look closely, the axis starts at the 30% mark, which heavily exaggerates minor differences. The actual relative difference is about two percent, (0.312 - 0.306) / 0.306, which lies well within the normal variation associated with this conversion rate. Another common case is truncating the y-axis to exaggerate trends. Consider the example below, where we look at the trend in the percentage of users making a purchase. The graph is misleading: it gives the impression of great growth, even though the actual change is small and only looks large because of the truncated y-axis.
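To make the arithmetic concrete, here is a minimal Python sketch that computes the relative lift and compares it with how tall the two bars are drawn when the axis starts at 30%. The function names `relative_lift` and `drawn_height_ratio` are my own, made up for illustration; only the two conversion rates and the 30% axis start come from the example above.

```python
def relative_lift(control: float, variant: float) -> float:
    """Relative change of the variant over the control, as a fraction."""
    return (variant - control) / control


def drawn_height_ratio(control: float, variant: float, axis_start: float) -> float:
    """Ratio of the two bar heights as drawn when the y-axis starts at axis_start."""
    return (variant - axis_start) / (control - axis_start)


# Conversion rates quoted in the text.
control, variant = 0.306, 0.312

lift = relative_lift(control, variant)
truncated = drawn_height_ratio(control, variant, axis_start=0.30)
full = drawn_height_ratio(control, variant, axis_start=0.0)

print(f"relative lift: {lift:.1%}")
print(f"bar height ratio, axis from 30%: {truncated:.2f}x")
print(f"bar height ratio, axis from 0%: {full:.2f}x")
```

With the axis starting at 30%, the variant bar is drawn roughly twice as tall as the control bar, although the underlying lift is only about two percent.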


Should we never truncate the y-axis? In almost all cases, truncating the y-axis gives a wrong impression of the data and is best avoided. However, where small changes can have a big impact, people truncate the y-axis to highlight the change. For example, a rise in a tax rate could be small but still have a big impact on an average taxpayer's savings. Avoid truncating the y-axis where you can, and if you do end up doing it, flag it very clearly so that viewers are not misled by the visualization.

Looking at a subset of data:


Data is hardly ever uniform. Sales peak during Black Friday and the Christmas season, traffic on websites varies by the hour or the day, and so on. When looking at the data, you can often draw wrong conclusions if you neglect seasonality or look at only part of the data. Let's take the same example of the percentage of people making a purchase and see what different sets of data points can tell you. If you are a manager wanting to brag, you might be tempted to show the graph above with the truncated y-axis. When we reset the origin to zero, this is what we get.




Even though the increase doesn't look as great as before, we have still improved since October. But does this tell the complete picture? Let's look at the historical trend and see how we compare with the same months last year. A quick glance at the data shows that there are slight seasonal variations, with successful completions increasing towards December, which is what we see in our graph. The clear insight, however, is that we are actually trending downwards compared with last year.


So, if we neglected historical data, we would draw a completely different conclusion! When working with seasonal data or time-series data in general, it is always important to look at the entire cycle before drawing any conclusions.
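Here is a small Python sketch of that comparison. The monthly rates below are hypothetical numbers I made up for illustration (they are not the data behind the graphs), and the helper names `month_over_month` and `year_over_year` are my own. The point is only that the same series can rise month over month while sitting below the same months a year earlier.

```python
# Hypothetical monthly purchase rates in percent, for illustration only.
last_year = {"Oct": 12.0, "Nov": 12.6, "Dec": 13.1}
this_year = {"Oct": 10.8, "Nov": 11.3, "Dec": 11.9}


def month_over_month(series):
    """Change of each month versus the month before it (in percentage points)."""
    months = list(series)
    return {cur: series[cur] - series[prev] for prev, cur in zip(months, months[1:])}


def year_over_year(current, previous):
    """Change of each month versus the same month one year earlier."""
    return {m: current[m] - previous[m] for m in current if m in previous}


mom = month_over_month(this_year)
yoy = year_over_year(this_year, last_year)

for month, delta in mom.items():
    print(f"{month}: {delta:+.1f} pts vs previous month")
for month, delta in yoy.items():
    print(f"{month}: {delta:+.1f} pts vs same month last year")
```

In this made-up series every month-over-month change is positive, yet every year-over-year change is negative, which is the pattern described above.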

Looking at aggregated data


Let's look at an example of an e-commerce website that sells baby items and electronics, where we are interested in the behavior of males and females on the website. Users can make a purchase in either category, and one of the key metrics that we look at is conversion, i.e. the percentage of people who make a purchase after visiting the website. This is what the aggregated numbers may look like.




So overall it looks like we engage and convert the female segment better. From this, it would be reasonable to assume that females convert better in at least one of the subcategories, if not both. Let's look at the segmentation a level deeper.



It is surprising to see that, a level deeper, males actually convert better on our platform in both categories! The reason is that the two segments are distributed very differently across these categories.



This was a simplified example; most websites and businesses will have many more products or subcategories. In such cases, some products will cater mostly to one segment, and the underlying distribution can be very different from the aggregated one. This is a manifestation of Simpson's paradox, which arises from a mismatch in underlying proportions and can cause overall trends to disappear or reverse. You can read more about this in 'Simpson's Paradox and other common data fallacies'.
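The reversal is easy to reproduce. The sketch below uses hypothetical visit and purchase counts that I chose only so that the effect shows up (they are not the numbers from the tables above), and the helper names are my own. Females mostly visit the high-converting category and males mostly visit the low-converting one, which is enough to flip the aggregate.

```python
# Hypothetical (visits, purchases) counts, chosen only to reproduce the effect.
data = {
    "baby items": {"male": (100, 32), "female": (900, 270)},
    "electronics": {"male": (900, 90), "female": (100, 8)},
}


def conversion(visits, purchases):
    """Share of visits that ended in a purchase."""
    return purchases / visits


def by_category(data):
    """Conversion per category and segment."""
    return {
        category: {seg: conversion(*counts) for seg, counts in segments.items()}
        for category, segments in data.items()
    }


def overall(data):
    """Conversion per segment after summing visits and purchases over categories."""
    totals = {}
    for segments in data.values():
        for seg, (visits, purchases) in segments.items():
            v, p = totals.get(seg, (0, 0))
            totals[seg] = (v + visits, p + purchases)
    return {seg: conversion(v, p) for seg, (v, p) in totals.items()}


per_category = by_category(data)
aggregated = overall(data)

for category, rates in per_category.items():
    print(category, {seg: f"{rate:.1%}" for seg, rate in rates.items()})
print("overall", {seg: f"{rate:.1%}" for seg, rate in aggregated.items()})
```

In these made-up counts males convert better within each category, while females convert better overall, because 90% of female visits land in the category that converts well for everyone.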


Comparing multiple metrics on different scales

Often, you want to look at metrics that vary on different scales. One example of this is total purchases and visits on a website. Usually, customers visit multiple times before making a purchase, and many customers visit without purchasing at all, which means visits will be much higher than purchases. If you want to look at the trends for these metrics together, it will look something like this:




One way to approach this is to add a second axis and plot the purchases on it. Here, the second axis is cropped to make the trend visible. This is what the graph looks like now:



We are now able to see the trend in purchases a lot better, but it is difficult to make much of it since the point at which we crop the axis is arbitrary. We also see several points of intersection on the graph that have no meaning, since they occur only because of our choice of scale. A better way to analyze trends together for metrics on different scales is to compare each with a previous point. For example, let's look at the percentage change in values since the first of January.




Now you can make out the trends much more clearly. There is weekly seasonal variation in both metrics, though it is more pronounced for purchases; otherwise the trend is pretty flat. This was a basic example, but you can see that with a simple 'translation' of the metrics, we were able to compare trends that were otherwise difficult to uncover.
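The 'translation' itself is a one-liner. Here is a minimal Python sketch that rebases each metric to its first value and expresses every later point as a percentage change from it. The daily counts are hypothetical, made up for illustration, and `pct_change_from_first` is my own helper name.

```python
def pct_change_from_first(values):
    """Percentage change of every point relative to the first point in the series."""
    base = values[0]
    if base == 0:
        raise ValueError("the first value must be non-zero to serve as a baseline")
    return [(v - base) / base * 100 for v in values]


# Hypothetical daily counts starting on the first of January, for illustration only.
visits = [10000, 10400, 9800, 10100]
purchases = [200, 214, 190, 202]

visits_pct = pct_change_from_first(visits)
purchases_pct = pct_change_from_first(purchases)

for day, (v, p) in enumerate(zip(visits_pct, purchases_pct), start=1):
    print(f"day {day}: visits {v:+.1f}%, purchases {p:+.1f}%")
```

Both series now start at 0% and share one axis, so they can be compared directly without picking an arbitrary range for a second axis.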