An Important Data Lesson from an Inconsequential Football Scandal
“… one of big data analysis’ most under-appreciated problems: reverse causation. In reverse causation problems, we know the result and we work backwards to understand the causes.
Reverse causation investigations have the opposite structure from A/B tests, in which we vary known causes and observe how the variations affect an outcome. If the number of visitors to your website jumped after you changed the image on your Facebook page, you conclude that the new photo is the reason for the traffic surge. (Note: Good A/B test construction can help you see the most likely causes; bad A/B test construction creates its own set of problems.)
By contrast, the biggest obstacle to solving reverse causation is the infinite number of possible causes that might influence the known outcome. This is compounded by the fact that we want to assign a cause. So when some data that fits a narrative we may have already constructed is plucked out of a large set, it’s very tempting to simply assign causation where none exists. …
Big data is exposing all kinds of outliers and trends we hadn’t seen before and we’re assigning causes somewhat recklessly, because it makes a good story, or helps confirm our biases. You see this all the time in your Twitter stream: ‘7 Charts that Explain This.’ Or ‘The One Chart that Tells You Why Something Is Happening.’ We’re getting better and better at analyzing and visualizing big data to spot coincidences, outliers and trends. It’s getting easier and easier to convince ourselves of specific narratives without any real data to support them.
Most good statistical analysis will be narratively unsatisfying, loaded down with ‘we don’t know,’ ‘it depends,’ and ‘the data can’t prove that.’
You can see how this can become a big problem for companies wanting to exploit the big data they’re amassing. If you think about most practical data problems, they often concern reverse causation. The sales of a particular product suddenly plunged; what caused it? The number of measles cases spiked up in a neighborhood; how did it happen? People with a certain brand of phone tend to shop at certain stores; why is that? In cases like these, we know the outcome, and we often don’t know the cause.
The possibility of any number of causes tempts us to retrofit a narrative, but we must resist that temptation. The astute analyst is one who figures out how to bring a manageable structure to this work. See this post by statistician Andrew Gelman for further thoughts.”
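The warning about plucking a convenient pattern out of a large data set can be made concrete with a small simulation. The sketch below is not from the quoted article; it is a minimal illustration under stated assumptions. It generates a random "outcome" series and many equally random, unrelated "candidate cause" series, then reports the strongest correlation found. The function names (`pearson`, `strongest_chance_correlation`) and the sample sizes are hypothetical choices made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_candidates, n_obs, seed=0):
    """Return the largest-magnitude correlation between a random outcome
    and n_candidates random series that have no causal link to it.

    Assumption for illustration: every series is independent Gaussian noise,
    so any correlation that shows up is coincidence by construction.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson(candidate, outcome)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    # The more candidate "causes" we scan, the stronger the best coincidence.
    for n_candidates in (1, 10, 100, 2000):
        r = strongest_chance_correlation(n_candidates, n_obs=20)
        print(f"{n_candidates:>5} candidates scanned: strongest r = {r:+.2f}")
```

With only 20 observations per series, scanning a couple of thousand unrelated candidates will typically turn up at least one that looks strongly correlated with the outcome, even though nothing here causes anything. That is the trap the passage describes: the larger the set we search, the easier it is to find data that fits a story we already wanted to tell.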