I read an interesting blog post from an ex-YouTube engineer named Chris Zacharias. Chris relates a story about a project dubbed “Feather,” an endeavor to reduce page size from 1.2MB to less than 100KB (the speed at which a web page renders in a browser is related to its size). After much effort and hand-tooling, he reduced the page size to 98KB and launched it as an opt-in feature to a small fraction of traffic.
After a week of data collection, he reviewed the numbers and found that the page latency of those users who opted in to Feather had actually increased. This was a baffling result that no amount of in-house testing could seem to explain. That is, until one of Chris’s colleagues stumbled upon the answer: geography.
What they found was that a disproportionate number of Feather users were from regions of the globe where the earlier 1.2MB YouTube page had taken 20 minutes to load. By reducing the page size by a factor of 10, the page now took only 2 minutes to load, a bearable wait for those who wanted to watch a video online. Because the Feather sample was weighted so heavily toward these regions, its average page load time had gone up, even though each individual page loaded faster. If Chris had normalized his statistics to measure the average percentage reduction in page load time instead of the actual time to load, he would have seen what he expected.
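To make the arithmetic concrete, here is a minimal sketch in Python. Only the 20-minute and 2-minute load times come from the story; the user counts and the load times for well-connected regions are numbers I made up for illustration, and the function and variable names are my own.

```python
# Load times in seconds. The slow-region figures (20 min -> 2 min) come from
# the story; the fast-region figures are assumed for illustration only.
FAST_OLD, FAST_NEW = 10.0, 1.0
SLOW_OLD, SLOW_NEW = 1200.0, 120.0


def average_load_time(n_fast, n_slow, fast_time, slow_time):
    """Mean load time for a population made of fast- and slow-region users."""
    total = n_fast * fast_time + n_slow * slow_time
    return total / (n_fast + n_slow)


def average_percent_reduction(n_fast, n_slow):
    """Mean per-user fractional reduction in load time, old page vs. new page."""
    fast_cut = (FAST_OLD - FAST_NEW) / FAST_OLD
    slow_cut = (SLOW_OLD - SLOW_NEW) / SLOW_OLD
    return (n_fast * fast_cut + n_slow * slow_cut) / (n_fast + n_slow)


# Hypothetical mixes: slow-region users are rare on the old page (assumed
# 10 in 1,000) and common among Feather opt-ins (assumed 300 in 1,000).
before = average_load_time(990, 10, FAST_OLD, SLOW_OLD)
after = average_load_time(700, 300, FAST_NEW, SLOW_NEW)
reduction = average_percent_reduction(700, 300)

print(f"average load time, old page: {before:.1f} s")
print(f"average load time, Feather:  {after:.1f} s")
print(f"average per-user reduction:  {reduction:.0%}")
```

With these made-up mixes, the average load time rises from roughly 22 seconds to roughly 37 seconds, even though every user's page loads ten times faster. The normalized measure shows a 90% reduction regardless of where the users come from.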
I relate this story here because it highlights the importance of context in statistics, especially “summary” statistics like averages. As data becomes more abundant and available and computing power becomes cheaper, it is tempting to simply throw data at a model. To paraphrase Nate Silver: “throwing a bunch of variables in a blender does not make haute cuisine.” Without context, numbers like correlations are just numbers; without context, you are no better off building your portfolio from correlation numbers than you are building your portfolio from the number 42 or the flip of a coin.
It is important to remember that statistics and quantitative methods give us a way to explore and summarize data, but it is imperative that we understand the context in which we are measuring.