The Unreasonable Effectiveness of Data

A 2009 paper from Google researchers invoked the term “physics envy”: the desire of those of us in fields plagued by human behavior to be able to create neat, mathematical models. After all, it isn’t gas particles that rush in a disorderly manner to the exit when someone shouts “fire”: only humans do that.

Specifically, the paper discusses the applications of models to natural language processing (“NLP”).  Linguistic modelers share the same envy as financial modelers: cultural context and shared social experience can lead to highly ambiguous expressions that are still easily understood.  By humans, at least.

The authors say,

Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

The paper goes on to discuss how much “data at web scale” has benefited NLP, especially in relation to machine learning. In particular, there has been significant growth in statistical speech recognition and statistical machine translation. These areas have advanced not because they are simpler than other areas, but because the web provides such a fertile training set of data for these models: they are tasks humans perform every day, so examples of the desired input and output already exist in large quantities. Document classification, on the other hand, hasn’t seen the same benefits.

With the benefit of large data, simple is better.  To quote,

But invariably, simple models and a lot of data trump more elaborate models based on less data.

I am particularly fond of how the authors conclude the paper,

So, follow the data.  Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.  Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.

Mining Meaning from Data

Can we extend this beyond NLP and assume that domain knowledge will soon be trumped by simple data-mining? After all, global data is becoming more abundant, not less (some studies say that it is doubling every two years). In particular, and relevant to this author, can we extend the elegance of large data to investment management?

My belief is that the answer is “no” and my reasoning is that the success of the application of the “more data” approach requires that the solution to our problem is actually in the data itself.  In other words, not only does the data hold a lot of details, but there is a sufficient amount of data that can be used to extract the necessary details.  When the data holds the details, the tradeoff you can make is model complexity versus data quantity.

The underlying assumption here is that “more data” is “more relevant data.”  For the data to be relevant, the system we are modeling has to be fairly stable: the near future has to look like the past.  While language is constantly evolving, the core of the corpus changes at a glacial pace.  For example, the first e-mail ever sent by Ray Tomlinson in 1971 is going to be just as relevant as any e-mail written today for an NLP algorithm to learn from.

When More Isn’t More

In my opinion, financial markets and weather have a lot in common. When it comes to weather data, it could be argued that there is quite a lot of detail: temperature, pressure, humidity, precipitation, et cetera. So why are we so bad at predicting weather evolution? This is particularly puzzling given that we know the mathematical laws (fluid dynamics) that govern weather patterns. If we know the model, with certainty, and we have plenty of data with plenty of detail, why can’t my weatherman accurately tell me whether it is going to rain on January 8th, 2014 or not?

The problem is that weather demonstrates chaotic behavior, in the mathematical sense. Most relevantly, while entirely deterministic, weather is nonlinear and incredibly sensitive to initial conditions. To quote Edward Lorenz, “the present determines the future, but the approximate present does not approximately determine the future.” This is more commonly known as the butterfly effect: the initial uncertainty in our system is very, very small compared to the uncertainty in the system after a period of time.
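
To make the sensitivity concrete, here is a minimal sketch. It does not model weather: it uses the logistic map, a textbook chaotic system, as a stand-in, and the starting point, perturbation size and step count are arbitrary assumptions of mine. Two runs that start a tenth of a billionth apart are tracked side by side.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), a simple deterministic rule."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0=0.2, eps=1e-10, steps=60):
    """Gap at each step between two orbits whose starting points differ by eps."""
    a = logistic_orbit(x0, steps)
    b = logistic_orbit(x0 + eps, steps)
    return [abs(p - q) for p, q in zip(a, b)]


if __name__ == "__main__":
    gaps = divergence()
    for step in (0, 10, 20, 30, 40, 50, 60):
        print(step, gaps[step])
```

The rule is fully deterministic and known exactly, yet the gap between the two runs grows until the second run tells you essentially nothing about the first. A historical sample with "similar" initial conditions is in the same position as that second run.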

This ultimately means that drawing on a large set of historical samples where initial weather conditions are similar may provide little to no insight into how the weather will evolve.  More is not necessarily more.

In financial markets, one could argue that the aggregate beliefs of investors are ultimately translated to prices, in which case the data holds a lot of detail. However, I believe that financial markets are both non-deterministic and nonlinear. So, not only does the near future become incredibly uncertain very quickly, but I believe that the unpredictability and irrationality of market participants mean that for identical initial conditions, different decisions may be made. This makes historical samples almost entirely useless. As a simple example, consider how relevant market data from 2006 was to decision making in October 2008.

The Unreasonable Effectiveness of Simplicity

In my opinion, building tactical portfolios from a handful of asset-class level ETFs shares a lot in common with trying to pick the winner of Wimbledon (or, really, the bracket); managing risk is like trying to identify avalanches.  Neither “more data” nor complex models provide sufficient solutions for these problems.  Instead, it is simple tallying methods that reign supreme over optimized weightings and complex algorithms.

As another example, in “The Dog and the Frisbee,” Andrew Haldane demonstrated that a GARCH(1,1) model had a lower mean-squared prediction error than a GARCH(3,3) model until over 100,000 data points were used for fitting — despite the fact that a GARCH(3,3) model was used to generate the data itself!  The simpler model was able to reduce the error in its output by limiting the amount of error that could be introduced by its inputs.
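
Reproducing Haldane's experiment would require a GARCH fitting routine, so the sketch below swaps in a simpler stand-in of my own: data is generated by an AR(3) process with hypothetical coefficients, and an AR(1) and an AR(3) model are each fit by least squares and scored on out-of-sample one-step forecasts. Every name and parameter here is an assumption for illustration, and the numbers it prints depend on those choices.

```python
import random

TRUE_COEFS = [0.5, 0.05, 0.05]  # hypothetical AR(3) data-generating process


def simulate(n, rng, burn=50):
    """Draw n points from the AR(3) process above with unit-variance Gaussian noise."""
    x = [0.0, 0.0, 0.0]
    for _ in range(n + burn):
        nxt = sum(c * x[-1 - i] for i, c in enumerate(TRUE_COEFS))
        x.append(nxt + rng.gauss(0.0, 1.0))
    return x[-n:]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_ar(x, p):
    """Least-squares AR(p) coefficients (no intercept) via the normal equations."""
    rows = [[x[t - 1 - i] for i in range(p)] for t in range(p, len(x))]
    ys = [x[t] for t in range(p, len(x))]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


def forecast_mse(n_fit, p, trials=200, n_test=50, seed=0):
    """Average out-of-sample squared error of one-step AR(p) forecasts."""
    rng = random.Random(seed)  # same seed, so both models see the same data
    total, count = 0.0, 0
    for _ in range(trials):
        x = simulate(n_fit + n_test, rng)
        w = fit_ar(x[:n_fit], p)
        for t in range(n_fit, n_fit + n_test):
            pred = sum(w[i] * x[t - 1 - i] for i in range(p))
            total += (x[t] - pred) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    for n_fit in (20, 1000):
        print(n_fit, forecast_mse(n_fit, 1), forecast_mse(n_fit, 3))
```

With short fitting windows I would expect the AR(1) model to hold its own or win, because it has fewer coefficients to mis-estimate, while the correctly specified AR(3) should only pull ahead once the sample is large. Treat that as an expectation to check by running the sketch, not as a result.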

Our approach to investment management is driven by “simple.” We believe in simple models built on simple theories (e.g. momentum) combined with simple rules (e.g. tallying and bucketing).
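
As a rough illustration of what tallying and bucketing can look like, here is a minimal sketch. It is not our production model: the lookback windows, the three buckets and the function names are all hypothetical choices made for the example.

```python
def momentum_tally(prices, lookbacks=(1, 3, 6)):
    """Count how many lookback windows show a positive trailing price change."""
    votes = 0
    for lb in lookbacks:
        if len(prices) > lb and prices[-1] > prices[-1 - lb]:
            votes += 1
    return votes


def bucket_weight(votes, n_signals):
    """Map a tally to one of three coarse buckets: out, half invested, fully invested."""
    if 3 * votes >= 2 * n_signals:   # at least two thirds of signals agree
        return 1.0
    if 3 * votes >= n_signals:       # at least one third of signals agree
        return 0.5
    return 0.0


if __name__ == "__main__":
    # Hypothetical month-end prices for a single asset-class ETF.
    prices = [110, 100, 100, 106, 104, 102, 103]
    votes = momentum_tally(prices)
    print(votes, bucket_weight(votes, n_signals=3))
```

Each signal gets one equal vote and the tally is mapped to a coarse bucket. There are no estimated weights, so there is very little for noisy inputs to distort.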

In uncertain environments, simple heuristics tend to be more robust than complex decision rules.  And markets are very, very uncertain.

The Takeaway

As quants, we should always be looking towards engineering, mathematical and modeling advances in other fields.  However, we must always look through the lens of our domain knowledge to determine their applicability.  In my opinion, “big data” does not bring much to the table when it comes to tactical — or strategic — portfolio construction.

Corey is co-founder and Chief Investment Officer of Newfound Research, a quantitative asset manager offering a suite of separately managed accounts and mutual funds. At Newfound, Corey is responsible for portfolio management, investment research, strategy development, and communication of the firm's views to clients. Prior to offering asset management services, Newfound licensed research from the quantitative investment models developed by Corey. At peak, this research helped steer the tactical allocation decisions for upwards of $10bn. Corey holds a Master of Science in Computational Finance from Carnegie Mellon University and a Bachelor of Science in Computer Science, cum laude, from Cornell University. You can connect with Corey on LinkedIn or Twitter.