According to IBM, the growth of data has been so exponential that "90% of the data in the world today has been created in the last two years alone." [1]  Welcome to the world of "big data."  It's no surprise that we're starting to see hedge funds touting an edge (backtested, of course) through advanced machine learning techniques in this wide world of data.  It's an informational advantage: the ability to mine and discover meaning that would otherwise go unnoticed in an ocean of data.

At the time of writing, Federal Reserve Economic Data ("FRED"), from the Federal Reserve Bank of St. Louis, tracks 61,000 US and international time series from 49 different sources.  To mis-quote Nate Silver, you can't just throw a bunch of variables in a blender and expect haute cuisine.  The danger of complex models combined with big data is that we risk mining what we think are causal links from the data when they are nothing but statistical artifacts (see, for example, moon cycles [2]).
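As a toy illustration (my own, using purely simulated data, not any of the series above), consider how easily a "pattern" can be mined from a big enough pile of candidates: among thousands of independent random walks, some will appear strongly correlated with any target series by chance alone.

```python
# Toy illustration with simulated data: screen thousands of independent
# random walks against a target random walk and see how strong the best
# purely-spurious correlation looks.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_series = 252, 10_000  # one "year" of daily data, 10,000 candidate predictors

target = rng.standard_normal(n_obs).cumsum()                         # target random walk
candidates = rng.standard_normal((n_series, n_obs)).cumsum(axis=1)   # unrelated random walks

# Pearson correlation of each candidate with the target
t = target - target.mean()
c = candidates - candidates.mean(axis=1, keepdims=True)
corrs = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))

print(f"best spurious correlation: {np.abs(corrs).max():.2f}")  # routinely above 0.9
```

None of these candidate series has anything to do with the target, yet a naive search would happily "discover" a near-perfect relationship.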

Even if there is hidden meaning in the data, can we rely on complex models to find it?

In a recent paper I posted about, Andrew Haldane performs a test [3].  He starts with a known GARCH(3,3) process and then tries to fit GARCH(n,n) models, for n in {1,2,3,4,5}, at different sample sizes.  It isn't until the sample size hits nearly 100,000 samples that a fitted GARCH(3,3) model has a lower mean-squared prediction error than the parsimonious GARCH(1,1) -- despite the fact that the underlying process is GARCH(3,3).  In other words, even knowing the underlying model doesn't protect us from fitting error and parameter uncertainty.

Assuming those are daily samples from market data (at ~252 trading days per year), we would need nearly 397 years of data before our complex model outperforms our simple one.
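To make the experiment concrete, here is a minimal sketch of the idea (my own rough reconstruction with illustrative parameters, not Haldane's exact methodology): simulate a known GARCH(3,3) process, fit both GARCH(1,1) and GARCH(3,3), and compare how well each recovers the true conditional variance at different sample sizes.  It assumes the third-party arch package for estimation.

```python
# Sketch of a Haldane-style experiment: simulate a known GARCH(3,3) process,
# then check whether fitting the "true" GARCH(3,3) actually beats the
# parsimonious GARCH(1,1) at recovering the conditional variance.
# Parameters are illustrative, not taken from the paper.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(42)

def simulate_garch33(n, omega=0.05, alphas=(0.05, 0.04, 0.03), betas=(0.40, 0.20, 0.10)):
    """Return (returns, true conditional variance) from a GARCH(3,3) simulation."""
    burn = 500
    total = n + burn
    sigma2 = np.full(total, omega / (1 - sum(alphas) - sum(betas)))  # start at unconditional variance
    eps = np.zeros(total)
    for t in range(3, total):
        sigma2[t] = (omega
                     + sum(a * eps[t - i - 1] ** 2 for i, a in enumerate(alphas))
                     + sum(b * sigma2[t - i - 1] for i, b in enumerate(betas)))
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps[burn:], sigma2[burn:]

# 100,000 daily samples is roughly 100,000 / 252 ≈ 397 years of market data
for n in (1_000, 10_000, 100_000):
    returns, true_var = simulate_garch33(n)
    for p, q in ((1, 1), (3, 3)):
        res = arch_model(returns, mean="Zero", vol="GARCH", p=p, q=q).fit(disp="off")
        mse = np.mean((res.conditional_volatility ** 2 - true_var) ** 2)
        print(f"n={n:>7}  GARCH({p},{q})  variance MSE: {mse:.6f}")
```

With small samples, the extra parameters of the richer model tend to be estimated so noisily that the simpler specification can do just as well or better -- which is exactly Haldane's point.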

So what good are models and data if we need an insurmountable amount before they become relevant?  Sometimes, simple models and historical data are useful to remind us of what can be.  For example, over the last 50 years, the one-year realized correlation between daily S&P 500 changes and changes in the U.S. 10-year (constant maturity) yield (which should be inversely correlated to bond prices) has ranged from -93% to 86%.  In other words, going forward, the data tells us pretty much nothing about our ability to diversify our portfolio with these two assets -- which is actually important if diversification is what we care about.  This historical range makes sense, as the joint behavior of stocks and bonds changes depending on whether the market is being driven by economic or inflationary factors.
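For the curious, a rolling one-year correlation along these lines can be computed as follows (a minimal sketch using FRED data via pandas-datareader; note that FRED's SP500 series only covers roughly the last decade, so reproducing the full 50-year range would require a longer equity price history from another source):

```python
# Rolling one-year correlation between daily S&P 500 changes and daily
# changes in the 10-year constant-maturity Treasury yield, using FRED data.
import pandas as pd
from pandas_datareader import data as pdr

raw = pdr.DataReader(["SP500", "DGS10"], "fred", start="2015-01-01").dropna()

changes = pd.DataFrame({
    "spx": raw["SP500"].pct_change(),   # daily S&P 500 percentage change
    "dy10": raw["DGS10"].diff(),        # daily change in the 10-year yield (in %)
}).dropna()

# ~252 trading days per year -> one-year realized correlation
rolling_corr = changes["spx"].rolling(252).corr(changes["dy10"])
print(rolling_corr.min(), rolling_corr.max())
```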

Worse is when external factors manipulate these assets and they become reflexive of each other.  Consider that the stock market is generally thought of as being reflective of economic expectations; yet Ben Bernanke has explicitly discussed the "wealth effect" whereby consumer confidence and spending can be restored by propping up financial markets.  Suddenly, the economy becomes reflexive of the market, and it's done through interest rate manipulation.  How can we rely on fixed-income models that attempt to predict interest rates from fundamental economic information when changes in rates are now entirely "event driven"?  Fortunately, data gave us a pretty good rule of thumb to tell us how our stocks and bonds should behave: somewhere between -93% and 86% correlated.

Prediction based on statistical models requires a key element: we assume the short-term future looks like the near-term past (more technically, we assume that past frequencies imply future probabilities, and that these distributions exhibit short-term stationarity).  This assumption is not unreasonable for marketing, supply chain, customer acquisition, or even robotic failure analysis.  Markets, however, are moving targets.

So the only way we can rely on the predictive ability of our statistics is if we believe what we are modeling is independently and identically distributed.  By making our models as simple as possible and reducing the number of inputs, not only do we reduce parameter uncertainty, increasing model robustness in unforeseen market environments, but we also increase our belief that the outputs from our models can be analyzed with statistics.

That's why at Newfound we don't rely on complicated models (measured as those with a high degree of parameterization).  It's why the strategies we power are "simple" in design.  Big data and complicated models don't increase understanding -- they increase perceived understanding, which is far more dangerous than simply saying "I don't know."

Here is a teaser image from an upcoming white paper I am working on:

[Image: structure_plots]

[1] http://www-01.ibm.com/software/data/bigdata/

[2] Thompson, Derek.  "Do Moon Cycles Affect the Stock Market?"  The Atlantic.  Viewed at http://www.theatlantic.com/business/print/2009/06/do-moon-cycles-affect-the-stock-market/19933/

[3] Haldane, Andrew G.  "The dog and the frisbee".  31 August 2012.  Viewed at http://www.bankofengland.co.uk/publications/Documents/speeches/2012/speech596.pdf

Corey is co-founder and Chief Investment Officer of Newfound Research, a quantitative asset manager offering a suite of separately managed accounts and mutual funds. At Newfound, Corey is responsible for portfolio management, investment research, strategy development, and communication of the firm's views to clients.

Prior to offering asset management services, Newfound licensed research from the quantitative investment models developed by Corey. At peak, this research helped steer the tactical allocation decisions for upwards of $10bn.

Corey is a frequent speaker on industry panels and contributes to ETF.com, ETF Trends, and Forbes.com’s Great Speculations blog. He was named a 2014 ETF All Star by ETF.com.

Corey holds a Master of Science in Computational Finance from Carnegie Mellon University and a Bachelor of Science in Computer Science, cum laude, from Cornell University.

You can connect with Corey on LinkedIn or Twitter.