Part of being a quant is having an intuitive understanding of the methodologies you use. Our job is to choose the mathematical process that lines up with the purpose at hand.
[Edit: In the spirit of openness, there is a discussion on Cross Validated about this methodology, covering both the replicability of these results and the mathematical rigor. This post was not meant to introduce a defined, new measure; it is just a work in progress. Please take it only as a pragmatic measure that I was toying around with and nothing more serious than that.]
Correlation is one of those metrics that has always bugged me. We've blogged before about how it can be a deceptive metric, but never offered more of a solution than, "make sure you assume a zero mean." Correlation is incredibly important to us because it helps us determine how to manage model failure. When our models are purely price-based, a model will likely err at the same time on two highly correlated price series. In other words, a lack of diversification in asset price movement means a lack of model diversification.
Of course, getting a quantitative measure of "highly correlated" to match my intuitive understanding has never worked out so well. Correlated, for me, tends to mean: "oscillates with the same periodicity in the same overall trend." A daily correlation measure, as defined mathematically, is actually a measure of how closely the noise of two time series moves together (where noise is the divergence of a time series from its average trend). These definitions certainly don't line up.
Well, I've been playing around with a new model for similarity that I think helps fit our intuitive understanding a little better. Consider the following graph, which shows how a dollar invested in the gold ETF GLD would have fared compared to a dollar invested in the S&P 500 ETF SPY.
My intuition is that these two assets exhibit negative correlation. Yet a traditional measure of correlation, even assuming a zero mean, comes out to 27.03%.
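For reference, here is a minimal sketch of what "assuming a zero mean" looks like in practice. The function name `zero_mean_corr` and the return values are made up for illustration; they are not the GLD and SPY data behind the figure above.

```python
import numpy as np


def zero_mean_corr(x, y):
    # Correlation with both means assumed to be zero: the raw cross-moment
    # divided by the two root-mean-square values.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(x * y) / np.sqrt(np.mean(x ** 2) * np.mean(y ** 2))


# Hypothetical daily returns for two assets (illustrative only).
returns_a = [0.010, -0.020, 0.015, -0.005]
returns_b = [0.020, -0.010, 0.010, -0.010]

print(zero_mean_corr(returns_a, returns_b))        # zero-mean version
print(np.corrcoef(returns_a, returns_b)[0, 1])     # usual sample correlation
```

The only difference from the usual sample correlation is that no sample mean is subtracted before the products are averaged.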
Thinking more about what my intuitive definition was, I started playing around with a different measure. First, I run a 20-day simple moving average over the time series. Then, I calculate the %-difference of price from the moving average, which gives me a value that oscillates around zero (but can remain very positive or very negative for long trends). Consider the following two graphs, which show these values for GLD and SPY.
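The two steps above can be sketched with current pandas as follows. The helper name `pct_divergence` and the toy prices are assumptions for illustration; the post itself uses a 20-day window on GLD and SPY.

```python
import pandas as pd


def pct_divergence(prices, window=20):
    # Step 1: simple moving average over the last `window` observations.
    sma = prices.rolling(window).mean()
    # Step 2: %-difference of price from that average. The first
    # `window - 1` rows have no average yet, so they are dropped.
    return (prices / sma - 1.0).dropna()


# Toy price series (hypothetical), with a short window so the
# arithmetic is easy to follow by hand.
prices = pd.Series([1.0, 2.0, 3.0, 4.0])
print(pct_divergence(prices, window=2))
```

With a steadily rising price the divergence stays positive, which is the "can remain very positive or very negative for long trends" behavior described above.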
Seeing how the blue area-graphs line up in almost opposite fashion, this metric may meet our intuitive definition well. By taking the zero-mean correlation of the %-divergence values, we get a 'similarity' measure of -9% for SPY & GLD.
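To see how the two measures can disagree, here is a rough sketch on synthetic prices (not market data). The two series are built to share their daily noise while riding slow swings in opposite directions. The names `zero_mean_corr` and `similarity`, the seed, and every parameter of the synthetic series are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def zero_mean_corr(x, y):
    # Correlation with both means assumed to be zero.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.mean(x * y) / np.sqrt(np.mean(x ** 2) * np.mean(y ** 2))


def similarity(a, b, window):
    # Zero-mean correlation of each price's %-divergence from its own SMA.
    div_a = (a / a.rolling(window).mean() - 1.0).dropna()
    div_b = (b / b.rolling(window).mean() - 1.0).dropna()
    return zero_mean_corr(div_a, div_b)


# Synthetic log-prices: a shared random walk plus a slow swing that enters
# the two series with opposite signs.
rng = np.random.default_rng(0)
t = np.arange(480)
swing = 0.10 * np.sin(2 * np.pi * t / 120)
noise = np.cumsum(rng.normal(0.0, 0.006, size=t.size))
a = pd.Series(np.exp(swing + noise))
b = pd.Series(np.exp(-swing + noise))

daily = zero_mean_corr(a.pct_change().dropna(), b.pct_change().dropna())
sim_20 = similarity(a, b, window=20)
print(f"daily-return correlation: {daily:+.2f}")
print(f"20-day similarity:        {sim_20:+.2f}")
```

On this construction the daily-return correlation is positive, because the shared noise dominates day to day, while the similarity measure is negative, because the opposing swings dominate the divergence from the moving average.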
Obviously, this measure changes based on how long the moving average is. The longer the moving average, the more we capture the "trend" element. If I make the moving average a 40-day instead of a 20-day, I get a similarity measure of -54%. The shorter the moving average, the more we capture day-to-day noise. Technically, if we used a 1-day lagged moving average, we would actually be calculating correlation itself.
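The 1-day case can be checked directly: a 1-day moving average lagged by one day is just yesterday's price, so the %-divergence collapses to the ordinary daily return. A small sketch, using made-up prices:

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only to check the identity.
prices = pd.Series([100.0, 101.0, 99.5, 102.0, 101.0])

# A 1-day moving average is the price itself; lagging it by one day
# gives yesterday's price.
lagged_ma = prices.rolling(1).mean().shift(1)
divergence = (prices / lagged_ma - 1.0).dropna()

# Ordinary daily simple returns.
daily_returns = prices.pct_change().dropna()

print(np.allclose(divergence, daily_returns))
```

Taking the zero-mean correlation of these divergences is therefore the same as taking the zero-mean correlation of daily returns.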
The measure is by no means finalized. It is just a work in progress that I thought I would share, to see if anyone has played around with something similar.
import pandas
import numpy
import datetime
import collections
import pandas.io.data as web


def rolling_std_zero_mean(df, lookback):
    # Returns a rolling standard deviation where a mean
    # of zero is assumed for all random variables
    def _get_std(x):
        return numpy.sqrt(numpy.mean(numpy.square(x)))
    return pandas.rolling_apply(df, lookback, _get_std)


def rolling_corr_pairwise_zero_mean(df, lookback):
    # Returns a rolling correlation matrix where a mean of
    # zero is assumed for all random variables
    all_results = collections.defaultdict(dict)
    for i, k1 in enumerate(df.columns):
        for k2 in df.columns[i:]:
            std_k1 = rolling_std_zero_mean(df[k1], lookback)
            std_k2 = rolling_std_zero_mean(df[k2], lookback)
            joined = df[k1] * df[k2]
            rolling_avg = pandas.rolling_mean(joined, lookback)
            corr = rolling_avg / (std_k1 * std_k2)
            all_results[k1][k2] = corr
            all_results[k2][k1] = corr
    return pandas.Panel.from_dict(all_results).swapaxes('items', 'major')


if __name__ == '__main__':
    start_date = datetime.datetime(2012, 10, 1)
    end_date = datetime.datetime(2013, 5, 15)

    spy = web.get_data_yahoo('spy', start_date, end_date)['Adj Close']
    gld = web.get_data_yahoo('gld', start_date, end_date)['Adj Close']

    spy_ma = pandas.rolling_mean(spy, 20).dropna()
    gld_ma = pandas.rolling_mean(gld, 20).dropna()

    dist_spy = (spy / spy_ma - 1.).dropna()
    dist_gld = (gld / gld_ma - 1.).dropna()

    series = pandas.concat([dist_spy, dist_gld], axis=1)
    print(rolling_corr_pairwise_zero_mean(series, lookback=len(series)).ix[-1])
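The listing above relies on `pandas.rolling_apply`, `pandas.rolling_mean`, `pandas.Panel`, `.ix`, and `pandas.io.data`, all of which have since been removed from pandas. Below is a rough port of the two helper functions to the current `.rolling()` API. It is only a sketch: the Panel is replaced by a DataFrame with `(k1, k2)` column pairs, which is my own substitution, and the Yahoo download is replaced by a few made-up divergence values so the example runs offline.

```python
import numpy as np
import pandas as pd


def rolling_std_zero_mean(s, lookback):
    # Rolling root-mean-square: a standard deviation with the mean fixed at zero.
    return np.sqrt((s ** 2).rolling(lookback).mean())


def rolling_corr_pairwise_zero_mean(df, lookback):
    # Rolling zero-mean correlation for every pair of columns. The result has
    # one column per (k1, k2) pair, standing in for the original Panel.
    results = {}
    for i, k1 in enumerate(df.columns):
        for k2 in df.columns[i:]:
            cross = (df[k1] * df[k2]).rolling(lookback).mean()
            corr = cross / (rolling_std_zero_mean(df[k1], lookback)
                            * rolling_std_zero_mean(df[k2], lookback))
            results[(k1, k2)] = corr
            results[(k2, k1)] = corr
    return pd.DataFrame(results)


# Hypothetical %-divergence values standing in for the SPY and GLD downloads.
series = pd.DataFrame({
    "spy": [0.010, 0.020, -0.010, 0.005],
    "gld": [-0.020, -0.010, 0.020, -0.005],
})

latest = rolling_corr_pairwise_zero_mean(series, lookback=len(series)).iloc[-1]
corr_matrix = latest.unstack()  # square matrix indexed by column name
print(corr_matrix)
```

To reproduce the post's numbers, `series` would instead hold the 20-day %-divergence of the adjusted closes for SPY and GLD, fetched from whatever price source is available.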