*This post is available as a PDF download here.*

# Summary

- Value investing continues to experience a trough of sorrow. In particular, the traditional price-to-book factor has failed to establish new highs since December 2006 and sits in a 25% drawdown.
- While price-to-book has been the academic measure of choice for 25+ years, many practitioners have begun to question its value (pun intended).
- We have also witnessed the turning of the tides against the size premium, with many practitioners no longer considering it to be a valid stand-alone anomaly. This comes 35+ years after the anomaly was first published.
- With this in mind, we explore the evidence that would be required for us to dismiss other, already established anomalies. Using past returns to establish prior beliefs, we simulate out forward environments and use Bayesian inference to adjust our beliefs over time, recording how long it would take for us to finally dismiss a factor.
- We find that for most factors, we would have to live through several careers to finally witness enough evidence to dismiss them outright.
- Thus, while factors may be established upon a foundation of evidence, their forward use requires a bit of faith.

In Norse mythology, Fimbulvetr (commonly referred to in English as “Fimbulwinter”) is a great and seemingly never-ending winter. It continues for three seasons – long, horribly cold years that stretch on longer than normal – with no intervening summers. It is a time of bitterly cold, sunless days where hope is abandoned and discord reigns.

This winter-to-end-all-winters is eventually punctuated by Ragnarok, a series of events leading up to a great battle that results in the ultimate death of the major gods, destruction of the cosmos, and subsequent rebirth of the world.

Investment mythology is littered with Ragnarok-styled blow-ups and we often assume the failure of a strategy will manifest as sudden catastrophe. In most cases, however, failure may more likely resemble Fimbulwinter: a seemingly never-ending winter in performance with returns blown to-and-fro by the harsh winds of randomness.

Value investors can attest to this. In particular, the disciples of price-to-book have suffered greatly as of late, with “expensive” stocks having outperformed “cheap” stocks for over a decade. The academic interpretation of the factor sits nearly 25% *below* its prior high-water mark seen in December 2006.

Expectedly, a large number of articles have been written about the death of the value factor. Some question the factor itself, while others simply argue that price-to-book is a broken implementation.

But are these simply retrospective narratives, driven by a desire to have an explanation for a result that has defied our expectations? Consider: if price-to-book had exhibited positive returns over the last decade, would we be hearing from nearly as large a number of investors explaining why it is no longer a relevant metric?

To be clear, we believe that many of the arguments proposed for *why* price-to-book is no longer a relevant metric are quite sound. The team at O’Shaughnessy Asset Management, for example, wrote a particularly compelling piece that explores how changes to accounting rules have led book value to become a less relevant metric in recent decades.^{1}

Nevertheless, we think it is worth taking a step back, considering an alternate course of history, and asking ourselves how it would impact our current thinking. Often, we look back on history as if it were the obvious course. “If only we had better prior information,” we say to ourselves, “we would have predicted the path!”^{2} Rather, we find it more useful to look at the past as just one realized path of many that could have happened, none of which were preordained. Randomness happens.

With this line of thinking, the poor performance of price-to-book can just as easily be explained by a poor roll of the dice as it can be by a fundamental break in applicability. In fact, we see several potential truths based upon performance over the last decade:

- This is all normal course performance variance for the factor.
- The value factor works, but the price-to-book measure itself is broken.
- The price-to-book measure is over-crowded in use, and thus the “troughs of sorrow” will need to be deeper than ever to get weak hands to fold and pass the alpha to those with the fortitude to hold.
- The value factor never existed in the first place; it was an unfortunate false positive that saturated the investing literature and broad narrative.

The problem at hand is two-fold: (1) the statistical evidence supporting most factors is considerable and (2) the decade-to-decade variance in factor performance is substantial. Taken together, a mere decade of underperformance likely cannot undo the previously established significance. Just as frustrating is the opposite scenario. Consider that these two statements are not mutually exclusive: (1) price-to-book is broken, and (2) price-to-book generates positive excess return over the next decade.

In investing, factor return variance is large enough that the proof is not in the eating of the short-term return pudding.

The small-cap premium is an excellent example of the difficulty in discerning, in real time, the integrity of an established factor. The anomaly has failed to establish a meaningful new high since it was originally published in 1981. Only in the last decade – nearly 30 years later – have the tides of the industry finally seemed to turn against it as an established anomaly and potential source of excess return.

*Thirty years.*

The remaining broadly accepted factors – e.g. value, momentum, carry, defensive, and trend – have all been demonstrated to generate excess risk-adjusted returns across a variety of economic regimes, geographies, and asset classes, creating a great depth of evidence supporting their existence. What evidence, then, would make us abandon faith from the Church of Factors?

To explore this question, we ran a simple experiment for each factor. Our goal was to estimate how long it would take to determine that a factor was no longer statistically significant.

Our assumption is that the salient features of each factor’s return pattern will remain the same (i.e. autocorrelation, conditional heteroskedasticity, skewness, kurtosis, et cetera), but the forward average annualized return will be zero since the factor no longer “works.”

Towards this end, we ran the following experiment:

1. Take the full history for the factor and calculate prior estimates for mean annualized return and standard error of the mean.
2. De-mean the time series.
3. Randomly select a 12-month chunk of returns from the time series and use the data to perform a Bayesian update to our mean annualized return.
4. Repeat step 3 until the annualized return is no longer statistically distinguishable from zero at a 99% confidence threshold.
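The procedure above can be sketched in code. This is a minimal illustration and not the authors' actual implementation: we assume a conjugate normal-normal Bayesian model with known (historical) variance, and the function name and its details are our own.

```python
import numpy as np

def years_until_dismissal(monthly_returns, z=2.576, seed=None):
    """One simulated path: bootstrap de-meaned 12-month chunks and
    Bayesian-update our belief about the factor's mean annual return
    until it is no longer distinguishable from zero at ~99% confidence.

    A sketch assuming a conjugate normal-normal model with known
    variance; the exact updating rule used in the commentary is not
    specified, so this is an approximation of the described process.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(monthly_returns, dtype=float)
    n_years = len(r) / 12.0

    # Step 1: prior -- historical mean annual return and its standard error.
    mu = 12.0 * r.mean()
    sigma2_annual = 12.0 * r.var(ddof=1)   # annualized variance
    tau2 = sigma2_annual / n_years         # squared standard error of the mean

    # Step 2: de-mean, so the factor "no longer works" going forward.
    demeaned = r - r.mean()

    years = 0
    while abs(mu) / np.sqrt(tau2) > z:     # still significant at ~99%
        # Step 3: draw a random 12-month chunk and update our belief.
        start = rng.integers(0, len(demeaned) - 12 + 1)
        chunk = demeaned[start:start + 12]
        obs_mu = 12.0 * chunk.mean()       # observed annual return
        obs_tau2 = sigma2_annual           # variance of a one-year estimate

        # Conjugate normal update of the mean.
        post_prec = 1.0 / tau2 + 1.0 / obs_tau2
        mu = (mu / tau2 + obs_mu / obs_tau2) / post_prec
        tau2 = 1.0 / post_prec
        years += 1                         # step 4: repeat until insignificant
    return years
```

Running this path many times (the commentary uses 10,000 trials) and taking the median yields the years-until-failure figures discussed below.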

For each factor, we ran this test 10,000 times, creating a distribution that tells us how many years into the future we would have to wait until we were certain, from a statistical perspective, that the factor is no longer significant.

*Sixty-seven years*.

Based upon this experiment, sixty-seven years is the median number of years we would have to wait until we officially declare price-to-book (“HML,” as it is known in the literature) to be dead.^{3} At the risk of being morbid, we’re far more likely to die before the industry finally sticks a fork in price-to-book.

We perform this experiment for a number of other factors – including size (“SMB” – “small-minus-big”), quality (“QMJ” – “quality-minus-junk”), low-volatility (“BAB” – “betting-against-beta”), and momentum (“UMD” – “up-minus-down”) – and see much the same result. It will take *decades* before sufficient evidence mounts to dethrone these factors.

| | HML | SMB^{4} | QMJ | BAB | UMD |
| --- | --- | --- | --- | --- | --- |
| Median Years-until-Failure | 67 | 43 | 132 | 284 | 339 |

Now, it is worth pointing out that these figures for a factor like momentum (“UMD”) might be a bit skewed due to the design of the test. If we examine the long-run returns, we see a fairly docile return profile punctuated by sudden and significant drawdowns (often called “momentum crashes”).

Since a large proportion of the cumulative losses are contained in these short but pronounced drawdown periods, demeaning the time-series ultimately means that the majority of 12-month periods actually exhibit *positive *returns. In other words, by selecting random 12-month samples, we actually expect a high frequency of those samples to have a positive return.

For example, using this process, 49.1%, 47.6%, 46.7%, and 48.8% of rolling 12-month periods are positive for the HML, SMB, QMJ, and BAB factors, respectively. For UMD, that number is 54.7%. Furthermore, if you drop the worst 5% of rolling 12-month periods for UMD, the average positive period is 1.4x larger than the average negative period. Taken together, outside of the rare (<5%) crash periods, not only are you more likely to select a positive 12-month period for UMD, but the positive periods you select are, on average, 1.4x larger than the negative ones.
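The rolling-period statistic quoted above can be computed with a short sketch. The function name and the use of simple 12-month sums are our own assumptions; the underlying factor data would come from the Kenneth French Data Library or AQR.

```python
import numpy as np

def pct_positive_rolling_years(monthly_returns):
    """Fraction of rolling 12-month periods with a positive total return
    after de-meaning -- the statistic quoted above (e.g. ~54.7% for UMD).

    A sketch: `monthly_returns` is assumed to be a 1-D array of monthly
    factor returns, and windows are summed arithmetically for simplicity.
    """
    r = np.asarray(monthly_returns, dtype=float)
    r = r - r.mean()  # de-mean, as in the experiment
    # Every rolling 12-month window, via a strided view (no copies).
    windows = np.lib.stride_tricks.sliding_window_view(r, 12)
    return float((windows.sum(axis=1) > 0).mean())
```

For a series with small frequent gains and rare deep losses (the momentum-crash profile), this fraction lands well above 50%, which is exactly the skew discussed above.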

The process of the test was selected to incorporate the salient features of each factor. However, in the case of momentum, it may lead to somewhat outlandish results.

## Conclusion

While an evidence-based investor should be swayed by the weight of the data, the simple fact is that most factors are so well established that the majority of current practitioners will likely go their entire careers without experiencing evidence substantial enough to dismiss any of the anomalies.

Therefore, in many ways, there is a certain faith required to use them going forward. Yes, these are ideas and concepts derived from the data. Yes, we have done our best to test their robustness out-of-sample across time, geographies, and asset classes. Yet we must also admit that there is a non-zero probability, however small it is, that these are false positives: a fact we may not have sufficient evidence to address until several decades hence.

And so a bit of humility is warranted. Factors will not suddenly stand up and declare themselves broken. And those that are broken will still appear to work from time-to-time.

Indeed, the death of a factor will be more Fimbulwinter than Ragnarok: not so violent as to be the end of days, but enough to cause pain and frustration among investors.

## Addendum

We have received a large number of inbound notes about this commentary, which fall along two primary lines of questioning. We want to address these points.

*How were the tests impacted by the Bayesian inference process?*

The results of the tests within this commentary are rather astounding. We did seek to address some of the potential flaws of the methodology we employed, but by and large we feel the overarching conclusion remains on a solid foundation.

While we only presented the results of the Bayesian inference approach in this commentary, as a check we actually tested two other approaches:

- A Bayesian inference approach assuming that forward returns would be a random walk with constant variance (based upon historical variance) and zero mean.
- Forward returns simulated using the same bootstrap approach, but assuming the factor was being discovered for the first time, with the entire history evaluated for its significance.

The two tests were an effort to isolate the effects of the different components of our test.

What we found was that while the reported figures changed, the overall magnitude did not. In other words, the median death-date of HML may not have been 67 years, but the order of magnitude remained much the same: decades.

Stepping back, these results were somewhat of a foregone conclusion. We would not expect an effect that has been determined to be statistically significant over a hundred-year period to unravel in a few years. Furthermore, we would expect a number of scenarios in which randomness alone continues to bolster the statistical strength.

*Why are we defending price-to-book?*

The point of this commentary was *not *to defend price-to-book as a measure. Rather, it was to bring up a larger point.

As a community, quantitative investors often leverage statistical significance as a defense for the way we invest.

We think that is a good thing. We *should* look at the weight of the evidence. We should be data driven. We should try to find ideas that have proven to be robust over decades of time and when applied in different markets or with different asset classes. We should want to find strategies that are robust to small changes in parameterization.

Many quants would argue (ourselves included), however, that there also needs to be a *why*. *Why* does this factor work? Without the *why*, we run the risk of glorified data mining. With the *why*, we can choose for ourselves whether we believe the effect will continue going forward.

Of course, there is nothing that prevents the *why* from being pure narrative fallacy. Perhaps we have simply woven a story into a pattern of facts.

With price-to-book, one might argue we have done the exact opposite. The effect, technically, remains statistically significant and yet plenty of ink has been spilled as to why it shouldn’t work in the future.

The question we must answer, then, is, “when does *statistically significant* apply and when does it not?” How can we use it as a justification in one place and completely ignore it in others?

Furthermore, if we are going to rely on hundreds of years of data to establish significance, how can we determine when something is “broken” if the statistical evidence does not support it?

Price-to-book may very well be broken. But that is not the point of this commentary. The point is simply that the same tools we use to establish and defend factors may prevent us from tearing them down.

## The Limit of Factor Timing

By Nathan Faber

On November 11, 2019


*This post is available as a PDF download here.*

## Summary

A few years ago, we began researching factor timing – moving among value, momentum, low volatility, quality, size etc. – with the hope of earning returns in excess not only of the equity market, but also of buy-and-hold factor strategies.

To time the factors, our natural first course of action was to exploit the behavioral biases that may create the factors themselves. We examined value and momentum across the factors and used these metrics to allocate to factors that we expected to outperform in the future.

The results were positive. However, taking into account transaction costs led to the conclusion that investors were likely better off simply holding a diversified factor portfolio.

We then looked at ways to time the factors using the business cycle.

The results in this case were even less convincing and were a bit too similar to a data-mined optimal solution to instill much faith going forward.

But this evidence does not necessarily remove the temptation to take a stab at timing the factors, especially since explicit transactions costs have been slashed for many investors accessing long-only factors through ETFs.

Source: Kenneth French Data Library, AQR. Calculations by Newfound Research. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.

After all, there is a lot to gain by choosing the right factors. For example, in the first 9 months of 2019, the spread between the best (Quality) and worst (Value) performing factors was nearly 1,000 basis points (“bps”). One month prior, that spread had been double!

In this research note, we will move away from devising a systematic approach to timing the factors (as AQR asserts, this is deceptively difficult) and instead focus on what a given method would have to overcome to achieve consistent outperformance.

## Benchmarking Factor Timing

With all equity factor strategies, the goal is usually to outperform the market-cap weighted equity benchmark.

Since all factor portfolios can be thought of as a market cap weighted benchmark plus a long/short component that captures the isolated factor performance, we can focus our study solely on the long/short portfolio.

Using the common definitions of the factors (from Kenneth French and AQR), we can look at periods over which these self-financing factor portfolios generate positive returns to see if overlaying them on a market-cap benchmark would have added value over different lengths of time.^{1}

We will also include the performance of an equally weighted basket of the four factors (“Blend”).

Source: Kenneth French Data Library, AQR. Calculations by Newfound Research. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions. Data from July 1957 – September 2019.

The persistence of factor outperformance over one-month periods is transient. If the goal is to outperform the most often, then the blended portfolio satisfies this requirement, and any timing strategy would have to be accurate enough to overcome this already existing spread.

Source: Kenneth French Data Library, AQR. Calculations by Newfound Research. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions. Data from July 1957 – September 2019.

The results for the blended portfolio are so much better than the stand-alone factors because the factors have correlations much lower than many other asset classes, allowing even naïve diversification to add tremendous value.

The blended portfolio also cuts downside risk in terms of returns. If the timing strategy is wrong, and chooses, for example, momentum in an underperforming month, then it could take longer for the strategy to climb back to even. But investors are used to short periods of underperformance and often (we hope) realize that some short-term pain is necessary for long-term gains.

Looking at the same analysis over rolling 1-year periods, we do see some longer periods of factor outperformance. Some examples are quality in the 1980s, value in the mid-2000s, momentum in the 1960s and 1990s, and size in the late-1970s.

However, there are also decent stretches where the factors underperform. For example, the recent decade for value, quality in the early 2010s, momentum sporadically in the 2000s, and size in the 1980s and 1990s. If the timing strategy gets stuck in these periods, then there can be a risk of abandoning it.

Source: Kenneth French Data Library, AQR. Calculations by Newfound Research. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions. Data from July 1957 – September 2019.

Again, a blended portfolio would have addressed many of these underperforming periods, giving up some of the upside with the benefit of reducing the risk of choosing the wrong factor in periods of underperformance.

And finally, if we extend our holding period to three years, which may be used for a slower moving signal based on either value or the business cycle, we see that the diversified portfolio still exhibits outperformance over the most rolling periods and has a strong ratio of upside to downside.

The diversified portfolio stands up to scrutiny against the individual factors, but could a generalized model that times the factors with a certain degree of accuracy lead to better outcomes?

## Generic Factor Timing

To construct a generic factor timing model, we will consider a strategy that decides to hold each factor or not with a certain degree of accuracy.

For example, if the accuracy is 50%, then the strategy would essentially flip a coin for each factor. Heads and that factor is included in the portfolio; tails and it is left out. If the accuracy is 55%, then the strategy will hold the factor with a 55% probability when the factor return is positive and not hold the factor with the same probability when the factor return is negative. Just to be clear, this strategy is constructed with look-ahead bias as a tool for evaluation.

All factors included in the portfolio are equally weighted, and if no factors are included, then the return is zero for that period.
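The rules of the toy model can be sketched as follows. This is our own minimal rendering of the description above; the function name and the (periods × factors) array layout are assumptions.

```python
import numpy as np

def simulate_timed_portfolio(factor_returns, accuracy, rng):
    """One simulation of the look-ahead toy model: each period, each
    factor is held with probability `accuracy` when its return is
    positive, and held with probability (1 - accuracy) when it is
    negative. Held factors are equally weighted; an empty portfolio
    returns zero for that period.

    A sketch: `factor_returns` is assumed to be a (periods x factors)
    array of long/short factor returns.
    """
    R = np.asarray(factor_returns, dtype=float)
    # One "coin flip" per factor per period. Holding a factor iff the
    # coin agrees with the sign of its (future) return builds in the
    # look-ahead bias deliberately -- this is an evaluation tool.
    correct = rng.random(R.shape) < accuracy
    held = np.where(R > 0, correct, ~correct)
    n_held = held.sum(axis=1)
    port = np.where(n_held > 0,
                    (R * held).sum(axis=1) / np.maximum(n_held, 1),
                    0.0)
    return port
```

At `accuracy=0.5` this reduces to coin-flipping; at `accuracy=1.0` it holds exactly the factors with positive returns, so the simulated portfolio never loses money in any period.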

This toy model will allow us to construct distributions to see where the blended portfolio of all the factors falls in terms of frequency of outperformance (hit rate), average outperformance, and average underperformance. The following charts show the percentiles of the diversified portfolio for the different metrics and model accuracies using 1,000 simulations.^{2}

In terms of hit rate, the diversified portfolio behaves in the top tier of the models over all time periods for accuracies up to about 57%. Even with a model that is 60% accurate, the diversified portfolio was still above the median.

For average underperformance, the diversified portfolio also did very well in the context of these factor timing models. The low correlation between the factors leads to opportunities for the blended portfolio to limit the downside of individual factors.

For average outperformance, the diversified portfolio did much worse than the timing model over all time horizons. We can attribute this also to the low correlation between the factors, as choosing only a subset of factors and equally weighting them often leads to more extreme returns.

Overall, the diversified portfolio manages the risks of underperformance, both in magnitude and in frequency, at the expense of sacrificing outperformance potential. We saw this in the first section when we compared the diversified portfolio to the individual factors.

But if we want to have increased return potential, we will have to introduce some model risk to time the factors.

## Checking in on Momentum

Momentum is one model-based way to time the factors. Under our definition of accuracy in the toy model, a 12-1 momentum strategy on the factors has an accuracy of about 56%. While the diversified portfolio exhibited some metrics in line with strategies that were even more accurate than this, it never bore concentration risk: it always held all four factors.
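The accuracy of a 12-1 momentum signal, in the hit-rate sense used by the toy model, could be estimated along these lines. This is a sketch under our own assumptions (function name, simple return sums, and the exact skip-month convention), not the authors' exact calculation.

```python
import numpy as np

def momentum_accuracy(factor_returns):
    """Hit rate of a 12-1 momentum signal on each factor: how often the
    sign of the trailing 12-month return (skipping the most recent
    month) matches the sign of the next month's return. A sketch of how
    the ~56% accuracy figure quoted above could be estimated.

    `factor_returns` is assumed to be a (periods x factors) array of
    monthly long/short factor returns.
    """
    R = np.asarray(factor_returns, dtype=float)
    hits, total = 0, 0
    for t in range(12, len(R)):
        # Trailing months t-12 .. t-2 (most recent month skipped).
        signal = R[t - 12:t - 1].sum(axis=0)
        hits += int(np.sum((signal > 0) == (R[t] > 0)))
        total += R.shape[1]
    return hits / total
```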

For the hit rate percentiles of the momentum strategy, we see a more subdued response. Momentum does not win as much as the diversified portfolio over the different time periods.

But not winning as much can be fine if you win bigger when you do win.

The charts below show that momentum does indeed have a higher outperformance percentile but with a worse underperformance percentile, especially for 1-month periods, likely due to mean-reversionary whipsaw.

While momentum is definitely not the only way to time the factors, it is a good baseline to see what is required for higher average outperformance.

Now, turning back to our generic factor timing model, what accuracy would you need to beat momentum?

## Sharpening our Signal

The answer is: not a whole lot. Most of the time, we only need to be about 53% accurate to beat the momentum-based factor timing.

Source: Kenneth French Data Library, AQR. Calculations by Newfound Research. Past performance is not an indicator of future results. Performance is backtested and hypothetical. Performance figures are gross of all fees, including, but not limited to, manager fees, transaction costs, and taxes. Performance assumes the reinvestment of all distributions.

The caveat is that this is the median performance of the simulations. The accuracy figure climbs closer to 60% if we use the 25^{th} percentile as our target.

While these may not seem like extremely high requirements for running a successful factor timing strategy, it is important to observe that not many investors are doing this. True accuracy may be hard to discover, and sticking with the system may be even harder when the true accuracy can never be known.

## Conclusion

If you made it this far looking for some rosy news on factor timing or the Holy Grail of how to do it skillfully, you may be disappointed.

However, for most investors looking to generate some modest benefits relative to market-cap equity, there is good news. Any signal for timing factors does not have to be highly accurate to perform well, and in the absence of a signal for timing, a diversified portfolio of the factors can lead to successful results by the metrics of average underperformance and frequency of underperformance.

For those investors looking for higher outperformance, concentration risk will be necessary.

Any timing strategy on low correlation investments will generally forego significant diversification in the pursuit of higher returns.

While this may be the goal when constructing the strategy, we should always pause and determine whether the potential benefits outweigh the costs. Transaction costs may be lower now. However, there are still operational burdens and the potential stress caused by underperformance when a system is not automated or when results are tracked too frequently.

Factor timing may be possible, but timing and tactical rotation may be better suited to scenarios where some of the model risk can be mitigated.