
Summary

  • Defensive equity strategies are composed of stocks that lose less than the market during bear markets while keeping up with the market during bull markets.
  • Coarse sorts on metrics such as volatility, beta, value, and momentum lead to diversified portfolios but have mixed results in terms of their defensive characteristics, especially through different crisis periods that may favor one metric over another.
  • Non-linear machine learning techniques are an appealing way to identify the combinations of factors that lead to better defensive equity strategies across multiple periods.
  • By applying techniques such as random forests and gradient boosting to two sample defensive equity metrics, we find that machine learning does not add significant value over a low volatility sort, given the features included in the model.
  • While this by no means rules out the benefits of machine learning techniques, it shows that a blanket application of them is not a panacea for investing during crisis periods.

There is no shortage of hypotheses as to what characteristics define a stock that will outperform in a bear market.  Some argue that value stocks should perform well, given their relative valuation buffer (the “less far to fall” argument).  Some argue for a focus on balance sheet strength while others argue that cash-flow is the ultimate lifeblood of a company and should be prioritized.  There are even arguments for industry preferences based upon economic cyclicality.

Each recession and crisis is unique, however, and therefore the characteristics of stocks that fare best will likely change.  For example, the dot-com run-up caused a large number of real-economy businesses to be sorted into the “cheap” bucket of the value factor.  These companies also tended to have higher quality earnings and lower beta / volatility than the dot-com stocks.

Common sense would suggest that unconstrained value may be a natural hedge against large, speculative bubbles, but we need only look to 2008 – a credit and liquidity event – to see that value is not a panacea for every type of crisis.

It is for this reason that some investors prefer to take their cues from market-informed metrics such as beta, volatility, momentum, or trading volume.

Regardless of approach, there are some philosophical limitations we should consider when it comes to expectations with defensive equity portfolios.  First, if we were able to identify an approach that could avoid market losses, then we would expect that strategy to also have negative alpha.1 If this were not the case, we could construct an arbitrage.

Therefore, in designing a defensive equity portfolio, our aim should be to provide ample downside protection against market losses while minimizing the relative upside participation cost of doing so.

Traditional linear sorts – such as buying the lowest volatility stocks – are coarse by design.  They aim to robustly capture a general truth and hedge missed subtleties through diversification.  For example, while some stocks deserve to be cheap and some stocks are expensive for good reason, naïve value sorts will do little to distinguish them from those that are unjustifiably cheap or rich.

For a defensive equity portfolio, however, this coarseness may not only reduce effectiveness, but it may also increase the implicit cost.  Therefore, in this note we implement non-linear techniques in an effort to more precisely identify combinations of characteristics that may create a more effective defensive equity strategy.

The Strategy Objective

We must begin by defining precisely what we mean by a “defensive equity strategy.”  What are the characteristics that would make us label one security as defensive and another as not?  Or, better yet, is there a characteristic that allows us to rank securities on a gradient of defensiveness?

This is not a trivial decision, as our entire exercise will attempt to maximize the probability of correctly identifying securities with this characteristic.

As our goal is to find those securities which provide the most protection during equity market routs but bleed the least during equity market rallies, we chose a metric that scored how closely a stock’s return reflected the payoff of a call option on the S&P 500 over the next 63 trading days (approximately 3 months).

In other words, if the S&P 500 is positive over the next 63 trading days, the score of a security is equal to the squared difference between its return and the S&P 500’s return.  If the market’s return is negative, the score of a security is simply its squared return.
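
For concreteness, below is a minimal sketch of how this score might be computed, assuming a DataFrame of daily stock returns and a Series of daily S&P 500 returns (the function and variable names are hypothetical, not our production code):

```python
import numpy as np
import pandas as pd

def defensiveness_score(stock_returns: pd.DataFrame,
                        market_returns: pd.Series,
                        horizon: int = 63) -> pd.DataFrame:
    """Squared deviation from a call-option-like payoff on the market.

    If the market's forward return over `horizon` days is positive, the score
    is the squared difference between the stock's and the market's return;
    if it is negative, the score is simply the stock's squared return.
    Lower scores indicate a more defensive, call-option-like profile.
    """
    # Forward cumulative returns: the value at date t covers the next `horizon` trading days.
    fwd_stock = (1 + stock_returns).rolling(horizon).apply(np.prod, raw=True).shift(-horizon) - 1
    fwd_market = (1 + market_returns).rolling(horizon).apply(np.prod, raw=True).shift(-horizon) - 1

    # Call-option-like target: the market return when positive, zero otherwise.
    target = fwd_market.clip(lower=0)

    return fwd_stock.sub(target, axis=0) ** 2
```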

To determine whether this metric reflects the type of profile we want, we can create a long/short portfolio.  Each month we rank securities by their scores and select the quintile with the lowest scores.  Securities are then weighted by their market capitalization.  Securities are held for three months and the portfolio is implemented with three tranches.  The short leg of the portfolio is the market rather than the highest quintile, as we are explicitly trying to identify defense against the market.
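
A rough sketch of the long-leg construction is below, assuming month-end DataFrames of scores and market capitalizations indexed by date with one column per security (the helper name is illustrative):

```python
import pandas as pd

def long_leg_weights(scores: pd.DataFrame, market_caps: pd.DataFrame) -> pd.DataFrame:
    """Each month, cap-weight the lowest-score quintile; hold via three overlapping tranches."""
    sleeves = []
    for date in scores.index:
        s = scores.loc[date].dropna()
        selected = s[s <= s.quantile(0.2)].index      # lowest-score quintile
        caps = market_caps.loc[date, selected]
        sleeves.append(caps / caps.sum())              # cap-weighted sleeve
    sleeves = pd.DataFrame(sleeves, index=scores.index).fillna(0.0)

    # Three-month hold via three tranches: the live portfolio is the average
    # of the sleeves formed over the trailing three months.
    return sleeves.rolling(3, min_periods=1).mean()
```

The long/short index is then simply this long leg minus the capitalization-weighted market return.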

To create a scalable solution, we restrict our investable universe to the top 1,000 securities by market capitalization.

We plot the performance below.

Source: Sharadar Fundamentals.  Calculations by Newfound Research.  Returns are hypothetical and backtested.  Returns are gross of all fees including, but not limited to, management fees, transaction fees, and taxes.  Returns assume the reinvestment of all distributions.

We can see that the strategy is relatively flat during bull markets (1998-2000, 2003-2007, 2011-2015, 2016-2018), but rallies during bear markets and sudden market shocks (2000-2003, 2008, 2011, 2015/2016, Q4 2018, and 2020).

Interestingly, despite having no sector constraints and not explicitly targeting tracking error at the portfolio level, the resulting portfolio ends up well diversified across sectors, though it does appear to make significant short-term jumps in sector weights.  We can also see an increasing tilt towards Technology in the portfolio over the last three years.  In recent months, positions in Financials and Industrials have been almost entirely eliminated.

Source: Sharadar Fundamentals.  Calculations by Newfound Research. 

Of course, this metric is explicitly forward looking.  We’re using a crystal ball to peer into the future and identify those stocks that track the best on the way up and protect the best on the way down.  Our goal, then, is to use a variety of company and security characteristics to accurately forecast this score.

We will include a variety of characteristics and features, including:

  • Size: Market Capitalization.
  • Valuation: Book-to-Price, Earnings-to-Price, Free Cash Flow-to-Price, Revenue-to-EV, and EBITDA-to-EV.
  • Momentum: 12-1 Month Return and 1-Month Return.
  • Risk: Beta, Volatility, Idiosyncratic Volatility, and Ulcer Index.
  • Quality: Accruals, ROA, ROE, CFOA, GPOA, Net Margin, Asset Turnover, Leverage, and Payout Ratio.
  • Growth: Internal Growth Rate, EPS Growth, Revenue Growth.

These 24 features are all cross-sectionally ranked at each point in time.  We also include dummy variables for each security to represent sector inclusion as well as whether the company has positive Net Income and whether the company has positive Operating Cash Flow.
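
A minimal sketch of this feature preparation is below, assuming a DataFrame indexed by (date, ticker) with one column per raw characteristic (the column names are placeholders for illustration):

```python
import pandas as pd

def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    # A subset of the characteristics listed above; the remaining columns follow the same pattern.
    numeric = ["market_cap", "book_to_price", "earnings_to_price", "mom_12_1",
               "beta", "volatility", "roa", "leverage"]

    # Cross-sectional percentile ranks within each date.
    ranked = raw[numeric].groupby(level="date").rank(pct=True)

    # Sector and profitability dummies.
    dummies = pd.get_dummies(raw["sector"], prefix="sector")
    dummies["positive_net_income"] = (raw["net_income"] > 0).astype(int)
    dummies["positive_op_cash_flow"] = (raw["operating_cash_flow"] > 0).astype(int)

    return ranked.join(dummies)
```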

Note that we are not including any market regime characteristics, such as information about market returns, volatility, interest rates, credit spreads, sentiment, or monetary or fiscal policy.  Had we included such features, our resulting model may have ended up as a factor-switching approach, changing which characteristics it selects based upon the market environment.  This may be an interesting model in its own right, but our goal herein is simply to design a static, non-linear factor sort.

Random Forests

Our first approach will be to apply a random forest algorithm, which is an ensemble learning method.  The approach uses a training data set to build a number of individual decision trees whose results are then re-combined to create the ultimate decision.  By training each tree on a subset of data and considering only a subset of features for each node, we can create trees that may individually have high variance, but as an aggregate forest reduce variance without necessarily increasing bias.

As an example, this means that one tree may be built using a mixture of low volatility and quality features, while another may be built using valuation and momentum features.  Each tree is able to model a non-linear relationship, but by restricting tree depth and building trees using random subsets of data and features, we can prevent overfitting.

There are a number of hyperparameters that can be set to govern the model fit.  For example, we can set the maximum depth of the individual trees as well as the number of trees we want to fit.  Tuning hyperparameters is an art unto itself, and rather than go down the rabbit hole of cross-validation, we did our best to select reasonable values.  We elected to train the model on 50% of our data (March 1998 to March 2009), with a total of 100 trees, each with a maximum depth of 2.
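
For illustration, a random forest with these hyperparameters can be fit as follows (using scikit-learn purely as an example; `X_train` and `y_train` are assumed to hold the ranked features and forward scores through March 2009, and `X_test` the remainder):

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative fit: 100 shallow trees, trained on the March 1998 - March 2009 sample.
forest = RandomForestRegressor(
    n_estimators=100,   # number of trees in the ensemble
    max_depth=2,        # shallow trees to limit overfitting
    random_state=0,
)
forest.fit(X_train, y_train)

# Out-of-sample score forecasts, used to rank securities each month.
predicted_scores = forest.predict(X_test)
```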

The results of the exercise are plotted below.

Source: Sharadar Fundamentals.  Calculations by Newfound Research. 

The performance does appear to provide defensive properties both in- and out-of-sample, with meaningful returns generated in 2000-2002, 2008, Q3 and Q4 of 2011, June 2015 through June 2016, and Q4 2018.

We can see that the allocations also express a number of static sector concentrations (e.g. Consumer Defensive) as well as some cyclical changes (e.g. Financials pre- and post-2007).

We can also gain insight into how the portfolio composition changes by looking at the weighted characteristic scores of the long leg of the portfolio over time.

Source: Sharadar Fundamentals.  Calculations by Newfound Research. 

It is important to remember that characteristics are cross-sectionally ranked across stocks.  For some characteristics, a higher rank is considered better (e.g. a higher earnings-to-price is considered cheaper), whereas for others a lower rank is better (e.g. lower volatility implies less risk).

We can see that some characteristics are static tilts: higher market capitalization, positive operating cash flow, positive net income, and lower risk characteristics.  Other characteristics are more dynamic.  By 12/2008, the portfolio has tilted heavily towards high momentum stocks.  A year later, the portfolio has tilted heavily towards low momentum stocks.

What is somewhat difficult to disentangle is whether these static and dynamic effects are due to the non-linear model we have developed, or whether the dynamic tilts simply follow from the static ones.  For example, if we only applied a low volatility tilt, is it possible that the momentum tilts would emerge naturally?

Unfortunately, the answer appears to be the latter.  If we plot a long/short portfolio that goes long the bottom quintile of stocks ranked on realized 1-year volatility and short the broad market, we see a very familiar equity curve.
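
For reference, the low volatility comparison sort is straightforward (a sketch, assuming a DataFrame of daily returns with a datetime index):

```python
import numpy as np

# Trailing 1-year realized volatility, annualized.
realized_vol = daily_returns.rolling(252).std() * np.sqrt(252)

# At each month-end, flag the lowest-volatility quintile; these securities form
# the (cap-weighted) long leg, with the broad market as the short leg.
month_end_vol = realized_vol.resample("M").last()
in_low_vol_quintile = month_end_vol.rank(axis=1, pct=True) <= 0.2
```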

Source: Sharadar Fundamentals.  Calculations by Newfound Research.  Returns are hypothetical and backtested.  Returns are gross of all fees including, but not limited to, management fees, transaction fees, and taxes.  Returns assume the reinvestment of all distributions.

It would appear that the random forest model effectively identified the benefits of low volatility securities.  And while out-of-sample performance does appear to provide more ample defense during 2011, 2015-2016, and 2018 than the low volatility tilt, it also has significantly greater performance drag.

Gradient Boosting

One potential improvement we might consider is to apply a gradient boosting model.  Rather than simply building our decision trees independently in parallel, we can build them sequentially such that each tree is built on a modified version of the original data set (e.g. increasing the weights of those data points that were harder to classify and decreasing the weights on those that were easier).

Rather than just generalizing to a low-volatility proxy, gradient boosting may allow our decision tree process to pick up on greater subtleties and conditional relationships in the data.  For comparison purposes, we’ll assume the same maximum tree depth and number of trees as in the random forest method.
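
As an illustrative sketch (again using scikit-learn as a stand-in and reusing the assumed `X_train`/`y_train` split from before), the comparison fit and the feature importances referenced below might look like:

```python
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(
    n_estimators=100,   # same number of trees as the random forest
    max_depth=2,        # same maximum tree depth
    random_state=0,
)
gbm.fit(X_train, y_train)

# Rank features by the model's importance scores for the comparison discussed below.
importances = sorted(zip(X_train.columns, gbm.feature_importances_),
                     key=lambda kv: kv[1], reverse=True)
```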

In initially evaluating the importance of features, it does appear that low volatility remains a critical factor, but other characteristics – such as momentum, free cash flow yield, and payout ratio – are close seconds.  This may be a hint that gradient boosting was able to identify more subtle relationships.

Unfortunately, in evaluating the sector characteristics over time, we see a very similar pattern, though sectors like Technology do receive a meaningfully higher allocation with this methodology than with the random forest approach.

Source: Sharadar Fundamentals.  Calculations by Newfound Research. 

If we compare long/short portfolios, we find little meaningful difference to our past results.  Our model simply seems to identify a (historically less effective) low volatility model.

Source: Sharadar Fundamentals.  Calculations by Newfound Research.  Returns are hypothetical and backtested.  Returns are gross of all fees including, but not limited to, management fees, transaction fees, and taxes.  Returns assume the reinvestment of all distributions.

Re-Defining Defensiveness

When we set out on this problem, we made a key decision to define a stock’s defensiveness by how closely it is able to replicate the payoff of a call option on the S&P 500.  What if we had elected another definition, though?  For example, we could define defensive stocks as those that minimize the depth and frequency of drawdowns using a measure like the Ulcer Index.

Below we replicate the above tests but use forward 12-month Ulcer Index as our target score (or, more precisely, a security’s forward 12-month cross-sectional Ulcer Index rank).
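
A minimal sketch of this target is below, assuming a DataFrame of daily prices with one column per security:

```python
import numpy as np

def ulcer_index(price_path):
    """Root-mean-square percentage drawdown from the running high within the window."""
    running_max = np.maximum.accumulate(price_path)
    drawdown_pct = 100 * (price_path / running_max - 1)
    return np.sqrt(np.mean(drawdown_pct ** 2))

window = 252  # roughly 12 months of trading days
trailing_ui = prices.rolling(window).apply(ulcer_index, raw=True)
forward_ui = trailing_ui.shift(-window)          # value at t covers the next ~12 months

# The modelled target is the cross-sectional rank of this forward score.
forward_ui_rank = forward_ui.rank(axis=1, pct=True)
```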

We again begin by constructing an index that has perfect foresight, buying a market-capitalization weighted portfolio of securities that rank in the lowest quintile of forward 12-month Ulcer Index.  We see a very different payoff profile than before, with strong performance exhibited in both bull and bear markets.

By focusing on forward 12-month scores rather than 3-month scores, we also see a far steadier sector allocation profile.  Interestingly, we still see meaningful sector tilts, with sectors like Technology, Financials, and Consumer Defensive coming in and out of favor over time.

Source: Sharadar Fundamentals.  Calculations by Newfound Research.  Returns are hypothetical and backtested.  Returns are gross of all fees including, but not limited to, management fees, transaction fees, and taxes.  Returns assume the reinvestment of all distributions.

We again use a gradient boosted tree model to try to model our target scores.  We find that five of the top six most important features are price-return related, either measuring return or risk.

Despite the increased emphasis on momentum, the resulting long/short index still echoes a naïve low-volatility sort.  This is likely because negative momentum and high volatility have become reasonably correlated proxies for one another in recent years.

While returns appear improved from prior attempts, the out-of-sample performance (March 2009 and onward) is almost identical to that of the low-volatility long/short.

Source: Sharadar Fundamentals.  Calculations by Newfound Research.  Returns are hypothetical and backtested.  Returns are gross of all fees including, but not limited to, management fees, transaction fees, and taxes.  Returns assume the reinvestment of all distributions.

Conclusion

In this research note we sought to apply machine learning techniques to factor portfolio construction.  Our goal was to exploit the ability of machine learning models to model non-linear relationships, hoping to come up with a more nuanced definition of a defensive equity portfolio.

In our first test, we defined a security’s defensiveness by how closely it was able to replicate the payoff of a call option on the S&P 500 over rolling 63-day (approximately 3-month) periods.  If the market was up, we wanted to pick stocks that closely matched the market’s performance; if the market was down, we wanted to pick stocks that minimized drawdown.

After engineering a set of features to capture both company and stock dynamics, we first turned to a random forest model.  We chose this model as the decision tree structure would allow us to capture conditional feature dynamics.  By focusing on generating a large number of shallow trees, we aimed to avoid overfitting while still reducing overall model variance.

Training the model on data from 1998-2009, we found that the results strongly favored companies exhibiting positive operating cash flow, positive earnings, and low realized risk characteristics (e.g. volatility and beta).  Unfortunately, the model did not appear to provide any meaningful advantage versus a simple linear sort on volatility.

We then turned to gradient boosting.  This approach builds trees in sequence such that each tree seeks to improve upon the last.  We hoped that such an approach would allow the model to develop more nuance than simply scoring on realized volatility.

Unfortunately, the results remained largely the same.

Finally, we decided to change our definition of defensiveness by focusing on the depth and frequency of drawdowns with the Ulcer Index.  Again, after re-applying the gradient boosted random forest model, we found little difference in realized results versus a simple sort on volatility (especially out-of-sample).

One explanation for these similar results may be that our objective function is highly correlated with volatility measures.  For example, if stocks follow a geometric Brownian motion process, those with higher levels of volatility should have deeper drawdowns.  And if the best predictor of future realized volatility is past realized volatility, then it is no huge surprise that the models ultimately fell back towards a naïve volatility sort.
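
A quick Monte Carlo sketch illustrates the point (parameter values are purely illustrative): under geometric Brownian motion with a common drift, higher volatility translates directly into deeper average maximum drawdowns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_days, mu = 5_000, 252, 0.05

for sigma in (0.10, 0.20, 0.40):
    dt = 1.0 / n_days
    # Daily log-return shocks under GBM with annualized drift mu and volatility sigma.
    shocks = rng.normal((mu - 0.5 * sigma**2) * dt, sigma * np.sqrt(dt), (n_paths, n_days))
    paths = np.exp(np.cumsum(shocks, axis=1))                     # price paths, P_0 = 1
    max_dd = (1 - paths / np.maximum.accumulate(paths, axis=1)).max(axis=1)
    print(f"sigma={sigma:.0%}  average max drawdown={max_dd.mean():.1%}")
```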

Interestingly, value, quality, and growth characteristics seemed largely ignored.  We see two potential reasons for this.

The first possibility is that they were simply subsumed by low volatility with respect to our objective.  If this were the case, however, we would see little feature importance placed upon them, but would still expect their weighted average characteristic scores within our portfolios to be tilted higher (or lower).  While this is true for select features (e.g. payout ratio), the tilts in others appear largely cyclical (e.g. earnings-to-price).  In fact, during the fallout of the dot-com bubble, weighted average value scores remained between 40 and 70.

The second reason is that the fundamental drivers behind each market sell-off are different.  Factors tied to company metrics (e.g. valuation, quality, or growth), therefore, may be ill-suited to navigate different types of sell-offs.  For example, value was the natural antithesis to the speculative dot-com bubble.  However, during the recent COVID-19 crisis, it has been the already richly priced technology stocks that have fared the best.  Factors based upon security characteristics (e.g. volatility, returns, or volume) may be better suited to dynamically adjust to market changes.

While our results were rather lackluster, we should acknowledge that we have really only scratched the surface of machine learning techniques.  Furthermore, our results are intrinsically linked to how we’ve defined our problem and the features we engineered.  A more thoughtful target score or a different set of features may lead to substantially different results.