Optimization algorithms come in many shapes and sizes, each designed to target a specific type of problem. Below we have provided a brief taxonomy of the major classifications of optimization algorithms. It is by no means a full or authoritative classification, just a brief introduction to the expansive world of optimization algorithms used by quantitative practitioners.
Note: This is a fairly technical post and understanding its every detail is not necessary for those interested in the intuition of behavior and risks of optimization procedures. As quantitative asset management has become more popular, firms have begun naming techniques in their literature with little to no explanation as to what they are. Our goal is to de-mystify these concepts to enable non-quantitative practitioners to make more educated decisions regarding the strategies that employ these concepts. And if you ask someone what sort of optimization method they are using and they have no idea, you should run, not walk, the other way…
There exists a whole series of optimization algorithms that fall into the “____ Programming” classification, including Convex programming, Linear programming, Integer programming, and Quadratic programming to name a few. These types of problems have canonical forms that require the minimization of some function (typically corresponding to the name of the optimization; e.g. Quadratic programming has a quadratic function) subject to some linear constraints. When the functional form of a problem is well known, it is re-written in the canonical form and solved with a standard set of optimization algorithms. The trick, of course, is both in knowing the functional form of your problem and then being able to translate it into the canonical form — many interesting problems cannot be translated into the standard forms.
Traditional mean-variance optimization is an example of a problem that can be translated into the form required by Quadratic programming and can be neatly solved. Once we leave the realm of well formulated problems, however, and begin exploring functions that cannot be neatly packaged as matrix multiplications, we have to begin exploring other optimization methods.
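As an illustrative sketch of that translation (notation is our own; $latex \Sigma$ is the covariance matrix, $latex \mu$ the vector of expected returns, and $latex \lambda$ a risk-aversion parameter), the classic mean-variance problem can be written in Quadratic programming's canonical form:

```latex
\begin{aligned}
\min_{w} \quad & \tfrac{1}{2}\, w^{\top} \Sigma\, w \;-\; \lambda\, \mu^{\top} w \\
\text{s.t.} \quad & \mathbf{1}^{\top} w = 1 \\
& w \geq 0
\end{aligned}
```

The objective is quadratic in the portfolio weights $latex w$ and the constraints (fully invested and, optionally, long-only) are linear, which is exactly the shape a standard QP solver expects.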
Most academic optimization problems in portfolio management fall into this category, and if someone is using this form, you can feel certain that their result is the optimal value — assuming, of course, they correctly translated their problem into the canonical form.
We can envision gradient algorithms through the marble example from the first part of this series: gravity and friction fight, drawing us downhill until we settle in a stable location. This classification of algorithm is best used with smooth, convex functions without many local minima. While these restrictions are unrealistic for most data-driven problems, which tend to have coarse and complex surfaces, the process is well suited for deriving parameters in theoretical frameworks where we know the functional form (such as Black-Scholes). By placing such restrictions on the form, we know that convergence will be very fast. Below we plot the function $latex f(x,y)=x^2+y^2$, select a random starting point, and use a gradient descent method to find the global minimum at (0,0).
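A minimal sketch of that descent in Python (the random start, step size, and iteration count are our own illustrative choices, not the parameters behind the plot):

```python
import numpy as np

def f(p):
    # f(x, y) = x^2 + y^2: smooth, convex, one global minimum at (0, 0)
    return p[0] ** 2 + p[1] ** 2

def grad_f(p):
    # Analytic gradient: (2x, 2y)
    return np.array([2.0 * p[0], 2.0 * p[1]])

rng = np.random.default_rng(42)
p = rng.uniform(-5, 5, size=2)   # random starting point
step = 0.1                       # fixed step size

for _ in range(200):
    p = p - step * grad_f(p)     # move downhill along the negative gradient

print(p)  # essentially (0, 0), the global minimum
```

Because the surface is smooth and convex, each step shrinks the distance to the origin by a constant factor, which is the fast convergence referenced above.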
Complexities in designing ascent / descent algorithms arise in how we choose to estimate which direction to move in and how far to move with each step. Numerical inaccuracies in estimation and incorrect algorithm parameterizations mean that not all smooth functions are easy to optimize. Consider the devilishly difficult Rosenbrock valley, designed to stump this classification of optimization algorithms. Because the valley is so flat, our traditional descent algorithms have a difficult time converging towards the global minimum if the steps they take are too big. In fact, if the algorithm lands on a point that has too shallow a gradient, the next step may send it careening away. The following two videos show what can happen in your optimization if you happen to parameterize it incorrectly:
In the first video we see that while convergence towards (0,0) at the end is very slow, the descent is controlled because we move in small steps.
In this second video, we see that with a larger step size, the algorithm spirals out of control and never even approaches the global minimum. Same algorithm, different step-size.
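The same failure mode can be reproduced numerically. Below is a sketch of plain gradient descent on the Rosenbrock valley (global minimum at (1, 1)); the step sizes and starting point are our own illustrative choices, not those used in the videos:

```python
import numpy as np

def rosenbrock(p):
    # f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, global minimum of 0 at (1, 1)
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def descend(step, iters=50_000, start=(-1.5, 1.5)):
    p = np.array(start, dtype=float)
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(iters):
            p = p - step * rosenbrock_grad(p)
            if not np.all(np.isfinite(p)):
                break  # the iterate overflowed: the algorithm careened away
    return p

small = descend(step=1e-4)  # slow, controlled crawl along the flat valley
large = descend(step=1e-2)  # overshoots the narrow valley and diverges

print(rosenbrock(small), large)
```

With the small step the loss falls steadily but slowly; with the larger step the very first moves fling the iterate out of the valley and the coordinates blow up to non-finite values.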
These sorts of algorithms are heavily used for smooth, non-linear functions. Excel’s solver is, by default, powered by a gradient method. If someone tells you they are using a gradient method and you have reason to believe that the function they are optimizing over will not be smooth, ask them how they are guaranteeing convergence to the optimal location.
Stochastic is a fancy word for random, and that’s really the whole principle behind these methods. Randomness is injected into the search procedure to accelerate progress, helping to overcome local minima and noise from modeling error. This process allows us to rapidly learn about our search space without making any assumptions. In a strange way, randomness allows us to learn about something that is deterministic.
There are a variety of techniques that fall under stochastic optimization, but stochastic techniques can also be applied to other models to create hybrid solutions for complex problems. For example, it is fairly common to add a stochastic element to gradient methods (randomly resetting the particle’s location, randomly perturbing the direction the particle is moving in, or dropping multiple particles at random locations) to help overcome the risk of local minima.
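The random-restart flavor of that hybrid can be sketched as follows, assuming a 1-dimensional Rastrigin-style surface (the function, step size, and restart count are our own illustrative choices):

```python
import numpy as np

def f(x):
    # 1-D Rastrigin-style surface: many local minima, global minimum at x = 0
    return x ** 2 + 10 * (1 - np.cos(2 * np.pi * x))

def grad(x):
    return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
best_x, best_f = None, np.inf

# Hybrid: plain gradient descent, but restarted from random points.
# Each restart is a fresh chance to land in the global basin.
for _ in range(25):
    x = rng.uniform(-5.12, 5.12)   # random reset of the particle's location
    for _ in range(500):
        x = x - 0.001 * grad(x)    # deterministic descent within the basin
    if f(x) < best_f:
        best_x, best_f = x, f(x)

print(best_x, best_f)
```

Any single descent gets trapped in whichever valley it starts in; the randomness of the restarts is what gives the procedure a chance to find the deepest one.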
Swarm optimization is a subset of stochastic optimization, but it comes in such a variety of flavors that we thought it deserved its own category. One of our favorite types of stochastic optimization at Newfound is particle swarm optimization, a population-based swarm intelligence methodology. The process works by simulating a series of “particles,” clustered into “swarms.” Each particle begins by moving in a random direction, but over time is influenced by its own personal best fitness (memory) and the best fitness of its swarm (information sharing). This allows us to search complex terrain: individual particles may end up in sub-optimal solutions, but we still have a good chance at finding the global minimum. Consider the Rastrigin function, a surface whose many local minima would stump gradient descent methods. By applying a bit of randomness to the problem…
Voilà! While many individual particles, and even whole swarms, became stuck in local minima, by having enough particles initialized in random locations we are able to find the global minimum. This methodology is useful because it easily extends to multiple dimensions (remember, optimizing a portfolio with 30 assets is a 30-dimensional problem). Below we perform the same optimization, but for a 3-dimensional Rastrigin function. Plotting the surface would require 4 dimensions, so we simply plot the location of the particles and color them based on their current fitness (the darker the particle, the better its fitness relative to other particles). We can see that they again converge to the global minimum location (0,0,0).
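A compact particle swarm sketch on the 2-dimensional Rastrigin function (the swarm size, inertia, and pull coefficients below are common textbook choices, not the parameters behind our videos):

```python
import numpy as np

def rastrigin(p):
    # Global minimum of 0 at the origin; many local minima elsewhere
    return 10 * p.shape[-1] + np.sum(p ** 2 - 10 * np.cos(2 * np.pi * p), axis=-1)

rng = np.random.default_rng(7)
n_particles, dims, iters = 60, 2, 400
w, c1, c2 = 0.72, 1.49, 1.49        # inertia, memory pull, information-sharing pull

pos = rng.uniform(-5.12, 5.12, (n_particles, dims))
vel = np.zeros((n_particles, dims))
pbest = pos.copy()                   # each particle's personal best (memory)
pbest_fit = rastrigin(pbest)
gbest = pbest[np.argmin(pbest_fit)]  # best position seen by the swarm

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dims))
    # Velocity blends momentum, pull toward personal memory, and pull toward
    # the swarm's shared best: exploration plus exploitation.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    fit = rastrigin(pos)
    improved = fit < pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[np.argmin(pbest_fit)]

print(gbest, rastrigin(gbest))
```

The random coefficients keep individual particles exploring even as the shared best position pulls the swarm toward the most promising basin.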
Swarm intelligence algorithms are a very popular form of stochastic optimization, and many are directly inspired by processes found in nature. There have been algorithms developed based on the behavior of ant colonies, bee hives, immune systems, bats, cuckoos, fireflies, glowworms, and krill, to name a few. These natural processes are highly influential because they tend to strike a balance between continued global exploration and local exploitation very well. In our particle swarm optimizations above, the process continues to converge towards the global minimum (local exploitation) over time while several particles continue to explore (global exploration).
Despite their stochastic nature, swarm intelligence methods can still suffer from premature convergence to sub-optimal solutions, especially since we tend to utilize them on less-smooth functional surfaces. As with any tool, we have to intelligently select and parameterize the swarm intelligence method that is correct for our application.
Evolutionary algorithms are another form of stochastic optimization, inspired by biological evolution. The principle behind these methods is to use randomness, in the form of mating behavior and mutations, to evolve our solution. Each particle is given a genetic makeup which is used in evaluating its fitness. Those particles with greater fitness will reproduce more, while those with lower fitness will die off and be replaced. Reproduction uses cross-over methods, incorporating the genes of both parents, but also includes a chance of mutation, allowing the new generation to further evolve.
In the video below, we show an attempt at solving the Rastrigin function using a genetic algorithm. In this process, the genetic makeup for each particle is its (x,y) location. At every step, we keep the top 65% of particles, create a new 25% from mating, and randomly spawn the last 10%. When two particles mate, the child particle is given some combination of its parents’ genetic makeup. We see the results below.
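The recipe above can be sketched as follows (the population size, generation count, and blend-style cross-over are our own assumptions; the exact implementation behind the video may differ):

```python
import numpy as np

def rastrigin(p):
    # Global minimum of 0 at the origin
    return 10 * p.shape[-1] + np.sum(p ** 2 - 10 * np.cos(2 * np.pi * p), axis=-1)

rng = np.random.default_rng(1)
n = 200
pop = rng.uniform(-5.12, 5.12, (n, 2))          # each genome is an (x, y) location
n_keep, n_mate = int(0.65 * n), int(0.25 * n)   # 65% survive, 25% from mating
n_new = n - n_keep - n_mate                     # last 10% spawned at random

for _ in range(150):
    pop = pop[np.argsort(rastrigin(pop))]        # rank by fitness (lower is better)
    elite = pop[:n_keep]                         # top 65% survive unchanged
    # 25% from mating: each child blends the genes of two surviving parents
    moms = elite[rng.integers(0, n_keep, n_mate)]
    dads = elite[rng.integers(0, n_keep, n_mate)]
    mix = rng.random((n_mate, 2))                # per-gene cross-over weights
    children = mix * moms + (1 - mix) * dads
    spawned = rng.uniform(-5.12, 5.12, (n_new, 2))
    pop = np.vstack([elite, children, spawned])

best = pop[np.argmin(rastrigin(pop))]
print(best, rastrigin(best))
```

Because children only interpolate between their parents, the population can refine a good region but struggles to step outside it, which previews the critique below.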
What we see is that despite converging to the right solution, this problem is poorly suited for this type of algorithm. Why?
- There are only two pieces of genetic information, making cross-over fairly uninteresting
- The search space is continuous but the genetic algorithm can only move discretely. In other words, even if we come “close” to the correct answer in continuous space, we don’t have a mechanism for closing the remaining distance
In fact, the only reason this problem converged is because the surface is symmetric and the solution is at (0,0) — cross-over eventually got it there. We wouldn’t be so lucky in a real-world problem.
We’ve seen a dramatic rise in interest in evolutionary methods as of late. Systematic investors and traders are using them to:
- Determine the weights on their decision variables
- Select the parameters for their models
- Construct the investment or trading formula itself
The danger comes when these methods are taken to the extreme, as we are seeing many firms do, with evolution allowed to completely determine the process. Many firms have begun distilling dozens, if not hundreds, of decision variables into a genetic soup and using backtests to determine the optimal functional form and parameters for their strategy. This is nothing but masked data mining, a dangerous technique for long-term portfolio robustness, and something we will explore in the next part of our series.