
Modern backtesting with integrity

Machine learning offers powerful tools for backtesting trading strategies. However, its computational power and convenience can also be corrosive for financial investment, because of its tendency to find temporary patterns when the data samples available for cross-validation are limited. Machine learning produces valid backtests only when applied with sound principles. These should include [1] formulating a logical economic theory up front, [2] choosing sample data up front, [3] keeping the model simple and intuitive, [4] limiting try-outs when testing ideas, [5] accepting model decay over time rather than ‘tweaking’ specifications, and [6] remaining realistic about reliability. The most important principle of all is integrity: aiming to produce good research rather than good backtests, and to communicate statistical findings honestly rather than selling them.

Arnott, Robert, Campbell Harvey and Harry Markowitz (2018), “A Backtesting Protocol in the Era of Machine Learning” (November 21, 2018).

The post ties in with SRSV’s summary lecture on macro information efficiency.
Below are quotes from the paper. Emphasis and italics have been added.

The importance of machine learning (and of knowing its pitfalls)

“Today, both data and computing resources are cheap, and in the era of machine learning, researchers no longer even need to specify a hypothesis—the algorithm will supposedly figure it out… We need to be careful in applying these tools…With large data, patterns will emerge purely by chance. One of the big advantages of machine learning is that it is hardwired to try to avoid overfitting by constantly cross-validating discovered patterns…[However] this advantage performs well [only] in the presence of a large amount of data…Machine learning techniques have been widely deployed for uses ranging from detecting consumer preferences to autonomous vehicles, all situations that involve big data. The large amount of data allows for multiple layers of cross-validation, which minimizes the risk of overfitting. We are not so lucky in finance.”

“In investment finance, apart from tick data, the data are much more limited in scope. Today, we have about 55 years of high-quality equity data (or less than 700 monthly observations)… Macroeconomic data, generally available on a monthly or quarterly basis, are largely offside for most machine learning applications. Over the post-1960 period, just over 200 quarterly observations and fewer than 700 monthly observations exist…This tiny sample is far too small for most machine learning applications, and impossibly small for advanced approaches such as deep learning.”

“Machine learning implementations would carefully cross-validate the data by training the algorithm on part of the data and then validating on another part of the data….However, in a simple implementation [with limited available time-series data]…it is possible that a false strategy can work in the cross-validated sample…Tuning 10 different hyper-parameters using k-fold cross-validation is a terrible idea if you are trying to predict returns with 50 years of data…It might be okay if you had millions of years of data.”
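The arithmetic behind this warning is easy to reproduce. The following is a minimal sketch (pure Python; hypothetical pure-noise returns, with random long/short signals standing in for hyper-parameter configurations) showing how selecting the best of many configurations by cross-validated Sharpe ratio produces an apparently attractive strategy from data that, by construction, contains no signal at all:

```python
import random
import statistics

random.seed(42)

N = 600          # ~50 years of monthly returns
K = 5            # cross-validation folds
N_CONFIGS = 200  # hyper-parameter configurations tried

# Pure-noise "returns": by construction there is nothing to find.
returns = [random.gauss(0.0, 0.04) for _ in range(N)]

def cv_sharpe(signal):
    """Mean annualised Sharpe ratio of a long/short rule across K folds."""
    fold = N // K
    sharpes = []
    for k in range(K):
        pnl = [signal[t] * returns[t] for t in range(k * fold, (k + 1) * fold)]
        mu, sd = statistics.mean(pnl), statistics.stdev(pnl)
        sharpes.append(mu / sd * 12 ** 0.5)  # annualise monthly Sharpe
    return statistics.mean(sharpes)

# Each "configuration" is just an independent random long/short signal.
best = max(cv_sharpe([random.choice((-1, 1)) for _ in range(N)])
           for _ in range(N_CONFIGS))
print(f"Best cross-validated Sharpe over {N_CONFIGS} noise configs: {best:.2f}")
```

Even though every configuration is noise, the winner of the selection step shows a clearly positive cross-validated Sharpe ratio; with millions of observations per fold, the same selection would find essentially nothing.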

“Machine learning, and particularly unsupervised machine learning, does not impose economic principles. If it works, it works in retrospect, but not necessarily in the future.”

The importance of economic theory

“When data are limited, economic foundations become more important…Our data are limited. We cannot flip a switch at a particle accelerator and create trillions of fresh (not simulated) out-of-sample data. But we are lucky in that finance theory can help us filter out ideas that lack an ex ante economic basis.”

“The hypothesis provides a discipline that reduces the chance of overfitting. Importantly, the hypothesis needs to have a logical foundation.”

“Without an economic foundation for the model, the researcher maximizes the chance that the model will not work when taken into live trading. This is one of the drawbacks of machine learning.”

“An economic foundation should exist first…It is almost always a mistake to create an economic story—a rationale to justify the findings—after the data mining has occurred. The story is often flimsy, and if the data mining had delivered the opposite result, the after-the-fact story might easily have been the opposite.”

The importance of choosing data samples up front

“The training sample needs to be justified in advance. The sample should never change after the research begins. For example, suppose the model ‘works’ if the sample begins in 1970, but does not ‘work’ if the sample begins in 1960—in such a case, the model does not work.”

“Cleaning the data before employing machine learning techniques in the development of investment models is crucial. Interestingly, some valuable data science tools have been developed to check data integrity…Flawed data can lead researchers astray…A nonlinearity might simply be a bad data point.”

“Outliers are influential observations for the model. Inclusion or exclusion of influential observations can make or break the model. Ideally, a solid economic case should be made for exclusion—before the model is estimated. In general, no influential observations should be deleted. Assuming the observation is based on valid data, the model should explain all data, not just a select number of observations.”

“Winsorized data are truncated at a certain threshold (e.g., truncating outliers to the 1% or 2% tails) rather than deleted. Winsorization is a useful tool, because outliers can have an outsize influence on any model. But, the choice to winsorize, and at which level, should be decided before constructing the model.”
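As a concrete illustration, here is a minimal pure-Python winsorization sketch; the percentile indexing is deliberately simple, and production code would typically use a library routine such as `scipy.stats.mstats.winsorize` instead:

```python
def winsorize(xs, tail=0.01):
    """Clamp values beyond the `tail` quantiles to the quantile values.

    Unlike deletion, every observation is kept; only the extremes are
    pulled in to the chosen threshold (simple sorted-index percentiles).
    """
    s = sorted(xs)
    lo = s[int(tail * (len(s) - 1))]
    hi = s[int((1.0 - tail) * (len(s) - 1))]
    return [min(max(x, lo), hi) for x in xs]

raw = list(range(100))            # toy data with a spread of values
w = winsorize(raw, tail=0.02)
print(min(w), max(w), len(w))     # extremes clamped, nothing deleted
```

The key discipline from the quote is that `tail` is fixed before the model is estimated, not tuned afterwards to make the backtest look better.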

The importance of simplicity

“Multidimensionality [using many predictive variables] works against the viability of machine learning applications; the reason is related to the limitations of data. Every new piece of information increases dimensionality and requires more data…As each possible predictor variable is added, more data are required, but history is limited and new data cannot be created or simulated.”

“Current machine learning tools are designed to minimize the in-sample overfitting by extensive use of cross-validation. Nevertheless, these tools may add complexity (which is potentially non-intuitive) which leads to disappointing performance in the true out-of-sample live trading. The greater the complexity, and the greater the reliance on non-intuitive relationships, the greater the likely slippage between backtest simulations and live results.”

The importance of limiting try-outs

“Keeping track of the number of strategies tried is crucial, as is measuring their correlations.”

“Given 20 randomly selected strategies, one will likely exceed the two-sigma threshold (t-statistic of 2.0 or above) purely by chance. As a result, the t-statistic of 2.0 is not a meaningful benchmark if more than one strategy is tested.”
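The arithmetic behind this claim is direct. Assuming roughly a 5% chance per test of clearing the two-sigma threshold under the null, the chance that at least one of 20 independent noise strategies does so is:

```python
# Family-wise false-positive rate: probability that at least one of
# n independent noise strategies clears the two-sigma threshold,
# assuming ~5% per-test significance under the null.
p_single = 0.05
n_tried = 20
p_any = 1 - (1 - p_single) ** n_tried
print(f"P(at least one 'significant' fluke): {p_any:.0%}")  # ≈ 64%
```

So a t-statistic of 2.0 that survives 20 tries is closer to a coin flip than to 95% confidence.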

“A bigger penalty in terms of threshold is applied to strategies that are relatively uncorrelated. For example, if the 20 strategies tested had a near 1.0 correlation, then the process is equivalent to trying only one strategy.”
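One simple (and conservative) way to implement such a penalty is a Bonferroni-style correction, sketched here with Python's standard library: divide the significance level by the number of *effectively independent* strategies, which raises the t-threshold a strategy must clear. Twenty perfectly correlated strategies count as one effective test; twenty uncorrelated ones count as twenty.

```python
from statistics import NormalDist

# Bonferroni-style sketch: required two-sided t-threshold to keep a 5%
# family-wise error rate, as a function of effectively independent tests.
alpha = 0.05
for n_eff in (1, 5, 20):
    t_needed = NormalDist().inv_cdf(1 - alpha / (2 * n_eff))
    print(f"{n_eff:2d} effective tests -> t threshold {t_needed:.2f}")
```

With 20 uncorrelated tries, the familiar 1.96 threshold rises to roughly 3.0; more refined corrections exist, but the direction of the penalty is the same.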

The importance of accepting strategy decay

“In physics, the Heisenberg Uncertainty Principle states that we cannot know a particle’s position and momentum simultaneously with precision. The more accurately we know one characteristic, the less accurately we can know the other. A similar principle can apply in finance. As we move from the study of past data into the live application of research, market inefficiencies are hardly static… They change with time and are often easily arbitraged away…The cross-validated relations of the past may seem powerful for reasons that no longer apply or may dissipate merely because we are now aware of them and are trading based on them.”

“Refrain from tweaking the model…Suppose the model is running, but not doing as well as expected. Such a case should not be a surprise because the backtest of the model is likely overfit to some degree. It may be tempting to tweak the model, especially as a means to improve its fit in recent, now in-sample data. While these modifications are a natural response to failure, we should be fully aware that they will generally lead to further overfitting of the model, and may lead to even worse live-trading performance.”

The importance of being realistic

“Researchers have lived through the hold-out sample and thus understand the history, are knowledgeable about when markets rose and fell, and associate leading variables with past experience. As such, no true out-of-sample data exists; the only true out of sample is the live trading experience.”

“Suppose a model is successful in the in-sample period, but fails out of sample. The researcher observes that the model fails for a particular reason. The researcher modifies the initial model so it then works both in sample and out of sample. This is no longer an out-of-sample test. It is overfitting.”

“Almost all of the investment research published in academic finance ignores transactions costs. Even with modest transactions costs, the statistical ‘significance’ of many published anomalies essentially vanishes. Any research on historical data needs to take transactions costs into account.”
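A back-of-the-envelope sketch (all numbers hypothetical) shows how this happens: charging a modest round-trip cost against turnover can push an apparently significant t-statistic well below the conventional threshold.

```python
# Hypothetical numbers: a 'significant' gross backtest whose t-statistic
# collapses once modest trading costs are charged against turnover.
mu_gross = 0.004    # 0.4% average monthly gross alpha (assumed)
sigma = 0.02        # 2% monthly volatility of the strategy (assumed)
cost = 0.003        # 30 bp round-trip trading cost (assumed)
turnover = 1.0      # portfolio fully turned over each month (assumed)
n_months = 600      # ~50 years of monthly data

def t_stat(mu):
    """t-statistic of a mean monthly return over n_months observations."""
    return mu / sigma * n_months ** 0.5

print(f"gross t = {t_stat(mu_gross):.2f}")                    # ≈ 4.9
print(f"net   t = {t_stat(mu_gross - cost * turnover):.2f}")  # ≈ 1.2
```

The gross backtest looks highly significant; after costs, the same strategy would not clear even the (already too lenient) two-sigma bar.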

The importance of integrity

“The investment industry rewards research that produces backtests with winning results. If we do this in actual asset management, we create a toxic culture that institutionalizes incentives to hack the data, to produce a seemingly good strategy. Researchers should be rewarded for good science, not good results. A healthy culture will also set the expectation that most experiments will fail to uncover a positive result.”

“The most common mistake is being seduced by the data into thinking a model is better than it is. This mistake has a behavioral underpinning. Researchers want their model to work. They seek evidence to support their hypothesis—and all of the rewards that come with it. They believe if they work hard enough, they will find the golden ticket. This induces a type of ‘selection problem’ in which the models that make it through are likely to be a result of a biased selection process.”

“Models with strong results will be tested, modified, and retested, while models with poor results will be quickly expunged. This creates two problems. One is that some good models will fail in the test period, perhaps for reasons unique to the data set, and will be forgotten. The other problem is that researchers seek a narrative to justify a bad model that works well in the test period, again perhaps for reasons irrelevant to the future efficacy of the model. These outcomes are false negatives and false positives, respectively. Even more common than a false positive is an exaggerated positive, an outcome that seems stronger, perhaps much stronger, than it is likely to be in the future. In other areas of science, this phenomenon is sometimes called the ‘winner’s curse’…Once published, models rarely work as well as in the backtest.”

