Is out-of-sample testing of forecasting models a myth?

When working with forecasting models, a well-known observation is that in-sample performance is usually better, often much better, than out-of-sample performance. That is, a model generally produces better forecasts over the data that it was constructed on than over new data. Researchers usually attribute this result to the process of data mining, which leads to in-sample overfitting. As many different variables and model specifications are tried and discarded, the final model is likely capturing the idiosyncratic features of the in-sample data. Another independent cause of poor out-of-sample performance is the presence of structural instability in the data.

A popular strategy to deal with the problem of overfitting, especially in machine learning, is to use cross-validation. The data are split, once or several times, and the model is estimated using part of the data and validated using the remaining data. Then we select the model with the best performance over the validation sample(s). Cross-validation works when the observations are independent and identically distributed. This property is likely to hold in many engineering applications such as recognizing objects in images. When modeling financial and economic time series data, however, the observations are generally dependent and this property does not hold.
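
As a concrete (and deliberately generic) illustration of that splitting procedure, here is a minimal scikit-learn sketch; the random features, the target, and the choice of ridge regression are placeholders of my own, not tied to any of the studies discussed below.

```python
# A minimal sketch of k-fold cross-validation (illustrative only).
# X, y and the ridge model are placeholders, not any paper's setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                       # hypothetical predictors
y = X @ rng.normal(size=5) + rng.normal(size=500)   # hypothetical target

# Standard k-fold CV: folds are treated as interchangeable, which
# implicitly assumes the observations are i.i.d.
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="neg_mean_squared_error")
print("CV mean squared error:", -scores.mean())

# For dependent time series data, the shuffled folds above are exactly what
# leaks information across time; a chronological splitter such as
# sklearn.model_selection.TimeSeriesSplit is the safer default.
```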

To make the discussion more concrete, consider the widely cited paper by Goyal and Welch (2008), which tests the ability of various financial ratios and indicators to predict the returns of the S&P 500 index over the next one to twelve months using a linear regression. They examine variables such as the dividend price ratio, earnings price ratio, default yield spread, and inflation, along with several others identified in the literature as having predictive power for stock returns. They focus on testing each variable separately, although they also run a ‘kitchen-sink’ multiple regression that includes all of the variables.

Goyal and Welch note that prior studies of stock market predictability typically relied on in-sample evidence. Instead, they use a procedure that splits the data into an initial estimation period and an out-of-sample period. They estimate the model and generate a forecast for, say, the first month in the out-of-sample period. Next they recursively expand the estimation period by one month, re-estimate the model, and generate another month-ahead forecast. By repeating these steps, they create a series of monthly out-of-sample forecasts. This procedure is sometimes called pseudo-out-of-sample model testing since the out-of-sample data are not, strictly speaking, new data.
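
A minimal sketch of that recursive, expanding-window scheme, assuming monthly data held in pandas Series named `returns` and `predictor`; the 20-year initial window and the plain-numpy regression are my own illustrative choices, not the paper's implementation.

```python
# Sketch of recursive (expanding-window) pseudo-out-of-sample forecasting.
# `returns` and `predictor` are assumed to be monthly pandas Series aligned
# on the same dates; the 240-month initial window is an arbitrary choice.
import numpy as np
import pandas as pd

def recursive_forecasts(returns: pd.Series, predictor: pd.Series,
                        initial_window: int = 240) -> pd.DataFrame:
    """One-month-ahead forecasts, re-estimating the regression each month
    on all data available up to that month."""
    y, x = returns.to_numpy(), predictor.to_numpy()
    rows = []
    for t in range(initial_window, len(y)):
        # Estimation sample: predictor at month s-1 paired with return at month s.
        X_est = np.column_stack([np.ones(t - 1), x[:t - 1]])
        beta, *_ = np.linalg.lstsq(X_est, y[1:t], rcond=None)
        model_fc = beta[0] + beta[1] * x[t - 1]   # forecast of the return at month t
        bench_fc = y[:t].mean()                   # historical-mean benchmark
        rows.append((returns.index[t], model_fc, bench_fc, y[t]))
    return (pd.DataFrame(rows, columns=["date", "model", "hist_mean", "actual"])
            .set_index("date"))
```

Each forecast uses only data available up to its date, which is what makes the exercise a real-time simulation rather than a test on genuinely new data.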

Using these out-of-sample tests, Goyal and Welch (2008) find very little evidence of predictive ability for most of the variables relative to a benchmark forecast of stock returns based on the historical mean. The paper was very influential and led finance researchers to employ similar out-of-sample tests in subsequent studies of asset return predictability.
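
One common way to summarize such a comparison (the exact statistics in the paper differ) is an out-of-sample R², which is positive only when the model's squared forecast errors are smaller than the benchmark's. A minimal sketch, reusing the hypothetical DataFrame produced by `recursive_forecasts` above:

```python
# Out-of-sample R^2 relative to the historical-mean benchmark (illustrative).
# `fc` is assumed to be the DataFrame returned by recursive_forecasts above.
import pandas as pd

def oos_r2(fc: pd.DataFrame) -> float:
    err_model = fc["actual"] - fc["model"]
    err_bench = fc["actual"] - fc["hist_mean"]
    return 1.0 - (err_model ** 2).sum() / (err_bench ** 2).sum()

# Values at or below zero mean the variable fails to beat the simple
# historical mean, which is what the paper reports for most predictors.
```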

A common misconception seems to be that these out-of-sample tests prevent overfitting. After all, a new forecasting variable or modeling technique needs to demonstrate predictive ability not only in-sample but also out-of-sample. However, it’s important to keep in mind that these are pseudo-out-of-sample tests. In practice, it is almost as easy to data mine an out-of-sample test as an in-sample test, a point made earlier by Inoue and Kilian (2004). As in Edge of Tomorrow, the Tom Cruise movie that opened this weekend, the researcher merely needs to hit the reset button, try another model specification, and keep hitting the reset button until she obtains a significant out-of-sample as well as in-sample result.
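
To see how easily that reset button can work, here is a crude simulation under assumptions of my own (the sample size, number of candidate predictors, and noise parameters are all arbitrary): returns are generated as pure noise, yet trying many candidate predictors and reporting only the best pseudo-out-of-sample result tends to produce an apparent success. It reuses `recursive_forecasts` and `oos_r2` from the sketches above.

```python
# Crude simulation of a specification search over pseudo-out-of-sample tests.
# Reuses recursive_forecasts and oos_r2 from the sketches above; all settings
# are arbitrary choices for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_months, n_candidates = 480, 50
dates = pd.period_range("1970-01", periods=n_months, freq="M")
returns = pd.Series(rng.normal(0.005, 0.04, size=n_months), index=dates)  # pure noise

# "Hit the reset button" n_candidates times: draw a new (useless) predictor,
# run the full pseudo-out-of-sample exercise, and keep only the best result.
best = max(
    oos_r2(recursive_forecasts(returns,
                               pd.Series(rng.normal(size=n_months), index=dates)))
    for _ in range(n_candidates)
)
print(f"Best pseudo-OOS R^2 among {n_candidates} uninformative predictors: {best:.3f}")
# Reporting only the winner can make a variable that contains no information
# look as if it had out-of-sample predictive ability.
```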

I’m not suggesting that researchers deliberately try to misstate the significance of their results. In practice, however, there are so many knobs and dials under the researcher’s discretion that the specification search can be quite subtle. Critically, this search can include repeated use of pseudo-out-of-sample tests. Coupled with the bias towards publishing positive results, this means investors need to be cautious in interpreting the findings of recent stock market forecasting studies, since showing strong out-of-sample performance has become standard procedure.

Furthermore, a moment’s reflection reveals that the same criticism of repeated testing of a given data set applies to papers in machine learning that use cross-validation. Indeed, The Economist reported in October 2013 that:

According to some estimates, three-quarters of published scientific papers in the field of machine learning are bunk because of the ‘overfitting’, says Sandy Pentland, a computer scientist at the Massachusetts Institute of Technology.

Data could be made genuinely out-of-sample by withholding some observations, as is done in machine learning competitions such as those organized by Kaggle. Researchers could also retest stock forecasting models after new data become available over time. This approach could be feasible for models used in high-frequency trading, but would require a wait of several years for those used in low-frequency trading.

So are (pseudo) out-of-sample tests completely useless? Goyal and Welch motivate their use of out-of-sample tests by asking whether investors could have used the various variables to predict the stock market in real time. The tests address this question since they give each model only the data that would have been available on each date. They also emphasize the role of out-of-sample tests as a diagnostic for structural instability in the data, complementary to in-sample tests. Finally, the tests do succeed in detecting the overfitting that comes from the kitchen-sink regression, which has the best in-sample fit (not surprisingly) but terrible out-of-sample performance. The bottom line is that pseudo-out-of-sample tests can provide valuable information, but they are no panacea for data mining if misused.

References

Diebold, F. 2013. Comparing Predictive Accuracy, Twenty Years Later: A Personal Perspective on the Use and Abuse of Diebold-Mariano Tests. Working paper, University of Pennsylvania.

Inoue, A. and L. Kilian. 2004. In-Sample or Out-of-Sample Tests of Predictability: Which One Should We Use? Econometric Reviews, 23, 371–402.

Goyal, A. and I. Welch. 2008. A Comprehensive Look at the Empirical Performance of Equity Premium Prediction. Review of Financial Studies, 21, 1455–1508.

Unreliable Research: Trouble at the Lab. The Economist, 19 October 2013. Retrieved from http://www.economist.com.
