Big Data and Economics

Lorie and Fisher. Big Data circa 1960.

For much of its history as a discipline, economics has been trapped in a Small Data paradigm. Macroeconomists analyzed output and inflation using annual or quarterly data spanning several decades at best. Microeconomic datasets varied more widely in size, but even large cross-sectional surveys were rarely big by today’s standards. While the econometric methods developed were often highly sophisticated, they generally focused more on estimating structural equations and testing economic theories than on prediction. Searching for patterns outside a theoretical framework was largely dismissed as data mining.

Thanks to the efforts of Jim Lorie and Larry Fisher in developing the CRSP database at Chicago in the 1960s, financial economists have enjoyed a relative wealth of high-quality data. Furthermore, there has always been an interest in prediction within finance. However, empirical stock market research, at least in academia, has typically focused on discovering anomalies in returns by sorting stocks along a single dimension such as market capitalization or book-to-market ratio. Occasionally researchers would double or even triple sort stocks into portfolios, but they generally would not explore additional dimensions. Haugen and Baker (1996), which used a few dozen variables to predict stock returns, was a notable exception. More distressingly, after generations of researchers “spinning” the CRSP tapes in a collective data mining exercise, many empirical findings about stock returns proved unreliable out of sample.
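
To make the contrast concrete, here is a minimal sketch (my own illustration, not taken from any of the papers above) of the classic single-characteristic sort: stocks are ranked on one variable, here a simulated book-to-market ratio, grouped into decile portfolios, and the portfolios’ average next-period returns are compared. The data and column names are hypothetical.

```python
# Minimal sketch of a single-characteristic portfolio sort on simulated data.
# "book_to_market" and "next_ret" are hypothetical column names.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
stocks = pd.DataFrame({
    "book_to_market": rng.lognormal(mean=0.0, sigma=0.5, size=n),
    "next_ret": rng.normal(loc=0.01, scale=0.05, size=n),
})

# Rank stocks on the single characteristic and cut into decile portfolios
# (0 = lowest book-to-market, 9 = highest).
stocks["decile"] = pd.qcut(stocks["book_to_market"], 10, labels=False)

# Equal-weighted average next-period return within each decile; the classic
# anomaly studies look for a spread between the extreme portfolios.
decile_returns = stocks.groupby("decile")["next_ret"].mean()
print(decile_returns)
print("High-minus-low spread:", decile_returns.iloc[-1] - decile_returns.iloc[0])
```

The Big Data alternative is to feed many such characteristics into a flexible prediction model at once, which is the direction Haugen and Baker (1996) gestured toward with their few dozen predictors.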

In recent years the data used in economic research have grown tremendously in both volume and variety. High-frequency financial datasets are notable examples. Others include unstructured data such as text from news and social media and audio from management conference calls discussing company earnings. Though somewhat late to the party, economists have increasingly adopted a Big Data paradigm. Reflecting these developments, the NBER devoted its Summer Institute last year to Econometric Methods for High-Dimensional Data.

More evidence for the growing interest of economists in Big Data comes from Hal Varian’s article Big Data: New Tricks for Econometrics in the latest issue of the Journal of Economic Perspectives. Varian, a long-time economics professor at Berkeley before becoming the chief economist at Google, writes that “my standard advice to graduate students these days is go to the computer science department and take a class in machine learning.”

He makes a number of observations, including (1) economists immediately reach for a linear or logistic regression when confronted with a prediction problem, even though better models are often available, (2) economists have not been as explicit as machine learning researchers about quantifying the costs of model complexity, and (3) cross-validation is a more realistic measure of performance than the in-sample measures commonly used in economics. While these judgments are perhaps a bit too sweeping, they raise important issues that I will explore in future posts, particularly from the perspective of a financial markets investor.
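
As a concrete illustration of points (1) and (3), here is a small sketch on simulated data (entirely my own example, not Varian’s): on a nonlinear data-generating process a plain linear regression predicts poorly while a random forest does far better, and the flexible model’s in-sample R² overstates what five-fold cross-validation reports.

```python
# Sketch: in-sample fit vs. cross-validated fit on simulated, nonlinear data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 2))
# A nonlinear data-generating process, so the linear model is misspecified.
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=500)

for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=200, random_state=0)):
    in_sample_r2 = model.fit(X, y).score(X, y)          # fit and score on the same data
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()   # 5-fold cross-validated R^2
    print(f"{type(model).__name__}: in-sample R^2 = {in_sample_r2:.2f}, "
          f"CV R^2 = {cv_r2:.2f}")
```

The same logic is why out-of-sample evidence matters so much for the return anomalies discussed above.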
