Is two still the magic number?

When doing data analysis, we have come to regard two as the threshold that a t-statistic must clear in order to declare a variable statistically significant. As most readers will know, this critical value corresponds to a 5% significance level in a two-tailed test given a reasonably large number of observations (the exact large-sample critical value is 1.96). In the context of investments, we might be looking at whether a variable such as a firm’s book-to-market ratio is related to its stock returns, or whether the time series of returns from some trading strategy or portfolio manager is reliably different from zero.

In my previous post on data snooping, I discussed some of the problems that arise from testing multiple hypotheses. In particular, if we conduct a large number of tests, we are very likely to find relationships or patterns in the data simply by chance. Yet, we do this all the time in finance when developing trading strategies or evaluating investment managers. So while two might be the appropriate threshold for a t-statistic when conducting a single test, it will leave us with too many strategies or managers bound to disappoint when we conduct a wide search.

Given a time series of returns, we could test the null hypothesis that the trading strategy has zero excess returns by calculating the t-statistic \frac{\hat{\mu}}{\hat{\sigma}/\sqrt{T}}, where \hat{\mu} is the mean excess return, \hat{\sigma} the standard deviation of the returns, and T the number of observations. If the t-statistic is high enough, we would reject the null hypothesis and declare that the strategy has outperformed. The question, of course, is how high the t-statistic (or, equivalently, how small the p-value) needs to be.

Since the Sharpe Ratio is simply \frac{\hat{\mu}}{\hat{\sigma}}, we can rewrite the expression above as t\text{-statistic} = \text{Sharpe Ratio} \times \sqrt{T}. Note that the time units must be consistent when using this formula, so if we are given an annual Sharpe Ratio we need to express T in years. Intuitively, the formula says that an investment strategy (or manager) that maintains a high Sharpe Ratio for, say, 10 years is much more likely to have genuine alpha than one with similar performance over only one year.
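As a quick illustration, here is a minimal Matlab sketch of that relationship; the Sharpe Ratio and track record length are assumed, hypothetical numbers:

SR = 0.8;                      % assumed annual Sharpe Ratio
T_years = 10;                  % assumed track record length, in years
t_stat = SR * sqrt(T_years)    % 2.5298: clears the usual hurdle of two
t_short = SR * sqrt(1)         % 0.8000: the same Sharpe over one year does not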

Now let’s make the main argument more concrete. Suppose that we test M hypotheses. H_0 represents a null hypothesis, which could differ across the M tests. We have M_0 hypotheses with true nulls and M_1 hypotheses with true alternatives and M = M_0 + M_1. We test each hypothesis and either reject the null (significant) or not reject the null (not significant). Thus, we have four cases, which we can summarize in a 2 x 2 table familiar to anyone who has taken a statistics course. The numbers on the margins are the column and row totals.

                     Not rejected   Rejected   Total
Null true            S              V          M_0
Alternative true     T              U          M_1
Total                M - R          R          M

S is the number of true null hypotheses that are correctly not rejected. V is the number of true null hypotheses that are falsely rejected; a test result in this category is known as a false discovery (or a Type I error). U is the number of false null hypotheses that are correctly rejected. T is the number of false nulls that are not rejected (an unfortunate clash with our earlier use of T for the number of observations); a test result in this category is a missed discovery (or a Type II error). Note also that we know M and R, and hence M-R, but not M_0 and M_1.

Define the family-wise error rate (FWE) as the probability of having at least one false discovery:

FWE = Pr(V\ge1)

Now suppose M = 1 (that is, we conduct exactly one test) and set the level of significance to the usual 5%. Then Pr(V\ge1) = 0.05; in other words, the chance of our making a false discovery is 5%. Now suppose that instead of a single test we have M = 20 tests, which we assume for now to be independent. In this case, the FWE is given by

FWE = [1 - (1 - 0.05)^{20}] = 0.6415.

So the chance of making a false discovery (e.g. a trading strategy or manager that we mistakenly believe has true alpha, but in fact does not) jumps to nearly two-thirds. If we increase the number of tests to M = 100, then we have a near certainty of making a false discovery:

FWE = [1 - (1 - 0.05)^{100}] = 0.9941.
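A minimal Matlab sketch reproducing these numbers for independent tests, each run at the 5% level:

alpha = 0.05;
for M = [1 20 100]
    FWE = 1 - (1 - alpha)^M;    % probability of at least one false discovery
    fprintf('M = %3d: FWE = %.4f\n', M, FWE);
end
% M =   1: FWE = 0.0500
% M =  20: FWE = 0.6415
% M = 100: FWE = 0.9941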

The Bonferroni adjustment, which I mentioned in a previous post, is one way to control the FWE. Suppose we want to ensure that FWE \le \alpha, where \alpha is, say, 0.05. Then we simply use \alpha/M as the significance level for each individual test. Continuing our example, we now have the following FWEs for M = 20 and M = 100, respectively:

FWE = [1 - (1 - 0.05/20)^{20}] = 0.0488

FWE = [1 - (1 - 0.05/100)^{100}] = 0.0488

In both cases, Bonferroni successfully keeps the FWE below 5%, which is what we wanted. However, this success comes at a price: for M = 100, we now require each test to have a p-value smaller than \alpha/M = 0.0005 in order to reject the null. (As an aside, we can do a little better than Bonferroni when the tests are independent, since the adjustment pushed the FWE slightly below 5%; just set the right-hand side to 5% and solve for the per-test significance threshold.)
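To make the aside concrete: solving 1 - (1 - p)^M = \alpha for the per-test threshold p gives p = 1-(1-\alpha)^{1/M}, known as the Sidak correction, which is exact for independent tests and slightly less strict than Bonferroni’s \alpha/M. A short Matlab sketch:

alpha = 0.05;
M = 100;
p_bonferroni = alpha / M             % 5.0000e-04
p_sidak = 1 - (1 - alpha)^(1/M)      % 5.1280e-04, slightly less strict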

What does this mean for the t-statistic hurdle when evaluating the returns of some investment strategy? Under the null (and under some statistical assumptions on the returns that we won’t go into), the t-statistic defined earlier follows a t-distribution with T-1 degrees of freedom. Suppose we have T = 120 observations and use a two-tailed test. For a significance level of 0.05, the critical value is 1.9801, so our magic number of two works fine. For a significance level of 0.05/100 = 0.0005, the critical value becomes 3.5791. (Matlab has a function tinv(P,V) that calculates the inverse cumulative distribution function of Student’s t, where P is the probability and V is the degrees of freedom; the expression tinv(1-0.0005/2, 119) returns 3.5791.) This higher threshold, however, is very stringent and will very likely cause us to miss trading strategies that actually do have alpha (that is, missed discoveries).
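For reference, both critical values above can be computed the same way (this sketch assumes the Statistics Toolbox, which provides tinv):

df = 119;                               % T - 1 degrees of freedom
crit_single = tinv(1 - 0.05/2, df)      % 1.9801, single test at 5%
crit_bonf   = tinv(1 - 0.0005/2, df)    % 3.5791, Bonferroni with M = 100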

An improved method to control the FWE is given by Holm (1979). The key insight is that Bonferroni doesn’t use all the information present in the collection of test statistics. We start by ordering the p-values for the M tests such that

P_{(1)}\le P_{(2)}\le\cdots\le P_{(M)} with corresponding null hypotheses H_{(1)},H_{(2)},\ldots,H_{(M)}.

Now define the sequence of hurdles \alpha_k=\frac{\alpha}{M+1-k} for k=1,\ldots,M. Note that \alpha_1=\alpha/M is exactly the Bonferroni threshold, and the hurdles gradually relax until \alpha_M=\alpha.

Start with the first test (k=1) and compare its p-value P_{(1)} with \alpha_1. If it clears the hurdle (that is, P_{(1)}<\alpha_1), reject H_{(1)} and go on to the next test. When we first encounter a test whose p-value fails to clear its hurdle, we stop: that hypothesis and all subsequent ones (which have higher p-values) are not rejected. All hypotheses rejected by Bonferroni will also be rejected by Holm, but Holm may reject some additional hypotheses. Thus Holm has more power (the ability to reject null hypotheses that are false or, in our context, to identify investment strategies that have genuine alpha).
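A minimal Matlab sketch of the step-down procedure, using hypothetical p-values chosen so that Holm rejects one more null than Bonferroni would:

p = [0.001 0.011 0.020 0.300 0.600];   % assumed p-values from M = 5 tests
alpha = 0.05;
M = numel(p);
[p_sorted, idx] = sort(p);             % order the p-values
hurdles = alpha ./ (M + 1 - (1:M));    % alpha/M, alpha/(M-1), ..., alpha
k = find(p_sorted >= hurdles, 1);      % first test that fails its hurdle
reject = false(1, M);
if isempty(k)
    reject(:) = true;                  % every hurdle cleared: reject all nulls
else
    reject(idx(1:k-1)) = true;         % reject only the tests before the failure
end
reject                                 % [1 1 0 0 0]; Bonferroni (p < 0.01) rejects only the first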

Both Bonferroni and Holm are designed to guard against making even a single false discovery, regardless of the number of tests, and may therefore be too stringent for our purposes. Having one false discovery out of, say, twenty rejections is not nearly as bad as having one out of five. Furthermore, we might prefer utilizing many trading strategies, a few of which may be useless, to finding no trading strategies at all because we set the statistical bar too high. The next post will discuss how to use this idea to control the proportion of false discoveries among the tests that we reject.

Summary

Multiple testing occurs frequently in finance when we examine many trading strategies or portfolio managers in an attempt to identify those with true alpha. We show that if we use hurdles for test statistics appropriate for a single test, we will very likely make a false discovery by virtue of having tested many hypotheses; thus, the commonly used threshold of two for a t-statistic is too low. We describe the Bonferroni and Holm methods, both procedures that control the family-wise error rate, that is, the probability of having at least one false discovery. With this groundwork in place, the next post will discuss ways to control the false discovery rate, which is arguably more suitable in the context of investments.

References

Holm, S., 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65-70.

Romano, J.P., Shaikh, A.M. and Wolf, M., 2010. Multiple testing. In: Durlauf, S.N. and Blume, L.E. (eds.), The New Palgrave Dictionary of Economics Online. Palgrave Macmillan. Accessed 16 August 2014.
