I’ve been thinking about methods for combining p-values from independent tests. Fisher’s method is the classic one, but in general it helps to think about the problem in geometric terms. A set of *n* p-values is a point in an n-dimensional cube. Under the null hypothesis, these points are distributed uniformly in the cube. We want a function *F* that maps points in this volume to [0,1]. Another requirement is that the volume satisfying should have volume *z*. We are mostly interested in the points near the origin, sets of small p-values. The question is how to specify which points are close to the origin.

[**updated**: a paper on this topic: Art Owen, KARL PEARSONâ€™S META-ANALYSIS REVISITED, The Annals of Statistics (2009) (arxiv.org stanford.edu)]

Fisher’s method calculates the product of p-values, then takes the logarithm and multiplies by -2. Under the null hypothesis, we have:

The Wikipedia page explains that the negative log of uniform random variables follows an exponential distribution. When scaled by 2 these follow a chi-squared distribution with 2 degrees of freedom, and a sum of chi-squared distributions is also chi-squared with the degrees of freedom summed. If the p-values are lower than expected under the null hypothesis, then the combined value from Fisher’s method will be in the upper tail. We can write this method in R:

`fishersMethod = function(x) pchisq(-2 * sum(log(x)),df=2*length(x),lower=FALSE) `

I was thinking of summing p-values as well, which would have different behavior to the log product. As described by John Cook, the sum of *n* random uniform variables follows a law which can be expressed fairly simply. The cumulative distribution functions is as follows (see John Cook’s blog for the density and reference)

And due to the Central Limit Theorem, as the number of random uniforms increases, the sum will converge in distribution to a normal distribution with mean n/2 and variance n/12. So we can also use this law to give a combined p-value to a set of p-values. If the p-values are lower than expected from the null, the sum would show up in the lower tail of the distribution. This cdf written as an R function:

psumunif = function(x,n) 1/factorial(n) * sum(sapply(0:n, function(k) (-1)^k * choose(n,k) * ifelse(x > k,x-k,0)^(n)))

and switching for large n to the normal approximation:

sumPvalsMethod = function(x) {

n = length(x)

if (n < 10) {

psumunif(sum(x),n)

} else {

pnorm(sum(x),n/2,sqrt(n/12),lower=TRUE)

}}

The difference between these two methods can be visualized with a 2 dimensional example. The axes are possible p-values from 0 to 1 for two independent tests. The bottom left corner is (0,0) and the top right corner is (1,1). The contours of the combined p-values are shown:

contourplot(outer(seq(0,1,.05),seq(0,1,.05),Vectorize(function(x,y) fishersMethod(c(x,y)))),main="Fisher's method")

contourplot(outer(seq(0,1,.05),seq(0,1,.05),Vectorize(function(x,y) sumPvalsMethod(c(x,y)))),main="Sum of p-values method")

The contours of the sum of p-values method are hyperplanes cutting off the corner of the n-dimensional cube. What might be preferable with the sum method is that the regions near the top left and bottom right corner are given higher values than with Fisher’s method. In other words, Fisher’s method will forgive a few large p-values given that the others are small. For example:

x = c(1e-3,1e-3,1e-3,1)

fishersMethod(x)

[1] 1.719731e-06

sumPvalsMethod(x)

[1] 0.04216892

A downside is that this method seems to be more sensitive to unbalanced p-values which can result from a bad construction of the null hypotheses underlying the original tests. The intuition is that the hyperplanes are reaching further into the middle of the cube than the contours of Fisher’s method. For example:

x = c(runif(1000),runif(100,.1,.2))

fishersMethod(x)

[1] 0.01510909

sumPvalsMethod(x)

[1] 0.0005628212

Finally, another method would be to count the number of p-values < .05. Given independent tests, under the null hypothesis for all tests, this count will be distributed as a binomial random variable with size n and probability .05. In this case the upper tail is where a set of too small p-values would show up:

` binomMethod = function(x) pbinom(sum(x < .05),size=length(x),prob=.05,lower=FALSE) `

Geometrically, this is represented by blocks of width .05 following the axes. Intersections of blocks result in lower p-values. One problem with this method is from discreteness: with small n, the combined p-value will not be uniformly distributed across [0,1]. In fact, the combined p-values from random uniforms do not look uniform going up to n = 1e4. In comparison, the distributions of the combined p-values from the other methods do look uniform even for small n. Another problem is that in depending on a small subset of the total p-values, the combined p-value could exhibit more variance than methods which use all p-values.

One advantage is in the case mentioned above, with unbalanced p-values. If these are outside of the .05 region, these won’t affect the binomial method very much. For example:

x = c(runif(1000),runif(100,.1,.2))

fishersMethod(x)

[1] 0.01510909

sumPvalsMethod(x)

[1] 0.0005628212

binomMethod(x)

[1] 0.4642092

Combining p-values is an interesting problem, but also misguided I think. A more profitable course would be to do some sort of meta-analysis. (combining the _effects_).

I agree and would recommend multilevel modeling when it is possible.

But hypothesis testing can be nice in certain circumstances. In my case I am trying to compare tens of thousands of observations in hundreds of groups, and I am aiming to do this in <1 second so that this can be repeated thousands of times.

While ensuring a uniform distribution of p-values is required in constructing a test, it doesn’t tell us anything about the statistical power of the test.

I used your code to compare the power of the Fisher method, the sum method, and an effects based method for a given set of simple hypotheses. Assuming equal variances, the effects method is more powerful than the sum method, but the sum method is more powerful than Fisher’s.

#R Code

N = 20000 #Number of times to repeat combining

k = 5 #Number of samples (p-values) for each trial

p_fish = numeric(N); p_sum = numeric(N); p_meta = numeric(N)

for (i in 1:N){

r = rnorm(k,-1,1) #Samples from alternate hypothesis 1 unit away (modify SD here for interesting results)

p = pnorm(r,0,1) #Get p-values for 1-sided test

p_fish[i] = fishersMethod(p)

p_sum[i] = sumPvalsMethod(p)

p_meta[i] = pnorm(mean(r),0,1/sqrt(k))

}

g <- c(rep(1, N), rep(2, N), rep(3, N))

#Requires Hmisc package

Ecdf(c(p_fish, p_sum, p_meta), group =g, col=c('blue', 'red', 'green'),

xlab="Alpha (Type 1 Error)", ylab="Statistical Power (1-Beta)")