Designing multi-lab replication projects: Number of labs matters more than number of participants

In a multi-lab replication project, multiple teams of investigators agree to run the same study (or studies) concurrently at different research sites. The best examples of this in psychology are the various Many Labs projects. There are lots of reasons why multi-lab replication projects are great. For example, they allow us to estimate and potentially model any between-site variability in the effect size, so we learn more about the generality of the effect. Another reason is that they can have greater statistical power than single-lab studies, as long as they involve a large enough sample of labs. The purpose of this blog post is to underscore that last point: the number of labs needed to ensure high statistical power.

The verbal (intuitive?) explanation

We’re used to thinking about the number of participants as being, apart from the effect size, the chief factor in determining the statistical power of a study. And most of the time, in the kinds of single-lab studies we tend to run, this is basically true. Other factors matter as well — for example, the number of times we observe each participant, and the proportion of observations in each cell of the design — but as long as these other factors are within a typically sane range, they tend not to matter as much as the number of participants.

So it is perhaps natural that we apply this same heuristic to multi-lab replication projects. We reason that even if each lab can only recruit, say, 100 participants, and even if we can only recruit, say, 5 labs, this will give us a total of 500 participants, so statistical power will still be quite high! Right?

But here’s the thing. The reason the number of participants has such a big impact on power in the single-lab case is that, in those cases, the participants are the highest-level units or clusters in the design. That is to say, we can potentially have multiple observations of each participant — for example, we collect 50 reaction times from each participant — but these observations are clustered within participants, which are the high-level units. It turns out that the proper generalization of the “number of participants is important” heuristic is, in fact, “the number of highest-level units is important — the lower-level units, not as much.”

So now consider a multi-lab replication project. Here, the participants are clustered in labs. So the labs are the highest-level units. Remember the earlier example about having a study with 5 labs, each with 100 participants? In terms of its statistical power, this would be roughly like running a single-lab study with 5 participants, each of whom contributes 100 reaction times. In other words, it wouldn’t be great.

The quantitative view

Let’s look at some actual power results. We consider a simple multi-lab design where we have $m$ labs, each of which recruits $n$ participants who are divided into two conditions ($n/2$ participants per condition), and we observe each participant only a single time. In other words, we have a simple two-group between-subject experiment that is being replicated at $m$ different labs, and the labs have random effects. The key quantity for determining the power of the study is $\delta$, the noncentrality parameter (for a non-central t distribution). It looks like this:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

where $d$ is the effect size (Cohen’s d), $E$ is the proportion of random variation due to error variance (i.e., the ratio of the Error variance to the [weighted] sum of all the variance components), and $L$ is the proportion of random variation due to Lab variance (actually, it’s the proportion of Lab-by-Condition interaction variance, but I’m calling it Lab variance for short). Statistical power is pretty much determined by the noncentrality parameter. There’s technically also some influence of the degrees of freedom, but that tends not to matter much as long as it is above, say, 20 or so. So basically, we can understand the power of this multi-lab replication design by considering this noncentrality parameter expression.
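To get a feel for the expression, here is a minimal Python sketch that evaluates $\delta$ for two ways of splitting the same 500 participants. The function names, the 25 labs x 20 participants comparison, and the normal approximation to power are my own illustrative assumptions. The approximation ignores the degrees of freedom entirely, so with only a handful of labs it is likely optimistic and should not be read as an exact power value.

```python
from math import sqrt
from statistics import NormalDist


def noncentrality(d, m, n, E=0.5, L=0.05):
    """Noncentrality parameter: d / (2 * sqrt(E/(m*n) + L/m))."""
    return d / (2 * sqrt(E / (m * n) + L / m))


def approx_power(delta, alpha=0.05):
    """Two-sided power from a normal approximation (assumption: ignores df)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta - crit) + z.cdf(-delta - crit)


# Same 500 participants in total, split two ways (illustrative values)
few_labs = noncentrality(d=0.4, m=5, n=100)
many_labs = noncentrality(d=0.4, m=25, n=20)

print("5 labs x 100:", few_labs, approx_power(few_labs))
print("25 labs x 20:", many_labs, approx_power(many_labs))
```

With the total number of participants held fixed, the $E/(mn)$ term is identical in both cases, so the whole difference in $\delta$ comes from the $L/m$ term.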

First, let’s just plug a range of plausible values into the variables comprising $\delta$ and see what values of statistical power they imply. Here’s a series of contour plots where we vary the variables within plausible ranges.

Figure: Statistical power of the multi-lab replication design as a function of m, n, d, and L.

The ranges of values for m and n probably don’t need any additional justification. For the range of Cohen’s d effect sizes, see this earlier blog post. The proportion of Error variance is always fixed at E = 50%, which in my informed opinion is a plausible value, but basically E doesn’t usually have much impact on power anyway, so the exact value is not too important. The range of values for L, the proportion of Lab variance, is much more interesting. As you can see, this actually has a big impact on power, so it’s important that our assumed values of L are reasonable. I have assumed that a plausible range is from about 1% to 10%, with the most plausible value around 5% or so. The justification for this is rather involved, so I wrote up a separate little document about it HERE. R code to reproduce this figure can be found HERE.

The middle panel represents what I think are the most plausible values. There are a couple of interesting things to point out about these power results. The first is that…

Increasing the number of labs usually raises power more quickly than increasing the number of participants per lab

The way to see this in the graphs is to consider the angle of the contours in each plot. More specifically, for any given point (i.e., pair of sample sizes) in any of the plots, consider the direction in which we would want to step to increase power fastest. For most parameter combinations, the path of steepest ascent up the power surface goes up along the y-axis (number of labs) more than it goes sideways along the x-axis (participants per lab). This is especially true when there is a lot of Lab variance ( $L=10\%$, in the right-hand column), but is still usually true when there is little Lab variance ( $L=1\%$, in the left-hand column).

There is another way of visualizing this that makes the point clearer. It uses the idea of indifference curves: technically speaking, curves where the rate of change in the noncentrality parameter w.r.t. the number of labs is equal to the rate of change w.r.t. participants per lab. The way these indifference curves are plotted below, studies that are below the relevant indifference curve would get a greater power benefit from increasing the number of labs, and studies that are above the relevant indifference curve would get a greater power benefit from increasing the number of participants per lab. For studies that lie exactly on the indifference curve, power would increase just as fast by increasing the number of labs as by increasing the number of participants per lab.

Figure: Proportion of Error variance is fixed at E = 50%. The indifference curves do not depend on the effect size d… yay!

As you can see, most of the time there is a greater benefit (i.e., statistical power will increase faster) to increasing the number of labs. This is especially true if there is a lot of Lab variance. But it tends to be true even when there is little Lab variance. The cases where it makes more sense to increase the number of participants per lab are when you already have a large number of labs but a small number of participants per lab. And let’s face it, your multi-lab replication project is probably not in this part of the space.
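If you set the two partial derivatives of the noncentrality parameter equal to each other and solve for the number of labs, you get $m = n + (L/E)\,n^2$. That closed form is my own derivation from the expression for $\delta$, not necessarily how the figure was produced, and the function names below are hypothetical. The sketch checks the result numerically with central differences.

```python
from math import sqrt


def delta(d, m, n, E=0.5, L=0.05):
    """Noncentrality parameter: d / (2 * sqrt(E/(m*n) + L/m))."""
    return d / (2 * sqrt(E / (m * n) + L / m))


def indifference_labs(n, E=0.5, L=0.05):
    """Labs m at which d(delta)/dm equals d(delta)/dn (derived, an assumption)."""
    return n + (L / E) * n ** 2


def slopes(d, m, n, E=0.5, L=0.05, h=1e-4):
    """Central-difference estimates of (d delta / d m, d delta / d n)."""
    dm = (delta(d, m + h, n, E, L) - delta(d, m - h, n, E, L)) / (2 * h)
    dn = (delta(d, m, n + h, E, L) - delta(d, m, n - h, E, L)) / (2 * h)
    return dm, dn


n = 20
m_star = indifference_labs(n)   # crossover number of labs for n = 20
print(m_star)
print(slopes(0.4, m_star, n))   # on the curve: the two slopes should agree
print(slopes(0.4, 10, n))       # well below the curve: the lab slope is larger
```

With E = 50%, L = 5%, and 20 participants per lab, the crossover sits at about 60 labs, which is far more labs than most projects recruit. The effect size $d$ only scales both slopes by the same factor, which is why it drops out of the curve.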

Increasing the number of participants per lab — but holding constant the number of labs — will not, in general, cause statistical power to approach 100%

This one is important. We tend to assume that as long as we continue to recruit participants, eventually we will have high statistical power. So even if we didn’t recruit as many labs for our project as we had hoped, we should be able to compensate for this by just recruiting more participants per lab, right? Unfortunately, it isn’t so. The truth is that if we hold constant the number of labs, statistical power does not approach 100% as the number of participants per lab approaches infinity. Instead, power approaches some maximum attainable value that can be quite small, depending on the effect size, number of labs, and the Lab variance.

This can actually be seen pretty easily by considering again the expression of the noncentrality parameter: $\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

In the limit as $n$ approaches infinity, the term in the denominator involving $E$ disappears, but the term involving $L$ does not, so the whole noncentrality parameter converges to a finite and possibly small value. Here’s what the situation looks like graphically for a handful of representative values of the variables:

Figure: Effect size d = 0.4. Proportion of Error variance E = 50%. In the left panel, L is 1%, 5%, or 10%, and n is 16, 32, or 64. In the right panel, L is 1%, 5%, or 10%, and m is 4, 8, or 16.

The curves in each panel are unlabeled because, honestly, they don’t really matter. (If you want to know the power values for specific parameter combinations, the first graph is better for that anyway.) The point is just to show that when we increase the number of labs, power does eventually approach 100%, and the values of the other variables simply affect how quickly this happens. But when we increase the number of participants per lab (while, crucially, holding constant the number of labs), power sometimes approaches a value close to 100%, but often does not. The maximum attainable power will tend to be low when (a) the effect size is small, (b) the number of labs is small, or (c) the Lab variance is high.
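The ceiling is easy to compute directly: dropping the $E$ term from $\delta$ leaves $d\sqrt{m}/(2\sqrt{L})$. Here is a rough sketch that tabulates it for d = 0.4 and the m and L values listed for the right-hand panel. The function names are hypothetical, and power is again a normal approximation that ignores the degrees of freedom, so the true ceilings for small numbers of labs are probably lower than what this prints.

```python
from math import sqrt
from statistics import NormalDist


def max_delta(d, m, L):
    """Limit of delta as n -> infinity: d * sqrt(m) / (2 * sqrt(L))."""
    return d * sqrt(m) / (2 * sqrt(L))


def approx_power(delta, alpha=0.05):
    """Two-sided power from a normal approximation (assumption: ignores df)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta - crit) + z.cdf(-delta - crit)


# d = 0.4 with the m and L values listed for the right-hand panel
for m in (4, 8, 16):
    for L in (0.01, 0.05, 0.10):
        ceiling = approx_power(max_delta(0.4, m, L))
        print(f"m={m:2d}  L={L:.2f}  approximate power ceiling={ceiling:.2f}")
```

No number of participants per lab can push power past these ceilings. Only more labs, a larger effect, or less Lab variance can raise them.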

Conclusion

The main conclusion is pretty simple. Multi-lab replication projects are great, and you should do them… but when you’re designing them, you should really try hard to recruit as many labs to collaborate in the project as you possibly can. You actually don’t need that many participants from each lab — the first figure shows that unless the Lab variance is tiny, you get quite diminished returns by recruiting more than 100 or so participants per lab — so perhaps this is a selling point that you can use to convince more labs to join your project (“we would only need you to run a few dozen participants!”).

If you want to do your own power analyses for multi-lab designs like this, or for other more complicated designs, you can use my PANGEA app.