# Designing multi-lab replication projects: Number of labs matters more than number of participants

In a multi-lab replication project, multiple teams of investigators team up to all run the same study (or studies) concurrently at different research sites. The best examples of this in psychology are the various Many Labs projects. There are lots of reasons why multi-lab replication projects are great. For example, they allow us to estimate and potentially model any between-site variability in the effect size, so we learn more about the generality of the effect. Another reason is that they can have greater statistical power than single-lab studies — as long as they involve a large enough sample size of labs. The point of this blog post is to underscore this point about the number of labs needed to ensure high statistical power.

## The verbal (intuitive?) explanation

We’re used to thinking about the number of participants as being, apart from the effect size, the chief factor in determining the statistical power of a study. And most of the time, in the kinds of single-lab studies we tend to run, this is basically true. Other factors matter as well — for example, the number of times we observe each participant, and the proportion of observations in each cell of the design — but as long as these other factors are within a typically sane range, they tend not to matter as much as the number of participants.

So it is perhaps natural that we apply this same heuristic to multi-lab replication projects. We reason that even if each lab can only recruit, say, 100 participants, and even if we can only recruit, say, 5 labs, this will give us a total of 500 participants, so statistical power will still be quite high! Right?

But here’s the thing. The reason the number of participants has such a big impact on power in the single-lab case is that, in those cases, the participants are the highest-level units or clusters in the design. That is to say, we can potentially have multiple observations of each participant — for example, we collect 50 reaction times from each participant — but these observations are clustered within participants, which are the high-level units. It turns out that the proper generalization of the “number of participants is important” heuristic is, in fact, “the number of highest-level units is important — the lower-level units, not as much.”

So now consider a multi-lab replication project. Here, the participants are clustered in labs. So the labs are the highest-level units. Remember the earlier example about having a study with 5 labs, each with 100 participants? In terms of its statistical power, this would be about like running a single-lab study with 5 participants, each of whom contributes 100 reaction times. In other words, it wouldn’t be great.

## The quantitative view

Let’s look at some actual power results. We consider a simple multi-lab design where we have $m$ labs, each of which recruits $n$ participants that are divided into two conditions ($n/2$ participants per condition), and we observe each participant only a single time. In other words, we have a simple two-group between-subject experiment that is being replicated at $m$ different labs, and the labs have random effects. The key quantity for determining the power of the study is $\delta$, the noncentrality parameter (for a non-central t distribution). It looks like this:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

where $d$ is the effect size (Cohen’s d), $E$ is the proportion of random variation due to error variance (i.e., the ratio of the Error variance over the [weighted] sum of all the variance components), and $L$ is the proportion of random variation due to Lab variance (actually, it’s the proportion of Lab-by-Condition interaction variance, but I’m calling it Lab variance for short). Statistical power is pretty much determined by the noncentrality parameter — there’s technically also some influence of the degrees of freedom, but that tends not to matter much as long as it is above, say, 20 or so. So basically, we can understand the power of this multi-lab replication design by considering this noncentrality parameter expression.

First, let’s just plug a range of plausible values into the variables comprising $\delta$ and see what values of statistical power they imply. Here’s a series of contour plots where we vary the variables within plausible ranges.

The middle panel represents what I think are the most plausible values. There are a couple of interesting things to point out about these power results. The first is that…

### Increasing the number of labs usually raises power more quickly than increasing the number of participants per lab

The way to see this in the graphs is to consider the angle of the contours in each plot. More specifically, for any given point (i.e., pair of sample sizes) in any of the plots, consider the direction in which we would want to step to increase power fastest. For most parameter combinations, the path of steepest ascent up the power surface goes up along the y-axis (number of labs) more than it goes sideways along the x-axis (participants per lab). This is especially true when there is a lot of Lab variance ($L=10\%$, in the right-hand column), but is still usually true when there is little Lab variance ($L=1\%$, in the left-hand column).

There is another way of visualizing this that makes this point more clear. It uses the idea of indifference curves — technically speaking, curves where the rate of change in the noncentrality parameter w.r.t. the number of labs is equal to the rate of change w.r.t. participants per lab. The way these indifference curves are plotted below, studies that are below the relevant indifference curve would get a greater power benefit from increasing the number of labs, and studies that are above the relevant indifference curve would get a greater power benefit from increasing the number of participants per lab. For studies that lie exactly on the indifference curve, power would increase just as fast by increasing the number of labs as by increasing the number of participants per lab.

As you can see, most of the time there is a greater benefit (i.e., statistical power will increase faster) to increasing the number of labs. This is especially true if there is a lot of Lab variance. But it tends to be true even when there is little Lab variance. The cases where it makes more sense to increase the number of participants per lab are when you already have a large number of labs but a small number of participants per lab. And let’s face it, your multi-lab replication project is probably not in this part of the space.

### Increasing the number of participants per lab — but holding constant the number of labs — will not, in general, cause statistical power to approach 100%

This one is important. We tend to assume that as long as we continue to recruit participants, eventually we will have high statistical power. So even if we didn’t recruit as many labs for our project as we had hoped, we should be able to compensate for this by just recruiting more participants per lab — right? Unfortunately it isn’t so. The truth is that if we hold constant the number of labs, statistical power does not approach 100% as the number of participants per lab approaches infinity. Instead, power approaches some maximum attainable power value that can possibly be quite small, depending on the effect size, number of labs, and the Lab variance.

This can actually be seen pretty easily by considering again the expression of the noncentrality parameter:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

In the limit as $n$ approaches infinity, the term in the denominator involving $E$ disappears, but the term involving $L$ does not, so the whole noncentrality parameter converges to a finite and possibly small value. Here’s what the situation looks like graphically for a handful of representative values of the variables:

The curves in each panel are unlabeled because, honestly, they don’t really matter. (If you want to know the power values for specific parameter combinations, the first graph is better for that anyway.) The point is just to show that when we increase the number of labs, power does eventually approach 100%, and the values of the other variables simply affect how quickly this happens. But when we increase the number of participants per lab — but, crucially, hold constant the number of labs — power sometimes approaches a value close to 100%, but often does not. The maximum attainable power will tend to be low when (a) the effect size is small, (b) the number of labs is small, (c) the Lab variance is high.

## Conclusion

The main conclusion is pretty simple. Multi-lab replication projects are great, and you should do them… but when you’re designing them, you should really try hard to recruit as many labs to collaborate in the project as you possibly can. You actually don’t need that many participants from each lab — the first figure shows that unless the Lab variance is tiny, you get quite diminished returns by recruiting more than 100 or so participants per lab — so perhaps this is a selling point that you can use to convince more labs to join your project (“we would only need you to run a few dozen participants!”).

If you want to do your own power analyses for multi-lab designs like this, or for other more complicated designs, you can use my PANGEA app.

# Don’t fight the power (analysis)

Researchers often feel uneasy about using power analysis to design their actual experiments because of uncertainty about the effect size in the study to be run. A common sentiment that one hears goes something like:

“I can’t do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn't have to run the study in the first place!”

The implication of this view is that, unless one has actually done experiments in the past that are pretty similar to the one being considered, there is otherwise no justifiable basis for making any particular assumptions about the effect size in the present study. In order to have a good idea about the effect size, the argument goes, we have to actually run the study, at which point obviously the power analysis is no longer needed. Convinced by this reasoning, many researchers throw up their hands, decide that power analysis will not be useful here or perhaps ever, and just plan instead on collecting some loosely conventional sample size that depends on their research area, but is usually something like 20-30 observations per cell of the design. In other words, they fight the power.

I’m here to convince you that fighting the power is a self-defeating research habit.

#### You know more than you think before the study

The first premise of the argument against power analysis is that we know little or nothing about the effect size before the study has been run. On the contrary. In the year 2015 we can benefit from decades of meta-analyses that have summarized the typical effect sizes found in almost any imaginable corner of the research literature. We even have meta-meta-analyses of those meta-analyses. The effect size in your future study is likely to resemble the effect sizes of the past, and luckily for us, the meta-analytic data on typical effect sizes are vast.

I want to illustrate just how good our situation really is by considering what is probably our worst case scenario in terms of study design: The case where we know absolutely nothing whatsoever about the study to be run except that the topic matter could broadly be classified as “social psychology” or some related field. In that case, we can use the data from Richard, Bond, and Stokes-Zoota (2003), who conducted a meta-analysis of meta-analyses in the field of social psychology to determine the range of typical effect sizes across the entire field, involving some 25,000 individual studies published over 100 years in diverse research areas. While the focus of this meta-meta-analysis was the field of social psychology, I believe there is little reason to expect the distribution of typical effect sizes to be appreciably different in other areas of psychology, such as cognitive psychology (if you are aware of large-scale meta-analytic data to the contrary, please let me know). Anyway, the figure below summarizes the distribution of effect sizes that they found.

Their meta-analysis actually examined the effects on the Pearson’s r (correlation) scale, and the bumpy density curve in the left panel shows their aggregated data (copied/pasted from their Figure 1). The smooth curve overlaying that data is the best-fitting beta distribution1, on which the percentiles and other statistics are based, and the curve in the right panel is based on applying a standard conversion formula to the smooth curve in the left panel2.

What this shows is that, in the absence of any other information about the study you are about to run, a pretty reasonable assumption about the effect size is that it is equal to the historical average: = 0.21 or d = 0.45. Or you could use the median, or be conservative and go with the 30th percentile, or whatever you want. The point is, we have enough information to make a pretty well-informed decision even if we have no specific information at all about the actual study.

Of course, in most cases in the real world, you probably do know something about the study you are about to run. In almost all cases, that knowledge will allow you to make an even more refined estimate of the effect size, either by finding a meta-analysis that looks more specifically at effects that are conceptually similar to yours (you could even start with Richard et al., who helpfully break down the average effect size in social psychology by broad research area), or just by starting with the aggregate historical estimate and adjusting from there based on how you think your study differs from the average study of the past.

#### You know less than you think after the study

The argument that opened this post pointed out that we don’t know the effect size before the study has been run. That’s true, but of course, we don’t know the effect size after the study has been run either. Instead what we have is some data from which we can construct an estimate of the effect size. Realizing this allows us to ask the quantitative question: Just how good of an effect size estimate do we have at the end of a typically-sized experiment? If our estimate of the effect size after an initial study is not much better than what we could already surmise based on the historical, meta-analytic data, then it doesn’t make a lot of sense to trust the former a lot more than the latter.

Consider a typical study in which we compare two independent groups with n=30 participants per group. Below I’ve simulated some data in which the standardized mean difference between the two groups is exactly equal to the historical average of d = 0.45. The figure below shows a bootstrap sampling distribution of the effect size in this situation3. If we ignore all prior information that we have about typical values of the effect size, as many researchers routinely do, then this sampling distribution summarizes everything we know about the effect size after running a pretty typical study.

Compare this distribution to the right panel of the first Figure from above, which showed our prior knowledge about the likely values of d. In terms of how much information they carry about d, the two distributions are really not that different. The sampling distribution is slightly less variable—it has a standard deviation of 0.27 rather than 0.37—but this difference in variability is quite hard to see from visual inspection.

#### Living with uncertainty

Whether we use historical data or data from previous experiments we have run, there will always be some uncertainty about the effect size. So there are a range of plausible assumptions we could make about the effect size when doing a power analysis, and these different assumptions imply different sample sizes to collect in the study. In many cases, the uncertainty will be pretty high, so that the range of recommended sample sizes will be quite wide, a fact which many researchers find disconcerting.

Uncertainty is a fact of scientific life and should be no cause for dismay. We have all (hopefully) learned to be comfortable with uncertainty in other aspects of the research process. Unfortunately, many researchers seem oddly unwilling to accept even modest uncertainty in the planning phase of the research. In responding to such a view, it’s hard to put it better than @gung did in this answer on Cross Validated:

“Regarding the broader claim that power analyses (a-priori or otherwise) rely on assumptions, it is not clear what to make of that argument. Of course they do. So does everything else. Not running a power analysis, but just gathering an amount of data based on a number you picked out of a hat, and then analyzing your data, will not improve the situation.”

Uncertainty is there whether we like it or not. We should try to make the best design decisions possible in light of that uncertainty. Power analysis is our best tool for doing so. Before I close the post, let me clarify: In my opinion, there is nothing wrong with planning experiments based on rules of thumb. I acknowledge that much of the time it won’t make sense to do a formal power analysis for each and every experiment, because often we won’t have a lot of specific information about the particular study we’re about to run beyond the kind of general information we have about the typical experiments we tend to run. My point is that we should apply statistically well-informed rules of thumb that are based on historical, meta-analytic data, and are calibrated to work pretty well in a range of realistic research situations—not dubious heuristics like an n=30 rule. One of the most important functions of power analysis is to help us construct such good rules of thumb.

1 For those interested, the parameters of this beta distribution are about $\alpha=1.34, \beta=5.03$.

2 The correct conversion from Pearson’s r to Cohen’s d depends on the assumed proportion of participants in the two groups. The statistics that I present in the figure are based on the standard formula that assumes the group sizes are equal. I experimented with various ways of relaxing that assumption in a realistic manner, but ultimately found that the difference was negligible unless one assumes the group sizes tend to be markedly and unrealistically unequal.

3 The mean shown in the figure is the mean of the bootstrap distribution. This mean is slightly higher than the assumed value of 0.45 because the sampling distribution of d is slightly positively skewed, reflecting the fact that sample d is a slightly positively biased estimate of population d.

4 Thanks to Katie Wolsiefer for this figure caption, which is way better than my original.

# Follow-up: What about Uri’s 2n rule?

This post is a quick follow-up to yesterday’s post on sample size rules. Basically I thought it was a little too long to go in the comments section so here it is.

Some people on twitter (and in my blog comments) remarked that my conclusion appears to fly in the face of some things Uri Simonsohn wrote on a similar topic not too long ago. Briefly, Uri writes of a Study 1 in which there are two independent groups (A1 and B1) and some non-zero effect size, and a Study 2 in which we move to a 2×2 design in which the difference between conditions A1 and B1 is the same size as before, but the difference between the other two conditions A2 and B2 is exactly 0, and we are now testing the interaction effect. Uri concludes: “To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.” Let’s call this Uri’s 2n rule*.

I thought Uri’s post was cool and certainly don’t think my point contradicts the point he made. The important thing to note here is that the effect size that Uri assumes to be constant in both studies is just the A1-B1 difference. But that’s not the effect we’re actually testing in Study 2: we’re testing the interaction effect, i.e., the A1 – B1 – A2 + B2 difference. And there is no assumption that the effect size for the interaction in Study 2 is equal to the effect size for the A1-B1 difference in Study 1. In fact, the 2n rule depends on the relevant effect size in Study 2 being half the relevant effect size in Study 1. That’s why you must increase the sample size when moving to Study 2; for the situation Uri’s talking about, the relevant effect size gets cut in half in Study 2. In my post I’m talking about cases where the relevant effect size is held constant. If the relevant effect size is held constant, then adding cells to the design has a negligible impact on power.

Numerical example

Consider the following table of cell means and standard deviations (the latter in parentheses).

Let’s say there are 20 subjects in each cell. Now if Study 1 involves only groups A1 and B1 (so that there is N=40 in total) then the power of the study is 34%. And if Study 2 involves all four groups (so that there is N=80 in total), then the power to detect the interaction effect is only 20%. But if we double the sample size (so that there is n=40 in each cell and N=160 in total), then the power to detect the interaction effect is 34%, just as it was in Study 1. This is the 2n rule that Uri wrote about, and I don’t dispute it.

But now let’s look at the standardized effect sizes for the two studies. We use Cohen’s d, defined as $d=(\mu_1-\mu_2)/\sigma$, where $\mu_1$ and $\mu_2$ are the two means being compared and $\sigma$ is the pooled standard deviation, or equivalently, the root mean squared error (RMSE). Computing this in Study 1 is straightforward since there are only two groups; we have $d=\frac{0.5-0}{1} = 0.5$.

Computing d in Study 2 is a little less clear since in that case we have four groups and not two. We saw above that the relevant mean difference is the A1 – B1 – A2 + B2 difference. The key here is to realize that the interaction effect essentially still comes down to a comparison of two groups: We are comparing the A1 and B2 groups (which have coefficients of +1 in this difference score) against the A2 and B1 groups (which have coefficients of -1). So the two relevant means to use in computing d are the mean of the A1 and B2 means, and the mean of the A2 and B1 means. This gives us $d=(\frac{0.5+0}{2}-\frac{0+0}{2})/1=0.25$ for the interaction effect. In other words, the relevant effect size in Study 2 is half of what the relevant effect size in Study 1 was.

It’s easy to show this symbolically. Let $d_1$ be the effect size in Study 1 and $d_2$ be the effect size in Study 2. Then, starting with the classical definition of d,

$d_2=\frac{\mu_1-\mu_2}{\sigma}=\frac{\frac{\mu_{A1}+\mu_{B2}}{2}-\frac{\mu_{A2}+\mu_{B1}}{2}}{\sigma}=\frac{(\mu_{A1}-\mu_{B1})-(\mu_{A2}-\mu_{B2})}{2\sigma}$.

In this example we’ve assumed the $\mu_{A2}-\mu_{B2}$ difference is 0, so that

$d_2=\frac{\mu_{A1}-\mu_{B1}}{2\sigma}=\frac{d_1}{2}$.

* Robert Abelson wrote about a similar phenomenon which he called the 42% rule: If the A1-B1 difference just reaches significance at p=.05 with size d, then the A2-B2 difference has to be at least 42% as large as d in the opposite direction for the interaction in Study 2 to reach p=.05 with the same sample size.

# Think about total N, not n per cell

The bottom line of this post is simple. There are lots of rules of thumb out there for minimum sample sizes to use in between-subjects factorial experiments. But they are virtually always formulated in terms of the sample size per cell, denoted as small n. For example, one ubiquitous informal rule suggests using about n=30 or so. The problem is that cursory power analysis shows that rules based on small n don’t really make sense in general, because what is more directly relevant for power is the total sample size, denoted as big N. So if you must rely on rules of thumb—which I actually don’t have any big problems with—try to use sample size rules based on big N, not small n.

The example

The idea of writing about this came from a recent interaction with a colleague, which interaction I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of N=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s d = 0.5.

This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about n=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only n=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not decrease. So their power calculation must be flawed, right?”

The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small n.

Big N vs. small n

Mathematically, it’s a little tricky to say that power is a function of N rather than n. After all, one can write the relevant equations in terms of N or in terms of n, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of N than it is a function of n, so that rules of thumb based on N are far more useful than rules of thumb based on n.

One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on N and one based on n, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.

Here I’ve chosen an N=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on N, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on n, power remains a more complicated, joint function of the effect size and the factorial structure.

Final caveat / technical footnote

Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called 2k factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast—which we might call N’—works pretty well as a simple approximation in most cases. The more technically correct (but more complicated) procedure depends on the product of N and the variance of the contrast; see this working paper of mine for more details.