Designing multi-lab replication projects: Number of labs matters more than number of participants

In a multi-lab replication project, multiple teams of investigators team up to all run the same study (or studies) concurrently at different research sites. The best examples of this in psychology are the various Many Labs projects. There are lots of reasons why multi-lab replication projects are great. For example, they allow us to estimate and potentially model any between-site variability in the effect size, so we learn more about the generality of the effect. Another reason is that they can have greater statistical power than single-lab studies — as long as they involve a large enough sample size of labs. The point of this blog post is to underscore this point about the number of labs needed to ensure high statistical power.

The verbal (intuitive?) explanation

We’re used to thinking about the number of participants as being, apart from the effect size, the chief factor in determining the statistical power of a study. And most of the time, in the kinds of single-lab studies we tend to run, this is basically true. Other factors matter as well — for example, the number of times we observe each participant, and the proportion of observations in each cell of the design — but as long as these other factors are within a typically sane range, they tend not to matter as much as the number of participants.

So it is perhaps natural that we apply this same heuristic to multi-lab replication projects. We reason that even if each lab can only recruit, say, 100 participants, and even if we can only recruit, say, 5 labs, this will give us a total of 500 participants, so statistical power will still be quite high! Right?

But here’s the thing. The reason the number of participants has such a big impact on power in the single-lab case is that, in those cases, the participants are the highest-level units or clusters in the design. That is to say, we can potentially have multiple observations of each participant — for example, we collect 50 reaction times from each participant — but these observations are clustered within participants, which are the high-level units. It turns out that the proper generalization of the “number of participants is important” heuristic is, in fact, “the number of highest-level units is important — the lower-level units, not as much.”

So now consider a multi-lab replication project. Here, the participants are clustered in labs. So the labs are the highest-level units. Remember the earlier example about having a study with 5 labs, each with 100 participants? In terms of its statistical power, this would be about like running a single-lab study with 5 participants, each of whom contributes 100 reaction times. In other words, it wouldn’t be great.

The quantitative view

Let’s look at some actual power results. We consider a simple multi-lab design where we have $m$ labs, each of which recruits $n$ participants that are divided into two conditions ($n/2$ participants per condition), and we observe each participant only a single time. In other words, we have a simple two-group between-subject experiment that is being replicated at $m$ different labs, and the labs have random effects. The key quantity for determining the power of the study is $\delta$, the noncentrality parameter (for a non-central t distribution). It looks like this:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

where $d$ is the effect size (Cohen’s d), $E$ is the proportion of random variation due to error variance (i.e., the ratio of the Error variance over the [weighted] sum of all the variance components), and $L$ is the proportion of random variation due to Lab variance (actually, it’s the proportion of Lab-by-Condition interaction variance, but I’m calling it Lab variance for short). Statistical power is pretty much determined by the noncentrality parameter — there’s technically also some influence of the degrees of freedom, but that tends not to matter much as long as it is above, say, 20 or so. So basically, we can understand the power of this multi-lab replication design by considering this noncentrality parameter expression.

First, let’s just plug a range of plausible values into the variables comprising $\delta$ and see what values of statistical power they imply. Here’s a series of contour plots where we vary the variables within plausible ranges.

The middle panel represents what I think are the most plausible values. There are a couple of interesting things to point out about these power results. The first is that…

Increasing the number of labs usually raises power more quickly than increasing the number of participants per lab

The way to see this in the graphs is to consider the angle of the contours in each plot. More specifically, for any given point (i.e., pair of sample sizes) in any of the plots, consider the direction in which we would want to step to increase power fastest. For most parameter combinations, the path of steepest ascent up the power surface goes up along the y-axis (number of labs) more than it goes sideways along the x-axis (participants per lab). This is especially true when there is a lot of Lab variance ($L=10\%$, in the right-hand column), but is still usually true when there is little Lab variance ($L=1\%$, in the left-hand column).

There is another way of visualizing this that makes this point more clear. It uses the idea of indifference curves — technically speaking, curves where the rate of change in the noncentrality parameter w.r.t. the number of labs is equal to the rate of change w.r.t. participants per lab. The way these indifference curves are plotted below, studies that are below the relevant indifference curve would get a greater power benefit from increasing the number of labs, and studies that are above the relevant indifference curve would get a greater power benefit from increasing the number of participants per lab. For studies that lie exactly on the indifference curve, power would increase just as fast by increasing the number of labs as by increasing the number of participants per lab.

As you can see, most of the time there is a greater benefit (i.e., statistical power will increase faster) to increasing the number of labs. This is especially true if there is a lot of Lab variance. But it tends to be true even when there is little Lab variance. The cases where it makes more sense to increase the number of participants per lab are when you already have a large number of labs but a small number of participants per lab. And let’s face it, your multi-lab replication project is probably not in this part of the space.

Increasing the number of participants per lab — but holding constant the number of labs — will not, in general, cause statistical power to approach 100%

This one is important. We tend to assume that as long as we continue to recruit participants, eventually we will have high statistical power. So even if we didn’t recruit as many labs for our project as we had hoped, we should be able to compensate for this by just recruiting more participants per lab — right? Unfortunately it isn’t so. The truth is that if we hold constant the number of labs, statistical power does not approach 100% as the number of participants per lab approaches infinity. Instead, power approaches some maximum attainable power value that can possibly be quite small, depending on the effect size, number of labs, and the Lab variance.

This can actually be seen pretty easily by considering again the expression of the noncentrality parameter:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

In the limit as $n$ approaches infinity, the term in the denominator involving $E$ disappears, but the term involving $L$ does not, so the whole noncentrality parameter converges to a finite and possibly small value. Here’s what the situation looks like graphically for a handful of representative values of the variables:

The curves in each panel are unlabeled because, honestly, they don’t really matter. (If you want to know the power values for specific parameter combinations, the first graph is better for that anyway.) The point is just to show that when we increase the number of labs, power does eventually approach 100%, and the values of the other variables simply affect how quickly this happens. But when we increase the number of participants per lab — but, crucially, hold constant the number of labs — power sometimes approaches a value close to 100%, but often does not. The maximum attainable power will tend to be low when (a) the effect size is small, (b) the number of labs is small, (c) the Lab variance is high.

Conclusion

The main conclusion is pretty simple. Multi-lab replication projects are great, and you should do them… but when you’re designing them, you should really try hard to recruit as many labs to collaborate in the project as you possibly can. You actually don’t need that many participants from each lab — the first figure shows that unless the Lab variance is tiny, you get quite diminished returns by recruiting more than 100 or so participants per lab — so perhaps this is a selling point that you can use to convince more labs to join your project (“we would only need you to run a few dozen participants!”).

If you want to do your own power analyses for multi-lab designs like this, or for other more complicated designs, you can use my PANGEA app.

Five different “Cohen’s d” statistics for within-subject designs

Jeff Rouder poses an “effect size puzzler” where the puzzle is simply to compute a standardized effect size for a simulated dataset where subjects make 50 responses in each of 2 conditions. He offers a sweatshirt prize (???) to anyone who can do so “correctly.”

(Update: Jeff posted his answer to the effect size puzzler here.)

As it happens, I’ve been meaning for a while to write a blog post on issues with computing d-like effect sizes (or other standardized effect sizes) for within-subject designs, so now seems like a good time to finally hammer out the post.

Jeff didn’t actually say anything to restrict us to standardized mean difference type measures (as opposed to, say, variance explained type measures), and we can only guess whether the “correct” effect size he has in mind is a d-like measure or an $R^2$-like measure or what. But here I’ll focus on d-like measures, which are perhaps the most popular for studies with categorical predictors, and offer plenty of complications that I think are often under-appreciated (and it seems Jeff would agree).

What I’ll show here is that there are at least 5 different and non-equivalent ways that people might compute a d-like effect size (which they would invariably simply call “Cohen’s d”) for Jeff’s dataset, and the resulting effect sizes range from about 0.25 to 1.91. I’ll compare and contrast these procedures and ultimately choose one that I think is the least crazy, if you must compute a standardized effect size (more on that later). When I read a paper that reports “Cohen’s d” for a within-subject design, I usually have no idea which of these 5 procedures the authors actually applied unless I try to reproduce the effect size myself from some of the descriptives in the paper, which is often not possible.

Jeff’s dataset can be downloaded here.  Here’s some code to load it into R, examine it, print the two conditions means, and save the mean difference as
md:

d: Classical Cohen’s d

This is essentially a straightforward application of the formula provided by Cohen himself:

$d=\frac{M_1-M_2}{\sigma}$

The numerator is the mean difference. Cohen never actually gives a clear, unambiguous definition of $\sigma$ in the denominator, but it is usually taken to be the pooled standard deviation, that is, the square root of the average variance in each condition (assuming equal sample sizes in both conditions). In R code:

(d <- md / with(dat, sqrt(mean(tapply(rt, cond, var)))))
# 0.2497971


The crucial thing to recognize about applying the classical Cohen’s d is that it deliberately ignores information about the design of the study. That is, you compute d the same way whether you are dealing with a between-subjects, within-subjects, or mixed design. Basically, in computing d, you always treat the data as if it came from a simple two-independent-groups design. I don’t want to get bogged down by a discussion of why that is a good thing at this point in the post—I’ll get into that later on. For now I just note that, with this effect size, within-subject designs tend to be powerful not because they lead to larger effect sizes—if anything, the reverse is probably true, in that people elect to use within-subject designs when Cohen’s d is particularly small, for example in many reaction time studies—but rather because they allow us to efficiently detect smaller effect sizes due to removing irrelevant sources of variation from the denominator of the test statistic.

da: Cohen’s d after averaging over replicates

For this next way of computing a d-like effect size, the section heading pretty much says it all. In Jeff’s dataset, each participant makes a total of 100 responses, 50 in each condition. The computation of $d_a$ proceeds by first averaging over the 50 responses in each subject-by-condition cell of the experiment, so that each subject’s data is reduced to 2 means, and then applying the classical d formula to this aggregated data. In R code:

sub_means <- with(dat, tapply(rt, list(id, cond), mean))
(d_a <- md / sqrt(mean(diag(var(sub_means)))))
# 0.8357347


Richard Morey blogged about some weird things that can happen when you compute a standardized effect size this way. The basic issue is that, unlike classical Cohen’s d, $d_a$ does not ignore all the design information: $d_a$ will tend to be larger when there are more replicates, that is, when each subject responds more frequently in each condition.

dz: Standardized difference scores

A third way to compute a d-like effect size is to reduce each subject’s data to a single difference score—the mean difference between their responses in each condition—and then use the standard deviation of these difference scores as the denominator of d. Cohen actually discusses this statistic in his power analysis textbook (Cohen, 1988, p. 48), where he carefully distinguishes it from the classical Cohen’s d by calling it $d_z$. In R, we can compute this as:

(d_z <- md / sd(sub_means[,2] - sub_means[,1]))
# 1.353713


There is a straightforward relationship between $d_z$ and the test statistic: $t_w = d_z\sqrt{n}$, where $t_w$ is the paired-samples t-statistic from a within-subjects design and $n$ is the number of subjects. One might regard this as a virtue of $d_z$. I will argue below that I don’t think it’s a good idea to use $d_z$.

dt: Naive conversion from t-statistic

In a simple two-independent groups design, one can compute the classical Cohen’s d from the t-statistic using

$d=t_b\sqrt{\frac{2}{n}}$,

where $t_b$ is the independent-samples t-statistic for a between-subjects design and $n$ is the number of subjects per group. Many, many authors over the years have incorrectly assumed that this same conversion formula will yield sensible results for other designs as well, such as in the present within-subjects case. Dunlap et al. (1996) wrote a whole paper about this issue. One might suppose that applying this conversion formula to $t_w$ will yield $d_z$, but we can see that this is not the case by solving the equation given in the previous section for $d_z$, which yields $d_z=\frac{t_w}{\sqrt{n}}$. In other words, naive application of the between-subjects conversion formula yields an effect size that is off by a factor of $\sqrt{2}$.

To compute $d_t$ in R:

t_stat <- t.test(sub_means[,2] - sub_means[,1])\$statistic
(d_t <- t_stat*sqrt(2/nrow(sub_means)))
# 1.914439


dr: Residual standard deviation in the denominator

Finally, one can compute a d-like effect size for this within-subject design by assuming that the $\sigma$ in the classical Cohen’s d formula refers to the standard deviation of the residuals. This is the approach taken in Rouder et al. (2012) on Bayes factors for ANOVA designs. In between-subjects designs where each subject contributes a single response, this is equivalent to classical Cohen’s d. But it differs from classical Cohen’s d in designs where subjects contribute multiple responses.

To compute $d_r$ in R, we need an estimate of the residual standard deviation. Two ways to obtain this are from a full ANOVA decomposition of the data or by fitting a linear mixed model to the data. Here I do it the mixed model way:

options(contrasts=c("contr.helmert","contr.poly"))
library("lme4")
mod <- lmer(rt ~ cond + (cond||id), data=dat)
summary(mod)
# Random effects:
#  Groups   Name        Variance  Std.Dev.
#  id       (Intercept) 0.0026469 0.05145
#  id.1     cond        0.0001887 0.01374
#  Residual             0.0407839 0.20195
# Number of obs: 2500, groups:  id, 25
#
# Fixed effects:
#             Estimate Std. Error t value
# (Intercept) 0.779958   0.016402   47.55
# cond        0.052298   0.008532    6.13


The residual standard deviation is estimated as 0.20195, which gives us $d_r$0.259. It turns out that, for this dataset, this is quite close to the classical Cohen’s d, which was 0.25. Basically, classical Cohen’s d is equivalent to using the square root of the sum of all the variance components in the denominator1,2, rather than just the square root of the residual variance as $d_r$ uses. For this simulated dataset, the two additional variance components (intercepts and slopes varying randomly across subjects) are quite small compared to the residual variance, so adding them to the denominator of the effect size does not change it much. But the important thing to note is that for other datasets, it is possible that $d$ and $d_r$ could differ dramatically.

So which one should I compute?

Contrary to Jeff, I don’t really think there’s a “correct” answer here. (Well, maybe we can say that it’s hard to see any justification for computing $d_t$.) As I put it in the comments to an earlier blog post:

Basically all standardized effect sizes are just made-up quantities that we use because we think they have more sensible and desirable properties for certain purposes than the unstandardized effects. For a given unstandardized effect, there are any number of ways we could “standardize” that effect, and the only real basis we have for choosing among these different effect size definitions is in choosing the one that has the most sensible derivation and the most desirable properties relative to other candidates.

I believe that classical Cohen’s d is the option that makes the most sense among these candidates. Indeed, in my dissertation I proposed a general definition of d that I claim is the most natural generalization of Cohen’s d to the class of general ANOVA designs, and I considered it very important that it reduce to the classical Cohen’s d for datasets like Jeff’s. My reasoning is this. One of the primary motivations for using standardized effect sizes at all is so that we can try to meaningfully compare effects from different studies, including studies that might use different designs. But all of the effect size candidates other than classical Cohen’s d are affected by the experimental design; that is, the “same” effect will have a larger or smaller effect size based on whether we used a between- or within-subjects design, how many responses we required each subject to make, and so on. Precisely because of this, we cannot meaningfully compare these effect sizes across different experimental designs. Because classical Cohen’s d deliberately ignores design information, it is at least in-principle possible to compare effect sizes across different designs. Morris and DeShon (2002) is a nice paper that talks about these issues. Bakeman (2005) also has some great discussion of essentially this same issue, focused instead on “variance explained”-type effect sizes.

Although I don’t really want to say that there’s a “correct” answer about which effect size to use, I will say that if you choose to compute $d_z$$d_r$, or anything other than classical Cohen’s d, just please do not call it Cohen’s d. If you think these other effect sizes are useful, fine, but they are not the d statistic defined by Cohen! This kind of mislabeling is how we’ve ended up with 5 different ways of computing “Cohen’s d” for within-subjects designs.

Finally, there is a serious discussion to be had about whether it is a good idea to routinely summarize results in terms of Cohen’s d or other standardized effect sizes at all, even in less complicated cases such as simple between-subjects designs. Thom Baguley has a nice paper with some thoughtful criticism of standardized effect sizes, and Jan Vanhove has written a couple of nice blog posts about it. Even Tukey seemed dissatisfied with the enterprise. In my opinion, standardized effect sizes are generally a bad idea for data summary and meta-analytic purposes. It’s hard to imagine a cumulative science built on standardized effect sizes, rather than on effects expressed in terms of psychologically meaningful units. With that said, I do think standardized effect sizes can be useful for doing power analysis or for defining reasonably informative priors when you don’t have previous experimental data.

Footnotes

1 In general, this is really a weighted sum where the variance components due to random slopes must be multiplied by a term that depends on the contrast codes that were used. Because I used contrast codes of -1 and +1, it works out to simply be the sum of the variance components here, which is precisely why I changed the default contrasts before fitting the mixed model. But be aware that for other contrast codes, it won’t simply be the sum of the variance components. For more info, see pp. 20-21 of my dissertation.

2 If you actually compute classical Cohen’s d using the square root of the sum of the estimated variance components, you will find that there is a very slight numerical difference between this and the way we computed d in the earlier section (0.2498 vs. 0.2504). These two computations are technically slightly different, although they estimate the same quantity and should be asymptotically equivalent. In practice the numerical differences are negligible, and it is usually easier to compute d the previous way, that is, without having to fit a mixed model.

Don’t fight the power (analysis)

Researchers often feel uneasy about using power analysis to design their actual experiments because of uncertainty about the effect size in the study to be run. A common sentiment that one hears goes something like:

“I can’t do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn't have to run the study in the first place!”

The implication of this view is that, unless one has actually done experiments in the past that are pretty similar to the one being considered, there is otherwise no justifiable basis for making any particular assumptions about the effect size in the present study. In order to have a good idea about the effect size, the argument goes, we have to actually run the study, at which point obviously the power analysis is no longer needed. Convinced by this reasoning, many researchers throw up their hands, decide that power analysis will not be useful here or perhaps ever, and just plan instead on collecting some loosely conventional sample size that depends on their research area, but is usually something like 20-30 observations per cell of the design. In other words, they fight the power.

I’m here to convince you that fighting the power is a self-defeating research habit.

You know more than you think before the study

The first premise of the argument against power analysis is that we know little or nothing about the effect size before the study has been run. On the contrary. In the year 2015 we can benefit from decades of meta-analyses that have summarized the typical effect sizes found in almost any imaginable corner of the research literature. We even have meta-meta-analyses of those meta-analyses. The effect size in your future study is likely to resemble the effect sizes of the past, and luckily for us, the meta-analytic data on typical effect sizes are vast.

I want to illustrate just how good our situation really is by considering what is probably our worst case scenario in terms of study design: The case where we know absolutely nothing whatsoever about the study to be run except that the topic matter could broadly be classified as “social psychology” or some related field. In that case, we can use the data from Richard, Bond, and Stokes-Zoota (2003), who conducted a meta-analysis of meta-analyses in the field of social psychology to determine the range of typical effect sizes across the entire field, involving some 25,000 individual studies published over 100 years in diverse research areas. While the focus of this meta-meta-analysis was the field of social psychology, I believe there is little reason to expect the distribution of typical effect sizes to be appreciably different in other areas of psychology, such as cognitive psychology (if you are aware of large-scale meta-analytic data to the contrary, please let me know). Anyway, the figure below summarizes the distribution of effect sizes that they found.

Their meta-analysis actually examined the effects on the Pearson’s r (correlation) scale, and the bumpy density curve in the left panel shows their aggregated data (copied/pasted from their Figure 1). The smooth curve overlaying that data is the best-fitting beta distribution1, on which the percentiles and other statistics are based, and the curve in the right panel is based on applying a standard conversion formula to the smooth curve in the left panel2.

What this shows is that, in the absence of any other information about the study you are about to run, a pretty reasonable assumption about the effect size is that it is equal to the historical average: = 0.21 or d = 0.45. Or you could use the median, or be conservative and go with the 30th percentile, or whatever you want. The point is, we have enough information to make a pretty well-informed decision even if we have no specific information at all about the actual study.

Of course, in most cases in the real world, you probably do know something about the study you are about to run. In almost all cases, that knowledge will allow you to make an even more refined estimate of the effect size, either by finding a meta-analysis that looks more specifically at effects that are conceptually similar to yours (you could even start with Richard et al., who helpfully break down the average effect size in social psychology by broad research area), or just by starting with the aggregate historical estimate and adjusting from there based on how you think your study differs from the average study of the past.

You know less than you think after the study

The argument that opened this post pointed out that we don’t know the effect size before the study has been run. That’s true, but of course, we don’t know the effect size after the study has been run either. Instead what we have is some data from which we can construct an estimate of the effect size. Realizing this allows us to ask the quantitative question: Just how good of an effect size estimate do we have at the end of a typically-sized experiment? If our estimate of the effect size after an initial study is not much better than what we could already surmise based on the historical, meta-analytic data, then it doesn’t make a lot of sense to trust the former a lot more than the latter.

Consider a typical study in which we compare two independent groups with n=30 participants per group. Below I’ve simulated some data in which the standardized mean difference between the two groups is exactly equal to the historical average of d = 0.45. The figure below shows a bootstrap sampling distribution of the effect size in this situation3. If we ignore all prior information that we have about typical values of the effect size, as many researchers routinely do, then this sampling distribution summarizes everything we know about the effect size after running a pretty typical study.

Compare this distribution to the right panel of the first Figure from above, which showed our prior knowledge about the likely values of d. In terms of how much information they carry about d, the two distributions are really not that different. The sampling distribution is slightly less variable—it has a standard deviation of 0.27 rather than 0.37—but this difference in variability is quite hard to see from visual inspection.

Living with uncertainty

Whether we use historical data or data from previous experiments we have run, there will always be some uncertainty about the effect size. So there are a range of plausible assumptions we could make about the effect size when doing a power analysis, and these different assumptions imply different sample sizes to collect in the study. In many cases, the uncertainty will be pretty high, so that the range of recommended sample sizes will be quite wide, a fact which many researchers find disconcerting.

Uncertainty is a fact of scientific life and should be no cause for dismay. We have all (hopefully) learned to be comfortable with uncertainty in other aspects of the research process. Unfortunately, many researchers seem oddly unwilling to accept even modest uncertainty in the planning phase of the research. In responding to such a view, it’s hard to put it better than @gung did in this answer on Cross Validated:

“Regarding the broader claim that power analyses (a-priori or otherwise) rely on assumptions, it is not clear what to make of that argument. Of course they do. So does everything else. Not running a power analysis, but just gathering an amount of data based on a number you picked out of a hat, and then analyzing your data, will not improve the situation.”

Uncertainty is there whether we like it or not. We should try to make the best design decisions possible in light of that uncertainty. Power analysis is our best tool for doing so. Before I close the post, let me clarify: In my opinion, there is nothing wrong with planning experiments based on rules of thumb. I acknowledge that much of the time it won’t make sense to do a formal power analysis for each and every experiment, because often we won’t have a lot of specific information about the particular study we’re about to run beyond the kind of general information we have about the typical experiments we tend to run. My point is that we should apply statistically well-informed rules of thumb that are based on historical, meta-analytic data, and are calibrated to work pretty well in a range of realistic research situations—not dubious heuristics like an n=30 rule. One of the most important functions of power analysis is to help us construct such good rules of thumb.

1 For those interested, the parameters of this beta distribution are about $\alpha=1.34, \beta=5.03$.

2 The correct conversion from Pearson’s r to Cohen’s d depends on the assumed proportion of participants in the two groups. The statistics that I present in the figure are based on the standard formula that assumes the group sizes are equal. I experimented with various ways of relaxing that assumption in a realistic manner, but ultimately found that the difference was negligible unless one assumes the group sizes tend to be markedly and unrealistically unequal.

3 The mean shown in the figure is the mean of the bootstrap distribution. This mean is slightly higher than the assumed value of 0.45 because the sampling distribution of d is slightly positively skewed, reflecting the fact that sample d is a slightly positively biased estimate of population d.

4 Thanks to Katie Wolsiefer for this figure caption, which is way better than my original.

Follow-up: What about Uri’s 2n rule?

This post is a quick follow-up to yesterday’s post on sample size rules. Basically I thought it was a little too long to go in the comments section so here it is.

Some people on twitter (and in my blog comments) remarked that my conclusion appears to fly in the face of some things Uri Simonsohn wrote on a similar topic not too long ago. Briefly, Uri writes of a Study 1 in which there are two independent groups (A1 and B1) and some non-zero effect size, and a Study 2 in which we move to a 2×2 design in which the difference between conditions A1 and B1 is the same size as before, but the difference between the other two conditions A2 and B2 is exactly 0, and we are now testing the interaction effect. Uri concludes: “To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.” Let’s call this Uri’s 2n rule*.

I thought Uri’s post was cool and certainly don’t think my point contradicts the point he made. The important thing to note here is that the effect size that Uri assumes to be constant in both studies is just the A1-B1 difference. But that’s not the effect we’re actually testing in Study 2: we’re testing the interaction effect, i.e., the A1 – B1 – A2 + B2 difference. And there is no assumption that the effect size for the interaction in Study 2 is equal to the effect size for the A1-B1 difference in Study 1. In fact, the 2n rule depends on the relevant effect size in Study 2 being half the relevant effect size in Study 1. That’s why you must increase the sample size when moving to Study 2; for the situation Uri’s talking about, the relevant effect size gets cut in half in Study 2. In my post I’m talking about cases where the relevant effect size is held constant. If the relevant effect size is held constant, then adding cells to the design has a negligible impact on power.

Numerical example

Consider the following table of cell means and standard deviations (the latter in parentheses).

Let’s say there are 20 subjects in each cell. Now if Study 1 involves only groups A1 and B1 (so that there is N=40 in total) then the power of the study is 34%. And if Study 2 involves all four groups (so that there is N=80 in total), then the power to detect the interaction effect is only 20%. But if we double the sample size (so that there is n=40 in each cell and N=160 in total), then the power to detect the interaction effect is 34%, just as it was in Study 1. This is the 2n rule that Uri wrote about, and I don’t dispute it.

But now let’s look at the standardized effect sizes for the two studies. We use Cohen’s d, defined as $d=(\mu_1-\mu_2)/\sigma$, where $\mu_1$ and $\mu_2$ are the two means being compared and $\sigma$ is the pooled standard deviation, or equivalently, the root mean squared error (RMSE). Computing this in Study 1 is straightforward since there are only two groups; we have $d=\frac{0.5-0}{1} = 0.5$.

Computing d in Study 2 is a little less clear since in that case we have four groups and not two. We saw above that the relevant mean difference is the A1 – B1 – A2 + B2 difference. The key here is to realize that the interaction effect essentially still comes down to a comparison of two groups: We are comparing the A1 and B2 groups (which have coefficients of +1 in this difference score) against the A2 and B1 groups (which have coefficients of -1). So the two relevant means to use in computing d are the mean of the A1 and B2 means, and the mean of the A2 and B1 means. This gives us $d=(\frac{0.5+0}{2}-\frac{0+0}{2})/1=0.25$ for the interaction effect. In other words, the relevant effect size in Study 2 is half of what the relevant effect size in Study 1 was.

It’s easy to show this symbolically. Let $d_1$ be the effect size in Study 1 and $d_2$ be the effect size in Study 2. Then, starting with the classical definition of d,

$d_2=\frac{\mu_1-\mu_2}{\sigma}=\frac{\frac{\mu_{A1}+\mu_{B2}}{2}-\frac{\mu_{A2}+\mu_{B1}}{2}}{\sigma}=\frac{(\mu_{A1}-\mu_{B1})-(\mu_{A2}-\mu_{B2})}{2\sigma}$.

In this example we’ve assumed the $\mu_{A2}-\mu_{B2}$ difference is 0, so that

$d_2=\frac{\mu_{A1}-\mu_{B1}}{2\sigma}=\frac{d_1}{2}$.

* Robert Abelson wrote about a similar phenomenon which he called the 42% rule: If the A1-B1 difference just reaches significance at p=.05 with size d, then the A2-B2 difference has to be at least 42% as large as d in the opposite direction for the interaction in Study 2 to reach p=.05 with the same sample size.

Think about total N, not n per cell

The bottom line of this post is simple. There are lots of rules of thumb out there for minimum sample sizes to use in between-subjects factorial experiments. But they are virtually always formulated in terms of the sample size per cell, denoted as small n. For example, one ubiquitous informal rule suggests using about n=30 or so. The problem is that cursory power analysis shows that rules based on small n don’t really make sense in general, because what is more directly relevant for power is the total sample size, denoted as big N. So if you must rely on rules of thumb—which I actually don’t have any big problems with—try to use sample size rules based on big N, not small n.

The example

The idea of writing about this came from a recent interaction with a colleague, which interaction I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of N=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s d = 0.5.

This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about n=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only n=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not decrease. So their power calculation must be flawed, right?”

The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small n.

Big N vs. small n

Mathematically, it’s a little tricky to say that power is a function of N rather than n. After all, one can write the relevant equations in terms of N or in terms of n, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of N than it is a function of n, so that rules of thumb based on N are far more useful than rules of thumb based on n.

One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on N and one based on n, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.

Here I’ve chosen an N=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on N, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on n, power remains a more complicated, joint function of the effect size and the factorial structure.

Final caveat / technical footnote

Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called 2k factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast—which we might call N’—works pretty well as a simple approximation in most cases. The more technically correct (but more complicated) procedure depends on the product of N and the variance of the contrast; see this working paper of mine for more details.

The hierarchical ordering principle

(Background reading: Jeff Rouder’s blog post on the “dominance principle”, which this post is mainly a response to.)

How many subjects should we routinely aim to recruit when we conduct experiments using within-subjects designs? It’s well known that within-subjects experiments tend to be more sensitive than between-subjects experiments for detecting mean differences of a comparable size, so one reasonable albeit vague answer is: “fewer than we would recruit if it were a between-subjects experiment.”

Interestingly, there is no logically or mathematically necessary reason why this truism of experimental design must be true in general. That is, it’s theoretically possible for the within-subjects version of an experiment to be no more powerful than the between-subjects version, or even for it to be less powerful. But, as a matter of empirical fact, the within-subjects experiment typically is more powerful, often substantially so. This blog post is about different possible explanations for why this is usually true.

Distinct Sources of Random Variation

Before we can explain the observation described above, we need to clarify in more precise terms what it is we’re trying to explain. To do so, we must clearly distinguish the estimable sources of random variation that perturb subjects’ responses in a simple within-subjects experiment. To illustrate, below is a small, made-up batch of data from a simple design involving 10 subjects measured both before (“pre-test”) and after (“post-test”) undergoing some Treatment, with each subjects’ two responses tied together.

Subject variance: Some subjects are “high responders” who tend to have high scores on average, other subjects are “low responders” who tend to have low scores. Subject variance is captured by the variance of the subject means. In terms of the plot above, this corresponds to the variance of the mid-points of the 10 lines.

Subject-by-Treatment interaction variance: Some subjects tend to show large Treatment effects (big increases from pre-test to post-test) while other subjects tend to show small or even negative Treatment effects. In terms of the plot above, imagine that we characterized the slope of each line with a number giving the difference from pre-test to post-test. The Subject-by-Treatment interaction variance is captured by the variance of those slopes or differences.

Error variance: Imagine that rather than measuring each subject only once at pre-test and once at post-test, we had measured each subject multiple times before and after treatment. Naturally we would still expect some random fluctuation in a particular subject’s pre-test scores and some fluctuation in that subject’s post-test scores, due to measurement error or whatever else. This variation is error variance—it cannot be accounted for by any other variable measured in the experiment. In the present example, where we supposed that subjects were measured only a single time at pre-test and post-test, we still imagine that the responses are perturbed by some amount of error variance. However, without multiple measurements at each time-point, we cannot statistically distinguish the error variance from the SxT interaction variance; the design confounds them.

The Hierarchical Ordering Principle

Now we’re in a position to more precisely describe the empirical pattern discussed at the beginning of this post. Ready? Here we go.

In a between-subjects experiment, the relevant sources of variation—that is, the sources of variation that end up in the standard error of the overall Treatment effect—are Error variance and Subject variance. (The Subject-by-Treatment variance can’t be estimated in that case because we only observed each subject at pre-test or post-test, but not both.) But in a within-subjects experiment, the relevant sources of variation are Error variance and Subject-by-Treatment variance. (The variance in the subject means can be estimated in this case, but it is no longer relevant, since intuitively all that matters is each subject’s pre-test to post-test difference.)

Now, we assume the size of the effect being studied is the same regardless of whether we used a between- or within-subjects design. So if the within-subjects experiment is more powerful than the between-subjects experiment, this implies that the Subject-by-Treatment interaction variance is smaller than the Subject variance. This is exactly what the “hierarchical ordering principle” says:

“Hierarchical ordering […] is a term denoting the observation that main effects tend to be larger on average than two-factor interactions, two-factor interactions tend to be larger on average than three-factor interactions, and so on” (Li, Sudarsanam, & Frey, 2006, p. 34).

Empirically speaking, this seems to be true much of the time. But why? Below I examine three arguments for why we might expect to observe hierarchical ordering more often than not. As you read through the arguments below, remember what we’re about here: We’re trying to explain hierarchical ordering because this would in turn be an explanation for why within-subjects designs are typically more powerful than between-subjects designs. The arguments below are far from decisive, but they are interesting and plausible.

1. With small effects, linearity dominates

One possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly due to the range over which experimenters typically explore factors. In the limit that experimenters explore small changes in factors and to the degree that systems exhibit continuity of responses and their derivatives, linear effects of factors tend to dominate. Therefore, to the extent that hierarchical ordering is common in experimentation, it is due to the fact that many experiments are conducted for the purpose of minor refinement rather than broad-scale exploration” (p. 34)

Huh? Actually this is not too hard to understand if we consider the situation graphically. The basic idea is illustrated in the figure below.

Imagine that we’re studying some dependent variable Y as a function of two independent variables X and Z. The effects that X and Z have on Y might have any functional form: they could have a joint, multiplicative (interactive) effect on Y, they could individually have diminishing or asymptoting effects on Y, or any other crazy, nonlinear sort of effects we might imagine. I set up the example above so that the true relationship is actually an ideal interaction: Y = XZ.

The purpose of the plot is to illustrate that for any particular point in the predictor space (any pair of X and Z values), in a sufficiently small neighborhood around that point, the function is approximately linear. In the plot, we pick a point and zoom in on it closer and closer so that the local neighborhood we’re considering becomes smaller and smaller. As we do so, we can see that the function immediately around that point comes to resemble more and more closely a tangent plane to the response function.

In an experiment, our goal is to push around X and Z to some extent and observe the corresponding changes in Y. But if we’re studying small effects—so that we push around predictor values X and Z only a small amount from condition to condition—then we are staying within a small neighborhood, in which case any higher-order, nonlinear effects of X and Z (such as interactions) are going to account for little of the observed variance in Y compared to the linear effects of X and Z, even if the “true” response curve is highly nonlinear across a broader range of predictor values. Of course, in psychological research, small effects are extremely common, maybe even the norm. To the extent that this is true, the argument goes, we should expect higher-order terms like Subject-by-Treatment interaction variance to account for relatively little variance in the outcome, and lower-order terms like Subject variance to account for relatively more variance.

1. Experimenters transform variables so that simple effects are predicted

Another possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly determined by the ability of experimenters to transform the inputs and outputs of the system to obtain a parsimonious description of system behavior […] For example, it is well known to aeronautical engineers that the lift and drag of wings is more simply described as a function of wing area and aspect ratio than by wing span and chord. Therefore, when conducting experiments to guide wing design, engineers are likely to use the product of span and chord (wing area) and the ratio of span and chord (the aspect ratio) as the independent variables” (p. 34).

This process described by Li et al. (2006) certainly happens in psychology as well. For example, in priming studies in which participants respond to prime-target stimulus pairs, it is common in my experience for researchers to code the “prime type” and “target type” factors in such an experiment so that the classic priming effect is represented as a main effect of prime-target congruency vs. incongruency, rather than as a prime type × target type interaction. And in social psychology, there are many studies that involve a my-group-membership × your-group-membership interaction effect, which is often better characterized and coded as a main effect of ingroup (group congruency) vs. outgroup (group incongruency). It seems natural to expect individual differences in these more robust effects to have greater variance than individual differences in the incidental effects, which are now coded as higher-order interactions, and this would give rise to hierarchical ordering in the random variance components. Other transformations that could have similar effects are things like analyzing the logarithm of response times, the square-root of count variables, etc.

1. Nested designs confound higher-order effects into lower-order effects

A final argument is that lower-order effects, like the Subject variance in our simple within-subjects design, often implicitly contain more confounded sources of variation than higher-order effects. For example, we saw above that in the between-subjects version of our experiment, the Subject-by-Treatment interaction variance could not be estimated. It turns out that this variance is implicitly confounded with the Subject variance (actually, all three sources of variance are confounded if we observe each subject only once) so that the Subject variance we observe—let’s call it var(S’)—is really the sum of the Subject variance, var(S), and the Subject-by-Treatment variance, var(SxT), so that var(S’) = var(S) + var(SxT).

This confounding may not seem obvious at first, but it’s pretty easy to see graphically. Consider the totally contrived dataset below. These data are generated under a within-subjects design where the subjects have exactly 0 variance in their mean responses. But imagine that instead of observing the complete data in a within-subjects design, we observed only the pre-test scores for half of the subjects and only the post-test scores for the other half of the subjects, in a between-subjects design. It would appear in that case that some subjects are high responders and others low responders—that is, it would appear that there is stable Subject variance var(S)—but obviously this is totally due to unobserved variance in the subject slopes, var(SxT).

The same idea extends to higher-order interaction effects. If we have a random two-way interaction that we can conceive of as being part of an incompletely observed three-way interaction, then we can suppose that the variance in that two-way interaction implicitly contains the variance of that unobserved three-way interaction. If we then assume that there’s a limit to how high up the design hierarchy we can plausibly take this process, this implies that, in designs in which there is some nesting of factors (including, notably, all between-subjects designs), lower-order effects will tend to have relatively greater variance by virtue of being implicit sums of a greater number of confounded higher-order interaction effects.

References

Li, X., Sudarsanam, N., & Frey, D. D. (2006). Regularities in data from factorial experiments. Complexity, 11(5), 32–45.