# Designing multi-lab replication projects: Number of labs matters more than number of participants

In a multi-lab replication project, several teams of investigators join forces to run the same study (or studies) concurrently at different research sites. The best examples of this in psychology are the various Many Labs projects. There are lots of reasons why multi-lab replication projects are great. For example, they allow us to estimate and potentially model any between-site variability in the effect size, so we learn more about the generality of the effect. Another reason is that they can have greater statistical power than single-lab studies, as long as they involve a large enough sample of labs. The purpose of this blog post is to underscore that last point: how many labs it takes to ensure high statistical power.

## The verbal (intuitive?) explanation

We’re used to thinking about the number of participants as being, apart from the effect size, the chief factor in determining the statistical power of a study. And most of the time, in the kinds of single-lab studies we tend to run, this is basically true. Other factors matter as well — for example, the number of times we observe each participant, and the proportion of observations in each cell of the design — but as long as these other factors are within a typically sane range, they tend not to matter as much as the number of participants.

So it is perhaps natural that we apply this same heuristic to multi-lab replication projects. We reason that even if each lab can only recruit, say, 100 participants, and even if we can only recruit, say, 5 labs, this will give us a total of 500 participants, so statistical power will still be quite high! Right?

But here’s the thing. The reason the number of participants has such a big impact on power in the single-lab case is that, in those cases, the participants are the highest-level units or clusters in the design. That is to say, we can potentially have multiple observations of each participant — for example, we collect 50 reaction times from each participant — but these observations are clustered within participants, which are the high-level units. It turns out that the proper generalization of the “number of participants is important” heuristic is, in fact, “the number of highest-level units is important — the lower-level units, not as much.”

So now consider a multi-lab replication project. Here, the participants are clustered in labs. So the labs are the highest-level units. Remember the earlier example about having a study with 5 labs, each with 100 participants? In terms of its statistical power, this would be roughly like running a single-lab study with 5 participants, each of whom contributes 100 reaction times. In other words, it wouldn’t be great.

## The quantitative view

Let’s look at some actual power results. We consider a simple multi-lab design where we have $m$ labs, each of which recruits $n$ participants that are divided into two conditions ($n/2$ participants per condition), and we observe each participant only a single time. In other words, we have a simple two-group between-subject experiment that is being replicated at $m$ different labs, and the labs have random effects. The key quantity for determining the power of the study is $\delta$, the noncentrality parameter (for a non-central t distribution). It looks like this:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

where $d$ is the effect size (Cohen’s d), $E$ is the proportion of random variation due to error variance (i.e., the ratio of the Error variance to the [weighted] sum of all the variance components), and $L$ is the proportion of random variation due to Lab variance (actually, it’s the proportion of Lab-by-Condition interaction variance, but I’m calling it Lab variance for short). Statistical power is pretty much determined by the noncentrality parameter; there’s technically also some influence of the degrees of freedom, but that tends not to matter much as long as it is above, say, 20 or so. So basically, we can understand the power of this multi-lab replication design by considering this noncentrality parameter expression.
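To make the expression concrete, here is a minimal Python sketch that evaluates $\delta$ and converts it to an approximate power value. The input values (d = 0.4, E = 0.90, L = 0.10) are hypothetical, and the power function uses a normal approximation that ignores the degrees of freedom, so it will tend to overstate power when the number of labs is small. It is not the calculation behind the figures in this post.

```python
from math import sqrt
from statistics import NormalDist


def noncentrality(d, m, n, E, L):
    """Noncentrality parameter for m labs with n participants per lab."""
    return d / (2 * sqrt(E / (m * n) + L / m))


def approx_power(delta, alpha=0.05):
    """Two-sided power under a normal approximation (degrees of freedom ignored)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta - crit) + z.cdf(-delta - crit)


# Hypothetical inputs: the same 500 participants arranged two ways.
few_labs = noncentrality(d=0.4, m=5, n=100, E=0.90, L=0.10)
many_labs = noncentrality(d=0.4, m=25, n=20, E=0.90, L=0.10)

print("5 labs x 100:", round(few_labs, 2), round(approx_power(few_labs), 2))
print("25 labs x 20:", round(many_labs, 2), round(approx_power(many_labs), 2))
```

Under these made-up inputs, the same 500 participants give much higher approximate power when spread across 25 labs than when concentrated in 5.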

First, let’s just plug a range of plausible values into the variables comprising $\delta$ and see what values of statistical power they imply. Here’s a series of contour plots where we vary the variables within plausible ranges.

The middle panel represents what I think are the most plausible values. There are a couple of interesting things to point out about these power results. The first is that…

### Increasing the number of labs usually raises power more quickly than increasing the number of participants per lab

The way to see this in the graphs is to consider the angle of the contours in each plot. More specifically, for any given point (i.e., pair of sample sizes) in any of the plots, consider the direction in which we would want to step to increase power fastest. For most parameter combinations, the path of steepest ascent up the power surface goes up along the y-axis (number of labs) more than it goes sideways along the x-axis (participants per lab). This is especially true when there is a lot of Lab variance ($L=10\%$, in the right-hand column), but is still usually true when there is little Lab variance ($L=1\%$, in the left-hand column).

There is another way of visualizing this that makes the point clearer. It uses the idea of indifference curves: technically speaking, curves where the rate of change in the noncentrality parameter w.r.t. the number of labs is equal to the rate of change w.r.t. participants per lab. The way these indifference curves are plotted below, studies that are below the relevant indifference curve would get a greater power benefit from increasing the number of labs, and studies that are above the relevant indifference curve would get a greater power benefit from increasing the number of participants per lab. For studies that lie exactly on the indifference curve, power would increase just as fast by increasing the number of labs as by increasing the number of participants per lab.

As you can see, most of the time there is a greater benefit (i.e., statistical power will increase faster) to increasing the number of labs. This is especially true if there is a lot of Lab variance. But it tends to be true even when there is little Lab variance. The cases where it makes more sense to increase the number of participants per lab are when you already have a large number of labs but a small number of participants per lab. And let’s face it, your multi-lab replication project is probably not in this part of the space.
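The indifference curve can also be written down directly. Setting the partial derivative of $\delta$ with respect to $m$ equal to the partial derivative with respect to $n$ and solving gives $m = n + n^2 L / E$. This is my own derivation from the noncentrality expression above (treating $m$ and $n$ as continuous), not necessarily how the original plots were produced. A short sketch with hypothetical inputs:

```python
from math import sqrt


def noncentrality(d, m, n, E, L):
    """Noncentrality parameter for m labs with n participants per lab."""
    return d / (2 * sqrt(E / (m * n) + L / m))


def indifference_labs(n, E, L):
    """Number of labs at which one more lab and one more participant per lab help equally."""
    return n + n**2 * L / E


def better_to_add(m, n, E, L):
    """Which sample size raises the noncentrality parameter faster at (m, n)."""
    return "labs" if m < indifference_labs(n, E, L) else "participants per lab"


# Hypothetical design: 20 labs, 50 participants per lab, L = 1%.
print(indifference_labs(50, E=0.99, L=0.01))
print(better_to_add(20, 50, E=0.99, L=0.01))
```

Even with only 1% Lab variance, a design with 50 participants per lab sits below the curve unless it already has more than about 75 labs.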

### Increasing the number of participants per lab — but holding constant the number of labs — will not, in general, cause statistical power to approach 100%

This one is important. We tend to assume that as long as we continue to recruit participants, eventually we will have high statistical power. So even if we didn’t recruit as many labs for our project as we had hoped, we should be able to compensate for this by just recruiting more participants per lab — right? Unfortunately it isn’t so. The truth is that if we hold constant the number of labs, statistical power does not approach 100% as the number of participants per lab approaches infinity. Instead, power approaches some maximum attainable power value that can possibly be quite small, depending on the effect size, number of labs, and the Lab variance.

This can actually be seen pretty easily by considering again the expression of the noncentrality parameter:

$\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}$

In the limit as $n$ approaches infinity, the term in the denominator involving $E$ disappears, but the term involving $L$ does not, so the whole noncentrality parameter converges to a finite and possibly small value. Here’s what the situation looks like graphically for a handful of representative values of the variables:

The curves in each panel are unlabeled because, honestly, they don’t really matter. (If you want to know the power values for specific parameter combinations, the first graph is better for that anyway.) The point is just to show that when we increase the number of labs, power does eventually approach 100%, and the values of the other variables simply affect how quickly this happens. But when we increase the number of participants per lab while, crucially, holding constant the number of labs, power sometimes approaches a value close to 100%, but often does not. The maximum attainable power will tend to be low when (a) the effect size is small, (b) the number of labs is small, or (c) the Lab variance is high.
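The ceiling itself is easy to compute: dropping the $E$ term from the noncentrality expression leaves $\delta_{\max} = d\sqrt{m} / (2\sqrt{L})$. The sketch below checks this numerically for hypothetical values (d = 0.4, E = 0.90, L = 0.10, 5 labs), again using a normal approximation to power that ignores the degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist


def noncentrality(d, m, n, E, L):
    """Noncentrality parameter for m labs with n participants per lab."""
    return d / (2 * sqrt(E / (m * n) + L / m))


def max_noncentrality(d, m, L):
    """Limit of the noncentrality parameter as n goes to infinity."""
    return d * sqrt(m) / (2 * sqrt(L))


def approx_power(delta, alpha=0.05):
    """Two-sided power under a normal approximation (degrees of freedom ignored)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta - crit) + z.cdf(-delta - crit)


# Hypothetical values: d = 0.4, E = 0.90, L = 0.10, and only 5 labs.
for n in (50, 500, 5000, 50000):
    print(n, round(approx_power(noncentrality(0.4, 5, n, 0.90, 0.10)), 3))

ceiling = approx_power(max_noncentrality(0.4, 5, 0.10))
print("ceiling:", round(ceiling, 3))
```

With these made-up inputs the approximate ceiling is a little under 0.30 no matter how many participants each of the 5 labs recruits, and the exact value based on the t distribution would likely be lower still.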

## Conclusion

The main conclusion is pretty simple. Multi-lab replication projects are great, and you should do them… but when you’re designing them, you should really try hard to recruit as many labs to collaborate in the project as you possibly can. You actually don’t need that many participants from each lab — the first figure shows that unless the Lab variance is tiny, you get quite diminished returns by recruiting more than 100 or so participants per lab — so perhaps this is a selling point that you can use to convince more labs to join your project (“we would only need you to run a few dozen participants!”).

If you want to do your own power analyses for multi-lab designs like this, or for other more complicated designs, you can use my PANGEA app.

# Don’t fight the power (analysis)

Researchers often feel uneasy about using power analysis to design their actual experiments because of uncertainty about the effect size in the study to be run. A common sentiment that one hears goes something like:

“I can’t do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn't have to run the study in the first place!”

The implication of this view is that, unless one has actually done experiments in the past that are pretty similar to the one being considered, there is otherwise no justifiable basis for making any particular assumptions about the effect size in the present study. In order to have a good idea about the effect size, the argument goes, we have to actually run the study, at which point obviously the power analysis is no longer needed. Convinced by this reasoning, many researchers throw up their hands, decide that power analysis will not be useful here or perhaps ever, and just plan instead on collecting some loosely conventional sample size that depends on their research area, but is usually something like 20-30 observations per cell of the design. In other words, they fight the power.

I’m here to convince you that fighting the power is a self-defeating research habit.

#### You know more than you think before the study

The first premise of the argument against power analysis is that we know little or nothing about the effect size before the study has been run. On the contrary. In the year 2015 we can benefit from decades of meta-analyses that have summarized the typical effect sizes found in almost any imaginable corner of the research literature. We even have meta-meta-analyses of those meta-analyses. The effect size in your future study is likely to resemble the effect sizes of the past, and luckily for us, the meta-analytic data on typical effect sizes are vast.

I want to illustrate just how good our situation really is by considering what is probably our worst case scenario in terms of study design: The case where we know absolutely nothing whatsoever about the study to be run except that the topic matter could broadly be classified as “social psychology” or some related field. In that case, we can use the data from Richard, Bond, and Stokes-Zoota (2003), who conducted a meta-analysis of meta-analyses in the field of social psychology to determine the range of typical effect sizes across the entire field, involving some 25,000 individual studies published over 100 years in diverse research areas. While the focus of this meta-meta-analysis was the field of social psychology, I believe there is little reason to expect the distribution of typical effect sizes to be appreciably different in other areas of psychology, such as cognitive psychology (if you are aware of large-scale meta-analytic data to the contrary, please let me know). Anyway, the figure below summarizes the distribution of effect sizes that they found.

Their meta-analysis actually examined the effects on the Pearson’s r (correlation) scale, and the bumpy density curve in the left panel shows their aggregated data (copied/pasted from their Figure 1). The smooth curve overlaying that data is the best-fitting beta distribution (see footnote 1), on which the percentiles and other statistics are based, and the curve in the right panel is based on applying a standard conversion formula to the smooth curve in the left panel (see footnote 2).

What this shows is that, in the absence of any other information about the study you are about to run, a pretty reasonable assumption about the effect size is that it is equal to the historical average: r = 0.21 or d = 0.45. Or you could use the median, or be conservative and go with the 30th percentile, or whatever you want. The point is, we have enough information to make a pretty well-informed decision even if we have no specific information at all about the actual study.
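For readers who want to reproduce these summary numbers approximately, here is a rough sketch. It assumes the "standard conversion formula" is the usual equal-group-size conversion $d = 2r / \sqrt{1 - r^2}$, and it draws from a beta distribution with the parameters given in footnote 1. The seed and number of draws are arbitrary choices of mine.

```python
import random
from math import sqrt
from statistics import mean, stdev


def r_to_d(r):
    """Convert Pearson's r to Cohen's d, assuming two groups of equal size."""
    return 2 * r / sqrt(1 - r**2)


ALPHA, BETA = 1.34, 5.03  # beta parameters reported in footnote 1

rng = random.Random(2015)  # arbitrary seed, for reproducibility only
r_draws = [rng.betavariate(ALPHA, BETA) for _ in range(200_000)]
d_draws = [r_to_d(r) for r in r_draws]

print("mean r (analytic):", round(ALPHA / (ALPHA + BETA), 2))
print("d at the mean r:  ", round(r_to_d(0.21), 2))
print("mean d (simulated):", round(mean(d_draws), 2))
print("SD of d (simulated):", round(stdev(d_draws), 2))
```

Note that converting the mean r directly gives a slightly smaller d than averaging the converted values, because the conversion is convex. The average d is presumably computed on the converted distribution.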

Of course, in most cases in the real world, you probably do know something about the study you are about to run. In almost all cases, that knowledge will allow you to make an even more refined estimate of the effect size, either by finding a meta-analysis that looks more specifically at effects that are conceptually similar to yours (you could even start with Richard et al., who helpfully break down the average effect size in social psychology by broad research area), or just by starting with the aggregate historical estimate and adjusting from there based on how you think your study differs from the average study of the past.

#### You know less than you think after the study

The argument that opened this post pointed out that we don’t know the effect size before the study has been run. That’s true, but of course, we don’t know the effect size after the study has been run either. Instead what we have is some data from which we can construct an estimate of the effect size. Realizing this allows us to ask the quantitative question: Just how good of an effect size estimate do we have at the end of a typically-sized experiment? If our estimate of the effect size after an initial study is not much better than what we could already surmise based on the historical, meta-analytic data, then it doesn’t make a lot of sense to trust the former a lot more than the latter.

Consider a typical study in which we compare two independent groups with n=30 participants per group. Below I’ve simulated some data in which the standardized mean difference between the two groups is exactly equal to the historical average of d = 0.45. The figure below shows a bootstrap sampling distribution of the effect size in this situation (see footnote 3). If we ignore all prior information that we have about typical values of the effect size, as many researchers routinely do, then this sampling distribution summarizes everything we know about the effect size after running a pretty typical study.

Compare this distribution to the right panel of the first Figure from above, which showed our prior knowledge about the likely values of d. In terms of how much information they carry about d, the two distributions are really not that different. The sampling distribution is slightly less variable—it has a standard deviation of 0.27 rather than 0.37—but this difference in variability is quite hard to see from visual inspection.

Many researchers are hesitant to rely on power analyses based on historical information about d, but feel much more comfortable relying on a power analysis based on the sample d from an initial study like this. I submit that this doesn’t make a lot of sense because we don’t really have much more information about the likely values of d after running the initial study than we already had about d before running a single participant. In fact, for small pilot studies, the sampling distribution of d might actually be more variable than our prior distribution for d based on historical, meta-analytic data. Below is a figure that compares the variability of sample d (as a function of total sample size N) to the variability of our prior beliefs about d. We can see that the two distributions have the same standard deviation at approximately N=30. One way to view this is that, in the absence of any other information about the study to be run, we have about as much prior information about d as if we had run a pilot study with N=30 (and ignored all prior information rather than updating our beliefs about d in a Bayesian fashion).
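The crossover point can be checked with a back-of-the-envelope calculation. The sketch below uses the common large-sample approximation for the variance of sample d with two equal groups, $4/N + d^2/(2N)$, rather than the bootstrap used for the figure, so the numbers will differ slightly from those reported above.

```python
from math import sqrt


def sd_of_sample_d(total_n, d=0.45):
    """Large-sample SD of sample d for two equal groups (total_n participants overall)."""
    n_per_group = total_n / 2
    variance = 2 / n_per_group + d**2 / (2 * total_n)
    return sqrt(variance)


def total_n_matching(prior_sd, d=0.45):
    """Total N at which the SD of sample d equals a given prior SD."""
    return (4 + d**2 / 2) / prior_sd**2


print(round(sd_of_sample_d(60), 2))    # n = 30 per group
print(round(total_n_matching(0.37)))   # N where sample d is as variable as the prior
```

The approximation gives an SD of about 0.26 for n=30 per group, close to the bootstrap value of 0.27, and puts the crossover at a total N of about 30.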

#### Living with uncertainty

Whether we use historical data or data from previous experiments we have run, there will always be some uncertainty about the effect size. So there are a range of plausible assumptions we could make about the effect size when doing a power analysis, and these different assumptions imply different sample sizes to collect in the study. In many cases, the uncertainty will be pretty high, so that the range of recommended sample sizes will be quite wide, a fact which many researchers find disconcerting.

Uncertainty is a fact of scientific life and should be no cause for dismay. We have all (hopefully) learned to be comfortable with uncertainty in other aspects of the research process. Unfortunately, many researchers seem oddly unwilling to accept even modest uncertainty in the planning phase of the research. In responding to such a view, it’s hard to put it better than @gung did in this answer on Cross Validated:

“Regarding the broader claim that power analyses (a-priori or otherwise) rely on assumptions, it is not clear what to make of that argument. Of course they do. So does everything else. Not running a power analysis, but just gathering an amount of data based on a number you picked out of a hat, and then analyzing your data, will not improve the situation.”

Uncertainty is there whether we like it or not. We should try to make the best design decisions possible in light of that uncertainty. Power analysis is our best tool for doing so. Before I close the post, let me clarify: In my opinion, there is nothing wrong with planning experiments based on rules of thumb. I acknowledge that much of the time it won’t make sense to do a formal power analysis for each and every experiment, because often we won’t have a lot of specific information about the particular study we’re about to run beyond the kind of general information we have about the typical experiments we tend to run. My point is that we should apply statistically well-informed rules of thumb that are based on historical, meta-analytic data, and are calibrated to work pretty well in a range of realistic research situations—not dubious heuristics like an n=30 rule. One of the most important functions of power analysis is to help us construct such good rules of thumb.

1 For those interested, the parameters of this beta distribution are about $\alpha=1.34, \beta=5.03$.

2 The correct conversion from Pearson’s r to Cohen’s d depends on the assumed proportion of participants in the two groups. The statistics that I present in the figure are based on the standard formula that assumes the group sizes are equal. I experimented with various ways of relaxing that assumption in a realistic manner, but ultimately found that the difference was negligible unless one assumes the group sizes tend to be markedly and unrealistically unequal.

3 The mean shown in the figure is the mean of the bootstrap distribution. This mean is slightly higher than the assumed value of 0.45 because the sampling distribution of d is slightly positively skewed, reflecting the fact that sample d is a slightly positively biased estimate of population d.

4 Thanks to Katie Wolsiefer for this figure caption, which is way better than my original.

# Think about total N, not n per cell

The bottom line of this post is simple. There are lots of rules of thumb out there for minimum sample sizes to use in between-subjects factorial experiments. But they are virtually always formulated in terms of the sample size per cell, denoted as small n. For example, one ubiquitous informal rule suggests using about n=30 or so. The problem is that cursory power analysis shows that rules based on small n don’t really make sense in general, because what is more directly relevant for power is the total sample size, denoted as big N. So if you must rely on rules of thumb—which I actually don’t have any big problems with—try to use sample size rules based on big N, not small n.

## The example

The idea of writing about this came from a recent interaction with a colleague, which I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of N=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s d = 0.5.

This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about n=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only n=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not decrease. So their power calculation must be flawed, right?”

The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small n.

## Big N vs. small n

Mathematically, it’s a little tricky to say that power is a function of N rather than n. After all, one can write the relevant equations in terms of N or in terms of n, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of N than it is a function of n, so that rules of thumb based on N are far more useful than rules of thumb based on n.

One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on N and one based on n, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.

Here I’ve chosen an N=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on N, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on n, power remains a more complicated, joint function of the effect size and the factorial structure.
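A comparison along these lines can be sketched in a few lines of Python. The assumptions are mine: d is the standardized difference for a two-level contrast (standardized by the within-cell SD), so the noncentrality parameter is $d\sqrt{N}/2$; power is computed with a normal approximation; and the n=30 rule is used as the small-n rule purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_for_total_n(d, total_n, alpha=0.05):
    """Approximate two-sided power for a two-level contrast with total_n observations."""
    delta = d * sqrt(total_n) / 2
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta - crit) + z.cdf(-delta - crit)


designs = {"2": 2, "2x2": 4, "2x2x2": 8}  # design name -> number of cells

for d in (0.2, 0.5, 0.8):
    for name, cells in designs.items():
        by_big_n = power_for_total_n(d, 128)           # rule: N = 128 in total
        by_small_n = power_for_total_n(d, 30 * cells)  # rule: n = 30 per cell
        print(f"d={d} design={name}: N rule {by_big_n:.2f}, n rule {by_small_n:.2f}")
```

Under the N rule the power depends only on d, whatever the design. Under the n rule the same d yields quite different power as the number of cells grows.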

## Final caveat / technical footnote

Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called $2^k$ factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast (which we might call N’) works pretty well as a simple approximation in most cases. The more technically correct (but more complicated) procedure depends on the product of N and the variance of the contrast; see this working paper of mine for more details.

# The hierarchical ordering principle

(Background reading: Jeff Rouder’s blog post on the “dominance principle”, which this post is mainly a response to.)

How many subjects should we routinely aim to recruit when we conduct experiments using within-subjects designs? It’s well known that within-subjects experiments tend to be more sensitive than between-subjects experiments for detecting mean differences of a comparable size, so one reasonable albeit vague answer is: “fewer than we would recruit if it were a between-subjects experiment.”

Interestingly, there is no logically or mathematically necessary reason why this truism of experimental design must be true in general. That is, it’s theoretically possible for the within-subjects version of an experiment to be no more powerful than the between-subjects version, or even for it to be less powerful. But, as a matter of empirical fact, the within-subjects experiment typically is more powerful, often substantially so. This blog post is about different possible explanations for why this is usually true.

## Distinct Sources of Random Variation

Before we can explain the observation described above, we need to clarify in more precise terms what it is we’re trying to explain. To do so, we must clearly distinguish the estimable sources of random variation that perturb subjects’ responses in a simple within-subjects experiment. To illustrate, below is a small, made-up batch of data from a simple design involving 10 subjects measured both before (“pre-test”) and after (“post-test”) undergoing some Treatment, with each subject’s two responses tied together.

Subject variance: Some subjects are “high responders” who tend to have high scores on average, other subjects are “low responders” who tend to have low scores. Subject variance is captured by the variance of the subject means. In terms of the plot above, this corresponds to the variance of the mid-points of the 10 lines.

Subject-by-Treatment interaction variance: Some subjects tend to show large Treatment effects (big increases from pre-test to post-test) while other subjects tend to show small or even negative Treatment effects. In terms of the plot above, imagine that we characterized the slope of each line with a number giving the difference from pre-test to post-test. The Subject-by-Treatment interaction variance is captured by the variance of those slopes or differences.

Error variance: Imagine that rather than measuring each subject only once at pre-test and once at post-test, we had measured each subject multiple times before and after treatment. Naturally we would still expect some random fluctuation in a particular subject’s pre-test scores and some fluctuation in that subject’s post-test scores, due to measurement error or whatever else. This variation is error variance—it cannot be accounted for by any other variable measured in the experiment. In the present example, where we supposed that subjects were measured only a single time at pre-test and post-test, we still imagine that the responses are perturbed by some amount of error variance. However, without multiple measurements at each time-point, we cannot statistically distinguish the error variance from the SxT interaction variance; the design confounds them.
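The first two quantities can be computed directly from paired scores. Here is a minimal sketch with hypothetical numbers (these are not the data plotted in this post). As noted above, with a single measurement per time-point the variance of the differences mixes Subject-by-Treatment variance with Error variance, so it is only an upper bound on the former.

```python
from statistics import mean, variance

# Hypothetical (pre-test, post-test) scores for 10 subjects.
scores = [(10, 12), (14, 15), (8, 11), (20, 21), (16, 19),
          (12, 13), (18, 21), (6, 8), (22, 24), (11, 12)]

# Mid-point of each subject's line: reflects Subject variance.
midpoints = [(pre + post) / 2 for pre, post in scores]

# Pre-to-post difference for each subject: reflects SxT variance plus error.
differences = [post - pre for pre, post in scores]

print("mean Treatment effect:  ", mean(differences))
print("variance of mid-points: ", round(variance(midpoints), 2))
print("variance of differences:", round(variance(differences), 2))
```

In this made-up batch the subject means vary far more than the pre-to-post differences do, which is the pattern the rest of the post tries to explain.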

## The Hierarchical Ordering Principle

Now we’re in a position to more precisely describe the empirical pattern discussed at the beginning of this post. Ready? Here we go.

In a between-subjects experiment, the relevant sources of variation—that is, the sources of variation that end up in the standard error of the overall Treatment effect—are Error variance and Subject variance. (The Subject-by-Treatment variance can’t be estimated in that case because we only observed each subject at pre-test or post-test, but not both.) But in a within-subjects experiment, the relevant sources of variation are Error variance and Subject-by-Treatment variance. (The variance in the subject means can be estimated in this case, but it is no longer relevant, since intuitively all that matters is each subject’s pre-test to post-test difference.)

Now, we assume the size of the effect being studied is the same regardless of whether we used a between- or within-subjects design. So if the within-subjects experiment is more powerful than the between-subjects experiment, this implies that the Subject-by-Treatment interaction variance is smaller than the Subject variance. This is exactly what the “hierarchical ordering principle” says:

“Hierarchical ordering […] is a term denoting the observation that main effects tend to be larger on average than two-factor interactions, two-factor interactions tend to be larger on average than three-factor interactions, and so on” (Li, Sudarsanam, & Frey, 2006, p. 34).

Empirically speaking, this seems to be true much of the time. But why? Below I examine three arguments for why we might expect to observe hierarchical ordering more often than not. As you read through the arguments below, remember what we’re about here: We’re trying to explain hierarchical ordering because this would in turn be an explanation for why within-subjects designs are typically more powerful than between-subjects designs. The arguments below are far from decisive, but they are interesting and plausible.

1. With small effects, linearity dominates

One possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly due to the range over which experimenters typically explore factors. In the limit that experimenters explore small changes in factors and to the degree that systems exhibit continuity of responses and their derivatives, linear effects of factors tend to dominate. Therefore, to the extent that hierarchical ordering is common in experimentation, it is due to the fact that many experiments are conducted for the purpose of minor refinement rather than broad-scale exploration” (p. 34)

Huh? Actually this is not too hard to understand if we consider the situation graphically. The basic idea is illustrated in the figure below.

Imagine that we’re studying some dependent variable Y as a function of two independent variables X and Z. The effects that X and Z have on Y might have any functional form: they could have a joint, multiplicative (interactive) effect on Y, they could individually have diminishing or asymptoting effects on Y, or any other crazy, nonlinear sort of effects we might imagine. I set up the example above so that the true relationship is actually an ideal interaction: Y = XZ.

The purpose of the plot is to illustrate that for any particular point in the predictor space (any pair of X and Z values), in a sufficiently small neighborhood around that point, the function is approximately linear. In the plot, we pick a point and zoom in on it closer and closer so that the local neighborhood we’re considering becomes smaller and smaller. As we do so, we can see that the function immediately around that point comes to resemble more and more closely a tangent plane to the response function.

In an experiment, our goal is to push around X and Z to some extent and observe the corresponding changes in Y. But if we’re studying small effects—so that we push around predictor values X and Z only a small amount from condition to condition—then we are staying within a small neighborhood, in which case any higher-order, nonlinear effects of X and Z (such as interactions) are going to account for little of the observed variance in Y compared to the linear effects of X and Z, even if the “true” response curve is highly nonlinear across a broader range of predictor values. Of course, in psychological research, small effects are extremely common, maybe even the norm. To the extent that this is true, the argument goes, we should expect higher-order terms like Subject-by-Treatment interaction variance to account for relatively little variance in the outcome, and lower-order terms like Subject variance to account for relatively more variance.
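For the Y = XZ example, this local linearity can be written out exactly. In the identity below, $(x_0, z_0)$ is the point we zoom in on and $\Delta X$, $\Delta Z$ are the displacements from it (the notation is mine, introduced only for this illustration):

$XZ = (x_0 + \Delta X)(z_0 + \Delta Z) = x_0 z_0 + z_0\,\Delta X + x_0\,\Delta Z + \Delta X\,\Delta Z$

The two middle terms are the linear effects of X and Z, and the last term is the interaction. When both displacements are small, the interaction term is the product of two small quantities, so it shrinks faster than the linear terms as the neighborhood shrinks, provided $x_0$ and $z_0$ are not themselves close to zero.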

2. Experimenters transform variables so that simple effects are predicted

Another possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly determined by the ability of experimenters to transform the inputs and outputs of the system to obtain a parsimonious description of system behavior […] For example, it is well known to aeronautical engineers that the lift and drag of wings is more simply described as a function of wing area and aspect ratio than by wing span and chord. Therefore, when conducting experiments to guide wing design, engineers are likely to use the product of span and chord (wing area) and the ratio of span and chord (the aspect ratio) as the independent variables” (p. 34).

This process described by Li et al. (2006) certainly happens in psychology as well. For example, in priming studies in which participants respond to prime-target stimulus pairs, it is common in my experience for researchers to code the “prime type” and “target type” factors in such an experiment so that the classic priming effect is represented as a main effect of prime-target congruency vs. incongruency, rather than as a prime type × target type interaction. And in social psychology, there are many studies that involve a my-group-membership × your-group-membership interaction effect, which is often better characterized and coded as a main effect of ingroup (group congruency) vs. outgroup (group incongruency). It seems natural to expect individual differences in these more robust effects to have greater variance than individual differences in the incidental effects, which are now coded as higher-order interactions, and this would give rise to hierarchical ordering in the random variance components. Other transformations that could have similar effects are things like analyzing the logarithm of response times, the square root of count variables, etc.
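The priming recoding can be sketched in a few lines. The cell means below are made up purely for illustration, and the ±1 coding of the factors is an assumption. With that coding, the congruency predictor is literally the same column as the prime type × target type product, so the "main effect" of congruency and the "interaction" are one and the same number.

```python
import numpy as np

# Hypothetical mean response times (ms) for the four prime-target cells.
# Factor levels are coded -1 / +1 (an illustrative choice).
prime = np.array([-1.0, -1.0, +1.0, +1.0])
target = np.array([-1.0, +1.0, -1.0, +1.0])
mean_rt = np.array([500.0, 540.0, 545.0, 505.0])

# Congruency: +1 when prime and target match, -1 when they mismatch
congruency = prime * target

# Coding 1: prime, target, and their interaction
full_design = np.column_stack([np.ones(4), prime, target, prime * target])
full_coef, *_ = np.linalg.lstsq(full_design, mean_rt, rcond=None)

# Coding 2: congruency treated as a single main effect
reduced_design = np.column_stack([np.ones(4), congruency])
reduced_coef, *_ = np.linalg.lstsq(reduced_design, mean_rt, rcond=None)

print("interaction coefficient:", full_coef[3])
print("congruency coefficient: ", reduced_coef[1])
```

Both coefficients come out equal because the two codings share the same predictor column. Only the label, "interaction" or "main effect", changes.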

3. Nested designs confound higher-order effects into lower-order effects

A final argument is that lower-order effects, like the Subject variance in our simple within-subjects design, often implicitly contain more confounded sources of variation than higher-order effects. For example, we saw above that in the between-subjects version of our experiment, the Subject-by-Treatment interaction variance could not be estimated. It turns out that this variance is implicitly confounded with the Subject variance (actually, all three sources of variance are confounded if we observe each subject only once). The Subject variance we observe, call it var(S’), is really the sum of the Subject variance, var(S), and the Subject-by-Treatment variance, var(SxT): var(S’) = var(S) + var(SxT).

This confounding may not seem obvious at first, but it’s pretty easy to see graphically. Consider the totally contrived dataset below. These data are generated under a within-subjects design where the subjects have exactly 0 variance in their mean responses. But imagine that instead of observing the complete data in a within-subjects design, we observed only the pre-test scores for half of the subjects and only the post-test scores for the other half of the subjects, in a between-subjects design. It would appear in that case that some subjects are high responders and others low responders—that is, it would appear that there is stable Subject variance var(S)—but obviously this is totally due to unobserved variance in the subject slopes, var(SxT).
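The same contrived scenario can be simulated. This is a rough sketch under stated assumptions, not the data behind the figure: treatment is coded -1 (pre) and +1 (post), the variance components are hypothetical values, and residual error is left out so that the confounding is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 10_000

# Hypothetical variance components (illustrative values)
var_s = 0.0    # subjects do not differ in their mean response
var_sxt = 1.0  # subjects do differ in their treatment effect

subject_effect = np.sqrt(var_s) * rng.standard_normal(n_subjects)
slope_deviation = np.sqrt(var_sxt) * rng.standard_normal(n_subjects)

# Complete within-subjects data, treatment coded -1 (pre) / +1 (post)
pre = subject_effect - slope_deviation
post = subject_effect + slope_deviation

# Within-subjects design: subject means show no variance at all
within_subject_var = ((pre + post) / 2).var(ddof=1)

# Between-subjects design: we see only pre scores for half the
# subjects and only post scores for the other half
half = n_subjects // 2
observed_pre = pre[:half]
observed_post = post[half:]
apparent_subject_var = (observed_pre.var(ddof=1)
                        + observed_post.var(ddof=1)) / 2

print("var of subject means (within design):", within_subject_var)
print("apparent var(S') (between design):   ", apparent_subject_var)
```

In the between-subjects version the apparent Subject variance should land close to `var_sxt`, even though the true `var_s` is zero, which is the var(S’) = var(S) + var(SxT) confounding in action.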

The same idea extends to higher-order interaction effects. If we have a random two-way interaction that we can conceive of as being part of an incompletely observed three-way interaction, then we can suppose that the variance in that two-way interaction implicitly contains the variance of that unobserved three-way interaction. If we then assume that there’s a limit to how high up the design hierarchy we can plausibly take this process, this implies that, in designs in which there is some nesting of factors (including, notably, all between-subjects designs), lower-order effects will tend to have relatively greater variance by virtue of being implicit sums of a greater number of confounded higher-order interaction effects.

## References

Li, X., Sudarsanam, N., & Frey, D. D. (2006). Regularities in data from factorial experiments. Complexity, 11(5), 32–45.