(Background reading: Jeff Rouder’s blog post on the “dominance principle”, which this post is mainly a response to.)

How many subjects should we routinely aim to recruit when we conduct experiments using within-subjects designs? It’s well known that within-subjects experiments tend to be more sensitive than between-subjects experiments for detecting mean differences of a comparable size, so one reasonable albeit vague answer is: “fewer than we would recruit if it were a between-subjects experiment.”

Interestingly, there is no logically or mathematically necessary reason why this truism of experimental design must be true in general. That is, it’s theoretically possible for the within-subjects version of an experiment to be no more powerful than the between-subjects version, or even for it to be *less* powerful. But, as a matter of empirical fact, the within-subjects experiment typically *is* more powerful, often substantially so. This blog post is about different possible explanations for why this is usually true.

**Distinct Sources of Random Variation**

Before we can explain the observation described above, we need to clarify in more precise terms what it is we’re trying to explain. To do so, we must clearly distinguish the estimable sources of random variation that perturb subjects’ responses in a simple within-subjects experiment. To illustrate, below is a small, made-up batch of data from a simple design involving 10 subjects measured both before (“pre-test”) and after (“post-test”) undergoing some Treatment, with each subject’s two responses connected by a line.

**Subject variance**: Some subjects are “high responders” who tend to have high scores on average, other subjects are “low responders” who tend to have low scores. Subject variance is captured by the variance of the subject means. In terms of the plot above, this corresponds to the variance of the mid-points of the 10 lines.

**Subject-by-Treatment interaction variance**: Some subjects tend to show large Treatment effects (big increases from pre-test to post-test) while other subjects tend to show small or even negative Treatment effects. In terms of the plot above, imagine that we characterized the slope of each line with a number giving the difference from pre-test to post-test. The Subject-by-Treatment interaction variance is captured by the variance of those slopes or differences.

**Error variance**: Imagine that rather than measuring each subject only once at pre-test and once at post-test, we had measured each subject multiple times before and after treatment. Naturally we would still expect some random fluctuation in a particular subject’s pre-test scores and some fluctuation in that subject’s post-test scores, due to measurement error or whatever else. This variation is error variance—it cannot be accounted for by any other variable measured in the experiment. In the present example, where we supposed that subjects were measured only a single time at pre-test and post-test, we still imagine that the responses are perturbed by some amount of error variance. However, without multiple measurements at each time-point, we cannot statistically distinguish the error variance from the SxT interaction variance; the design confounds them.
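To make the three sources concrete, here is a minimal simulation sketch. The additive model, the function name `simulate_pre_post`, and all parameter values are my own assumptions for illustration, not something taken from the plotted data. Each subject gets one draw that moves both scores together (Subject), one draw that moves them apart (Subject-by-Treatment), and an independent error draw for each score.

```python
import random
import statistics

def simulate_pre_post(n_subjects, sd_subject, sd_sxt, sd_error,
                      effect=1.0, grand_mean=10.0, seed=0):
    """Return a list of (pre, post) scores under an assumed additive model."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_subjects):
        s = rng.gauss(0, sd_subject)   # Subject: shifts both scores together
        st = rng.gauss(0, sd_sxt)      # Subject-by-Treatment: pushes them apart
        pre = grand_mean + s - (effect / 2 + st) + rng.gauss(0, sd_error)
        post = grand_mean + s + (effect / 2 + st) + rng.gauss(0, sd_error)
        scores.append((pre, post))
    return scores

# Hypothetical values: var(S) = 4, var(SxT) = 1, var(Error) = 0.25
scores = simulate_pre_post(n_subjects=20000, sd_subject=2.0,
                           sd_sxt=1.0, sd_error=0.5)

# Variance of the line mid-points: estimates var(S) + var(Error) / 2
var_midpoints = statistics.variance([(pre + post) / 2 for pre, post in scores])
# Variance of the half-differences: estimates var(SxT) + var(Error) / 2
var_half_diffs = statistics.variance([(post - pre) / 2 for pre, post in scores])

print(f"mid-points: {var_midpoints:.2f}, half-differences: {var_half_diffs:.2f}")
```

Under these assumed values the two estimates should land near 4.125 and 1.125. With one measurement per time-point, neither quantity isolates the error variance, so the slope variance mixes Subject-by-Treatment and Error variation, which is the confounding described above.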

**The Hierarchical Ordering Principle**

Now we’re in a position to more precisely describe the empirical pattern discussed at the beginning of this post. Ready? Here we go.

In a between-subjects experiment, the relevant sources of variation—that is, the sources of variation that end up in the standard error of the overall Treatment effect—are Error variance and Subject variance. (The Subject-by-Treatment variance can’t be estimated in that case because we only observed each subject at pre-test or post-test, but not both.) But in a within-subjects experiment, the relevant sources of variation are Error variance and Subject-by-Treatment variance. (The variance in the subject means *can* be estimated in this case, but it is no longer relevant, since intuitively all that matters is each subject’s pre-test to post-test difference.)
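The two standard errors can be written out as a short sketch under one assumed parameterization: each response is a grand mean plus a Subject effect, plus or minus a Treatment effect and a Subject-by-Treatment effect, plus Error. The function names and the equal-number-of-observations comparison are my own choices for illustration; other codings would rescale the Subject-by-Treatment term.

```python
import math

def se_within(n_subjects, var_sxt, var_error):
    """SE of the mean pre-to-post difference; each subject is measured twice.

    Each subject's difference has variance 4*var(SxT) + 2*var(Error);
    the Subject variance cancels out of the difference.
    """
    return math.sqrt((4 * var_sxt + 2 * var_error) / n_subjects)

def se_between(n_per_group, var_subject, var_sxt, var_error):
    """SE of the difference between two independent group means.

    Each single observation has variance var(S) + var(SxT) + var(Error),
    where the first two terms cannot be separated in this design.
    """
    per_observation = var_subject + var_sxt + var_error
    return math.sqrt(2 * per_observation / n_per_group)

# Same number of observations in both designs (40): 20 subjects measured
# twice versus 20 subjects per group. Variance values are hypothetical.
within = se_within(n_subjects=20, var_sxt=1.0, var_error=1.0)
between = se_between(n_per_group=20, var_subject=4.0, var_sxt=1.0, var_error=1.0)
print(f"within: {within:.3f}, between: {between:.3f}")
```

Holding the number of observations fixed, the Error terms in the two expressions are identical under this parameterization, so any difference between the designs comes down to the Subject and Subject-by-Treatment terms.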

Now, we assume the size of the effect being studied is the same regardless of whether we used a between- or within-subjects design. So if the within-subjects experiment is more powerful than the between-subjects experiment, this implies that the Subject-by-Treatment interaction variance is smaller than the Subject variance. This is exactly what the “hierarchical ordering principle” says:

“Hierarchical ordering […] is a term denoting the observation that main effects tend to be larger on average than two-factor interactions, two-factor interactions tend to be larger on average than three-factor interactions, and so on” (Li, Sudarsanam, & Frey, 2006, p. 34).

Empirically speaking, this seems to be true much of the time. But why? Below I examine three arguments for why we might expect to observe hierarchical ordering more often than not. As you read through the arguments below, remember what we’re about here: We’re trying to explain hierarchical ordering because this would in turn be an explanation for why within-subjects designs are typically more powerful than between-subjects designs. The arguments below are far from decisive, but they are interesting and plausible.

**With small effects, linearity dominates**

One possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly due to the range over which experimenters typically explore factors. In the limit that experimenters explore small changes in factors and to the degree that systems exhibit continuity of responses and their derivatives, linear effects of factors tend to dominate. Therefore, to the extent that hierarchical ordering is common in experimentation, it is due to the fact that many experiments are conducted for the purpose of minor refinement rather than broad-scale exploration” (p. 34)

Huh? Actually this is not too hard to understand if we consider the situation graphically. The basic idea is illustrated in the figure below.

Imagine that we’re studying some dependent variable Y as a function of two independent variables X and Z. The effects that X and Z have on Y might have any functional form: they could have a joint, multiplicative (interactive) effect on Y, they could individually have diminishing or asymptoting effects on Y, or any other crazy, nonlinear sort of effects we might imagine. I set up the example above so that the true relationship is actually an ideal interaction: Y = XZ.

The purpose of the plot is to illustrate that for any particular point in the predictor space (any pair of X and Z values), in a sufficiently small neighborhood around that point, the function is approximately linear. In the plot, we pick a point and zoom in on it closer and closer so that the local neighborhood we’re considering becomes smaller and smaller. As we do so, we can see that the function immediately around that point comes to resemble more and more closely a *tangent plane* to the response function.

In an experiment, our goal is to push around X and Z to some extent and observe the corresponding changes in Y. But if we’re studying small effects—so that we push around predictor values X and Z only a small amount from condition to condition—then we are staying within a small neighborhood, in which case any higher-order, nonlinear effects of X and Z (such as interactions) are going to account for little of the observed variance in Y compared to the linear effects of X and Z, even if the “true” response curve is highly nonlinear across a broader range of predictor values. Of course, in psychological research, small effects are extremely common, maybe even the norm. To the extent that this is true, the argument goes, we should expect higher-order terms like Subject-by-Treatment interaction variance to account for relatively little variance in the outcome, and lower-order terms like Subject variance to account for relatively more variance.
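The zooming argument can be checked numerically for the Y = XZ example. The sketch below is my own illustration (the function name `interaction_share`, the grid design, and the chosen centre point are assumptions): it lays a square grid of design points around a centre (x0, z0) and computes the share of the variance in Y that comes from the product term, as opposed to the tangent plane.

```python
import statistics

def interaction_share(x0, z0, half_width, steps=21):
    """Share of var(Y) due to the product term on a grid around (x0, z0).

    Writing x = x0 + dx and z = z0 + dz gives
    Y = x*z = x0*z0 + (z0*dx + x0*dz) + dx*dz,
    i.e. a constant, a tangent-plane (linear) part, and an interaction part.
    """
    offsets = [half_width * (2 * i / (steps - 1) - 1) for i in range(steps)]
    interaction_part = []
    y_values = []
    for dx in offsets:
        for dz in offsets:
            interaction_part.append(dx * dz)
            y_values.append((x0 + dx) * (z0 + dz))
    return statistics.pvariance(interaction_part) / statistics.pvariance(y_values)

# Shrink the neighbourhood around the (arbitrary) point x0 = z0 = 1
shares = [interaction_share(1.0, 1.0, h) for h in (1.0, 0.5, 0.1)]
for h, share in zip((1.0, 0.5, 0.1), shares):
    print(f"half-width {h}: interaction share = {share:.4f}")
```

On a symmetric grid like this the linear and interaction parts are uncorrelated, so the share is a clean ratio. It falls toward zero as the half-width shrinks, even though the underlying function is a pure interaction.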

**Experimenters transform variables so that simple effects are predicted**

Another possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly determined by the ability of experimenters to transform the inputs and outputs of the system to obtain a parsimonious description of system behavior […] For example, it is well known to aeronautical engineers that the lift and drag of wings is more simply described as a function of wing area and aspect ratio than by wing span and chord. Therefore, when conducting experiments to guide wing design, engineers are likely to use the product of span and chord (wing area) and the ratio of span and chord (the aspect ratio) as the independent variables” (p. 34).

This process described by Li et al. (2006) certainly happens in psychology as well. For example, in priming studies in which participants respond to prime-target stimulus pairs, it is common in my experience for researchers to code the “prime type” and “target type” factors in such an experiment so that the classic priming effect is represented as a main effect of prime-target congruency vs. incongruency, rather than as a prime type × target type interaction. And in social psychology, there are many studies that involve a my-group-membership × your-group-membership interaction effect, which is often better characterized and coded as a main effect of ingroup (group congruency) vs. outgroup (group incongruency). It seems natural to expect individual differences in these more robust effects to have greater variance than individual differences in the incidental effects, which are now coded as higher-order interactions, and this would give rise to hierarchical ordering in the random variance components. Other transformations that could have similar effects include analyzing the logarithm of response times or the square root of count variables.
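The priming recoding can be shown in a few lines. Everything in the sketch below is hypothetical: the factor levels, the cell means (in milliseconds), and the helper `contrast` are invented purely to illustrate that the prime type × target type interaction contrast and the congruency main-effect contrast use the same weights.

```python
# Hypothetical mean response times (ms) for a 2 x 2 priming design,
# keyed by (prime type, target type). These numbers are made up.
cell_means = {
    ("positive", "positive"): 520.0,
    ("positive", "negative"): 560.0,
    ("negative", "positive"): 565.0,
    ("negative", "negative"): 525.0,
}
code = {"positive": 1, "negative": -1}

def contrast(weights):
    """Weighted sum of cell means, scaled to a difference between two means."""
    return sum(w * cell_means[cell] for cell, w in weights.items()) / 2

# Interaction weights: product of the two factors' +1/-1 codes
interaction_weights = {cell: code[cell[0]] * code[cell[1]] for cell in cell_means}
# Congruency weights: +1 if prime and target match, -1 otherwise
congruency_weights = {cell: 1 if cell[0] == cell[1] else -1 for cell in cell_means}

interaction_effect = contrast(interaction_weights)
congruency_effect = contrast(congruency_weights)
print(interaction_effect, congruency_effect)
```

Because the two sets of weights are identical, the same effect can be labeled either as a two-way interaction or as a main effect of congruency. Which label it gets is a coding decision by the experimenter.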

**Nested designs confound higher-order effects into lower-order effects**

A final argument is that lower-order effects, like the Subject variance in our simple within-subjects design, often implicitly contain more confounded sources of variation than higher-order effects. For example, we saw above that in the between-subjects version of our experiment, the Subject-by-Treatment interaction variance could not be estimated. It turns out that this variance is implicitly confounded with the Subject variance (actually, all three sources of variance are confounded if we observe each subject only once) so that the Subject variance we observe—let’s call it var(S’)—is really the *sum* of the Subject variance, var(S), and the Subject-by-Treatment variance, var(SxT), so that var(S’) = var(S) + var(SxT).

This confounding may not seem obvious at first, but it’s pretty easy to see graphically. Consider the totally contrived dataset below. These data are generated under a within-subjects design where the subjects have exactly 0 variance in their mean responses. But imagine that instead of observing the complete data in a within-subjects design, we observed only the pre-test scores for half of the subjects and only the post-test scores for the other half of the subjects, in a between-subjects design. It would appear in that case that some subjects are high responders and others low responders—that is, it would appear that there is stable Subject variance var(S)—but obviously this is totally due to unobserved variance in the subject slopes, var(SxT).
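The same contrived situation can be reproduced in a short simulation. This is a sketch under my own assumptions (no error variance, normally distributed slopes, arbitrary grand mean and effect size), not the dataset from the plot: every subject has exactly the same mean, yet a between-subjects slice of the data shows apparent subject-to-subject variation.

```python
import random
import statistics

rng = random.Random(1)
n_subjects = 1000
grand_mean, effect = 10.0, 1.0   # hypothetical values

# Subjects differ only in their slopes: var(S) = 0 and var(SxT) = 1
slope_devs = [rng.gauss(0, 1.0) for _ in range(n_subjects)]
pre = [grand_mean - (effect / 2 + st) for st in slope_devs]
post = [grand_mean + (effect / 2 + st) for st in slope_devs]

# Within-subjects view: every subject's mean equals the grand mean
subject_means = [(a + b) / 2 for a, b in zip(pre, post)]
var_subject_means = statistics.pvariance(subject_means)

# Between-subjects view: pre-test only for the first half of the subjects,
# post-test only for the second half
pre_only = pre[: n_subjects // 2]
post_only = post[n_subjects // 2:]
apparent_var_pre = statistics.variance(pre_only)
apparent_var_post = statistics.variance(post_only)

print(f"variance of subject means: {var_subject_means:.3g}")
print(f"apparent variance, pre-test group: {apparent_var_pre:.2f}")
print(f"apparent variance, post-test group: {apparent_var_post:.2f}")
```

The variance of the subject means is zero up to floating-point rounding, while the within-group variances in the between-subjects slice are close to the slope variance. An analyst who saw only the between-subjects data would have no way to tell this apart from stable Subject variance.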

The same idea extends to higher-order interaction effects. If we have a random two-way interaction that we can conceive of as being part of an incompletely observed three-way interaction, then we can suppose that the variance in that two-way interaction implicitly contains the variance of that unobserved three-way interaction. If we then assume that there’s a limit to how high up the design hierarchy we can plausibly take this process, this implies that, in designs in which there is some nesting of factors (including, notably, all between-subjects designs), lower-order effects will tend to have relatively greater variance by virtue of being implicit sums of a greater number of confounded higher-order interaction effects.

**References**

Li, X., Sudarsanam, N., & Frey, D. D. (2006). Regularities in data from factorial experiments. *Complexity*, *11*(5), 32–45.

Nice read!

The quote from Li et al. reminds me of the following citation from the groundbreaking book by McCullagh & Nelder (Generalized Linear Models, Chapman & Hall, 1989): “It is important that the final model or models should make sense physically: at a minimum, this usually means that interactions should not be included without main effects nor higher-degree polynomial terms without their lower-degree relatives. Furthermore, if the model is to be used as a summary of the findings of one out of several studies bearing on the same phenomenon, main effects would usually be included whether significant or not. Strict adherence to this policy makes it easier to compare the results of various studies and helps to avoid the apparent conflicts that occur when different fitted models with different sets of terms are used in each study.”

Even in the exceptional cases where the main effects do not empirically outweigh the interactions, I would always opt for hierarchical models.