The bottom line of this post is simple. There are lots of rules of thumb out there for minimum sample sizes to use in between-subjects factorial experiments. But they are virtually always formulated in terms of the sample size per cell, denoted as small n. For example, one ubiquitous informal rule suggests using about n=30 or so. The problem is that even a cursory power analysis shows that rules based on small n don’t really make sense in general, because what is more directly relevant for power is the total sample size, denoted as big N. So if you must rely on rules of thumb (which I actually don’t have any big problems with), try to use sample size rules based on big N, not small n.
The idea of writing about this came from a recent interaction with a colleague, which I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of N=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s d = 0.5.
This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about n=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only n=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not decrease. So their power calculation must be flawed, right?”
The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small n.
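To make the comparison concrete, here is a rough sketch of the two power calculations in Python. It rests on assumptions that are mine rather than the manuscript's: d is taken to be the difference between the two marginal means of one factor in within-cell standard deviation units, the design is balanced, and the test is approximated by a normal (z) test, which ignores the few error degrees of freedom the extra cells cost. The helper name `approx_power` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, total_n, alpha=0.05):
    """Approximate two-sided power for a two-level effect in a balanced design.

    Normal approximation (an assumption of this sketch): the exact
    calculation would use the noncentral t or F distribution.
    """
    z = NormalDist()
    ncp = d * sqrt(total_n) / 2            # noncentrality depends on total N only
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


# Two independent groups with n = 64 per group, so N = 128.
two_group = approx_power(0.5, total_n=2 * 64)

# 2x2x2 design with n = 16 per cell, so N = 128 as well.
factorial = approx_power(0.5, total_n=8 * 16)

print(round(two_group, 2), round(factorial, 2))
```

Under this approximation the two designs give identical power, close to the 80% the authors reported, because each main effect in the 2×2×2 design still compares 64 observations against 64.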
Big N vs. small n
Mathematically, it’s a little tricky to say that power is a function of N rather than n. After all, one can write the relevant equations in terms of N or in terms of n, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of N than it is a function of n, so that rules of thumb based on N are far more useful than rules of thumb based on n.
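As a sketch of what "written in terms of N or in terms of n" looks like, consider a balanced design with k two-level factors, per-cell size n, total size N = 2^k n, and common within-cell standard deviation σ. The notation here (δ for the noncentrality parameter, μ₁ and μ₂ for the two marginal means of the factor being tested) is my own shorthand, and d is assumed to be the marginal mean difference divided by σ. Each marginal mean averages N/2 observations, so:

```latex
\delta
  \;=\; \frac{\mu_1 - \mu_2}{\sigma\sqrt{\dfrac{1}{N/2} + \dfrac{1}{N/2}}}
  \;=\; \frac{d\,\sqrt{N}}{2}
  \;=\; \frac{d\,\sqrt{2^{k}\,n}}{2}
```

Written with N, the expression involves only d and N. Written with n, the number of factors k enters as well, which is why a fixed-n rule gives different power in different designs.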
One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on N and one based on n, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.
Here I’ve chosen an N=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on N, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on n, power remains a more complicated, joint function of the effect size and the factorial structure.
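A comparison of this kind can be roughly reproduced with a few lines of Python. This is not the calculation behind the table: it uses a normal approximation instead of the noncentral F distribution, it assumes the per-cell rule is the informal n=30 rule mentioned at the top of the post, and the effect sizes 0.2, 0.5, and 0.8 are simply Cohen's conventional small, medium, and large values. The function names are invented for the sketch.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, total_n, alpha=0.05):
    """Normal-approximation power for a two-level effect in a balanced design."""
    z = NormalDist()
    ncp = d * sqrt(total_n) / 2
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


def power_table(rule, effect_sizes=(0.2, 0.5, 0.8), max_factors=3):
    """Power for 2^1 ... 2^max_factors designs.

    `rule` maps the number of cells to the total sample size N.
    """
    table = {}
    for k in range(1, max_factors + 1):
        cells = 2 ** k
        table[k] = [approx_power(d, rule(cells)) for d in effect_sizes]
    return table


big_n_rule = power_table(lambda cells: 128)           # fixed total N = 128
small_n_rule = power_table(lambda cells: 30 * cells)  # fixed n = 30 per cell

for label, table in (("N = 128 total", big_n_rule),
                     ("n = 30 per cell", small_n_rule)):
    print(label)
    for k, powers in table.items():
        print(f"  2^{k} design:", [round(p, 2) for p in powers])
```

Under the fixed-N rule every design gets the same row of powers, one value per effect size. Under the fixed-n rule the power for a given effect size climbs as factors are added, since N doubles with each one.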
Final caveat / technical footnote
Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called 2^k factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast (which we might call N’) works pretty well as a simple approximation in most cases. The more technically correct (but more complicated) procedure depends on the product of N and the variance of the contrast; see this working paper of mine for more details.
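To give a feel for how the N’ approximation behaves, here is a small sketch for a one-way design with three groups. It is not taken from the working paper. It uses the textbook noncentrality of a contrast in a balanced design, (ψ/σ) / sqrt(Σc²/n), together with a normal approximation to the test, and it takes d to mean the contrast value ψ in within-cell standard deviation units. The function names, the n=20 per cell, and the two sets of contrast weights are all illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def power_from_ncp(ncp, alpha=0.05):
    """Two-sided power under a normal approximation to the test statistic."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


def contrast_ncp(d, weights, n_per_cell):
    """Noncentrality of the contrast sum(c_j * mean_j) with equal cell sizes."""
    return d / sqrt(sum(c * c for c in weights) / n_per_cell)


def n_prime_ncp(d, weights, n_per_cell):
    """N' shortcut: treat the contrast as a two-group comparison using only
    the observations in cells with nonzero weight."""
    n_prime = n_per_cell * sum(1 for c in weights if c != 0)
    return d * sqrt(n_prime) / 2


d, n = 0.5, 20
for weights in ((1, -1, 0), (1, -0.5, -0.5)):
    full = power_from_ncp(contrast_ncp(d, weights, n))
    shortcut = power_from_ncp(n_prime_ncp(d, weights, n))
    print(weights, round(full, 2), round(shortcut, 2))
```

For the pairwise contrast (1, -1, 0) the two calculations agree, since only two groups are involved. For (1, -0.5, -0.5) the shortcut overstates power by a few percentage points in this example, because the cells being compared are not equally weighted.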