Think about total N, not n per cell

The bottom line of this post is simple. There are lots of rules of thumb out there for minimum sample sizes to use in between-subjects factorial experiments. But they are virtually always formulated in terms of the sample size per cell, denoted as small n. For example, one ubiquitous informal rule suggests using about n=30 or so. The problem is that cursory power analysis shows that rules based on small n don’t really make sense in general, because what is more directly relevant for power is the total sample size, denoted as big N. So if you must rely on rules of thumb—which I actually don’t have any big problems with—try to use sample size rules based on big N, not small n.

The example

The idea of writing about this came from a recent interaction with a colleague, which I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of N=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s d = 0.5.

This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about n=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only n=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not decrease. So their power calculation must be flawed, right?”

The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small n.

Big N vs. small n

Mathematically, it’s a little tricky to say that power is a function of N rather than n. After all, one can write the relevant equations in terms of N or in terms of n, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of N than it is a function of n, so that rules of thumb based on N are far more useful than rules of thumb based on n.

One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on N and one based on n, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.
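For readers who do want the symbolic version, here is a minimal sketch, assuming a balanced 2^k design with within-cell standard deviation σ, and assuming each factorial effect is treated as a two-group comparison between the half of the sample coded +1 and the half coded −1, with d defined as that mean difference divided by σ. The symbols δ (the noncentrality parameter of the t test), μ₊ and μ₋ are notation introduced here for illustration.

```latex
\delta
  = \frac{\mu_{+} - \mu_{-}}{\sigma\sqrt{\dfrac{1}{N/2} + \dfrac{1}{N/2}}}
  = \frac{d\,\sqrt{N}}{2},
\qquad
\text{and, substituting } N = 2^{k} n, \qquad
\delta = d\,2^{\,k/2 - 1}\sqrt{n}.
```

Written in terms of N, the noncentrality parameter depends only on d and N. Written in terms of n, it also depends on the number of factors k, which is why a fixed-n rule gives different power in different designs.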


Here I’ve chosen an N=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on N, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on n, power remains a more complicated, joint function of the effect size and the factorial structure.
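The comparison described above can be roughly reproduced with a few lines of Python. This is a sketch under stated assumptions, not the calculation behind the original table: it uses a normal approximation to the two-sided test (so it ignores the residual degrees of freedom and will differ slightly from exact noncentral-t power), it assumes each factorial effect is a two-group comparison with noncentrality d·√N/2, and the n=30 rule, the effect sizes, and the function names `approx_power` and `power_table` are illustrative choices of mine.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def approx_power(d, total_n, alpha=0.05):
    """Approximate two-sided power for one factorial effect in a balanced 2^k design.

    Assumes the effect is a comparison of two halves of the sample (N/2 each),
    so the noncentrality is d * sqrt(N) / 2. Normal approximation only.
    """
    ncp = d * sqrt(total_n) / 2
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value in either tail.
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


def power_table(effect_sizes=(0.2, 0.5, 0.8), ks=(1, 2, 3), rule_N=128, rule_n=30):
    """Return rows of (k, d, power under the N rule, power under the n rule)."""
    rows = []
    for k in ks:
        cells = 2 ** k  # number of cells in a 2^k design
        for d in effect_sizes:
            power_N = approx_power(d, rule_N)          # total N held fixed
            power_n = approx_power(d, rule_n * cells)  # n per cell held fixed
            rows.append((k, d, power_N, power_n))
    return rows


if __name__ == "__main__":
    print("k  d    power(N=128)  power(n=30 per cell)")
    for k, d, p_N, p_n in power_table():
        print(f"{k}  {d:.1f}  {p_N:12.3f}  {p_n:20.3f}")
```

Under the N rule the power column is the same for every design at a given d, while under the n rule it climbs as factors are added, because total N grows with the number of cells.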

Final caveat / technical footnote

Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called 2^k factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast (which we might call N’) works pretty well as a simple approximation in most cases. The more technically correct (but more complicated) procedure depends on the product of N and the variance of the contrast; see this working paper of mine for more details.

12 thoughts on “Think about total N, not n per cell”

  1. Thanks Joe, this is a good question. I’ve just been investigating this a little and it’s interesting. Basically the issue is that in Uri’s discussion, the size of the group difference from his Study 1 is assumed to be constant. But the relevant effect size for power purposes in Study 2 is the interaction effect. And that is certainly not assumed to be constant. In fact, the relevant effect size in Study 2 is implicitly assumed to be half of what the relevant effect size in Study 1 was. I just wrote a follow-up blog post elaborating this in more detail.

    Thinking about this has been super useful for me as I continue to revise my manuscript on these kinds of issues (linked to in the blog post), so I want to thank you and all the others on Twitter for starting this discussion!

    P.S. This is an edited version of my first comment in which I said some stuff about different ways of defining the effect size d for a 2×2 design, but I realized none of that stuff really matters and it’s actually much more simple than I was making it, so I wrote over those comments.

  2. Jake,
    Is it always the power to test the higher order interaction (or the main effect for the 2 cell design)?

    1. Hi Dom, sorry for the delayed response, I just noticed that emails about comments on my blog have been going straight to my spam folder. Anyway, in this 2×2 design, all of the conventional effects are just two-group comparisons (as I elaborate a little in my follow-up post). Therefore, since the effect size is defined the same way for each effect and the columns of my tables hold constant the effect size, it doesn’t matter which effect we’re talking about. That is, whether it’s the interaction or either of the main effects, the results are identical. Hope this makes sense.

  3. Hi Jake–I’m sorry to bring up such an old discussion (you and the nice commenters were posting in 2015!).

    I may be misreading your point, but it seems that you’re suggesting that the power to detect a difference between two *specific* cells increases as you add more cells to the design.

    So, if I have 80% power to detect d = .73 (N = 60, n = 30) between groups 1 and 2, then by adding 6 more cells and randomly assigning 60 people to 1 of the 8 cells, I now have *more* power to detect the same effect size between the same conditions/cells. That can’t be correct, can it? Wouldn’t the simple effect contrast, 1 -1 0 0 0 0 0 0, lose power as those 60 people are randomly distributed across the 8 cells?

    To maintain 80% power to test the difference between these two cells, wouldn’t I have to fill the 8 cells with the same n as in the two-cell design?

    1. Hi Nick, no, I’m saying that power to detect any of the factorial effects — i.e., the main effects or interactions — is the same, given the same effect size. For other contrasts such as pairwise comparisons this is not generally true.

  4. While this is all correct, it misses an important point. Using Cohen’s d for all effect sizes masks an issue. The interactions have much higher variability and require larger numerical effects to reach a given Cohen’s d. The realized power will almost never be equivalent. It requires large crossover interactions. (I recognize Cohen’s d can be calculated multiple ways in this situation, but the one that works for this evaluation with total N does have this issue.)

    Mark White has a couple of simple simulations that highlight issues with interactions.
