This post is a quick follow-up to yesterday’s post on sample size rules. I thought it was a little too long to go in the comments section, so here it is.
Some people on Twitter (and in my blog comments) remarked that my conclusion appears to fly in the face of some things Uri Simonsohn wrote on a similar topic not too long ago. Briefly, Uri describes a Study 1 with two independent groups (A1 and B1) and some non-zero effect size. Study 2 then moves to a 2×2 design in which the difference between conditions A1 and B1 is the same size as before, but the difference between the other two conditions, A2 and B2, is exactly 0, and we now test the interaction effect. Uri concludes: “To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.” Let’s call this Uri’s 2n rule*.
I thought Uri’s post was cool and certainly don’t think my point contradicts the point he made. The important thing to note here is that the effect size that Uri assumes to be constant in both studies is just the A1-B1 difference. But that’s not the effect we’re actually testing in Study 2: we’re testing the interaction effect, i.e., the A1 – B1 – A2 + B2 difference. And there is no assumption that the effect size for the interaction in Study 2 is equal to the effect size for the A1-B1 difference in Study 1. In fact, the 2n rule depends on the relevant effect size in Study 2 being half the relevant effect size in Study 1. That’s why you must increase the sample size when moving to Study 2; for the situation Uri’s talking about, the relevant effect size gets cut in half in Study 2. In my post I’m talking about cases where the relevant effect size is held constant. If the relevant effect size is held constant, then adding cells to the design has a negligible impact on power.
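To make the "held constant" case concrete, here is a minimal sketch (my own illustration, not code from either post). It assumes equal cell sizes, a common standard deviation of 1, and a standardized effect size d defined as the difference between the two halves of the sample being contrasted, as described later in this post. The function name `ncp_and_df` is made up for the example. It computes the noncentrality parameter of the test, which is what drives power, along with the residual degrees of freedom.

```python
import math

def ncp_and_df(d, total_n, n_cells):
    """Noncentrality parameter and residual df for a +1/-1 contrast.

    Assumes equal cell sizes and SD = 1. `d` is the standardized difference
    between the mean of the +1 cells and the mean of the -1 cells.
    """
    n_per_cell = total_n / n_cells
    half = n_cells / 2                       # number of cells on each side
    contrast = half * d                      # sum of +1 cell means minus sum of -1 cell means
    se = math.sqrt(n_cells / n_per_cell)     # sqrt(sum of squared weights / n per cell)
    return contrast / se, total_n - n_cells

# Same d and same total N, spread over 2 cells versus 4 cells
two_cell = ncp_and_df(d=0.5, total_n=40, n_cells=2)
four_cell = ncp_and_df(d=0.5, total_n=40, n_cells=4)
print(two_cell, four_cell)
```

Under these assumptions the noncentrality parameter is the same in both designs, and the only thing that changes is the residual degrees of freedom (38 versus 36 here), which is why the impact on power is negligible.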
Consider the following table of cell means and standard deviations (the latter in parentheses).
Let’s say there are 20 subjects in each cell. Now if Study 1 involves only groups A1 and B1 (so that there is N=40 in total) then the power of the study is 34%. And if Study 2 involves all four groups (so that there is N=80 in total), then the power to detect the interaction effect is only 20%. But if we double the sample size (so that there is n=40 in each cell and N=160 in total), then the power to detect the interaction effect is 34%, just as it was in Study 1. This is the 2n rule that Uri wrote about, and I don’t dispute it.
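The pattern in these power figures can be roughly reproduced with a short sketch. Two loud caveats: the cell means below (0.5 for A1, 0 elsewhere, SD of 1) are hypothetical values I picked because they are consistent with the power figures quoted above, and the calculation uses a normal approximation rather than the exact noncentral t distribution, so the results land within a point or two of the figures in the text rather than matching them exactly. The function name `approx_power` is also just for illustration.

```python
import math
from statistics import NormalDist

def approx_power(means, weights, n_per_cell, sd=1.0, alpha=0.05):
    """Approximate two-sided power for a contrast among cell means.

    Normal approximation: ignores the extra uncertainty from estimating
    the SD, so it slightly overstates power in small samples.
    """
    contrast = sum(w * m for w, m in zip(weights, means))
    se = sd * math.sqrt(sum(w * w for w in weights) / n_per_cell)
    ncp = contrast / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    phi = NormalDist().cdf
    return phi(ncp - z_crit) + phi(-ncp - z_crit)

# Hypothetical cell means in the order A1, B1, A2, B2 (assumed, not from the post's table)
cell_means = [0.5, 0.0, 0.0, 0.0]

study1 = approx_power(cell_means[:2], [1, -1], n_per_cell=20)
study2 = approx_power(cell_means, [1, -1, -1, 1], n_per_cell=20)
study2_doubled = approx_power(cell_means, [1, -1, -1, 1], n_per_cell=40)
print(round(study1, 2), round(study2, 2), round(study2_doubled, 2))
```

With these assumed values the approximation gives roughly 0.35 for Study 1, roughly 0.20 for the interaction with 20 per cell, and the same value as Study 1 once the per-cell sample size is doubled.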
But now let’s look at the standardized effect sizes for the two studies. We use Cohen’s d, defined as $d = (M_1 - M_2)/\sigma$, where $M_1$ and $M_2$ are the two means being compared and $\sigma$ is the pooled standard deviation, or equivalently, the root mean squared error (RMSE). Computing this in Study 1 is straightforward since there are only two groups; we have $d = (M_{A1} - M_{B1})/\sigma$.
Computing d in Study 2 is a little less clear since in that case we have four groups and not two. We saw above that the relevant mean difference is the A1 – B1 – A2 + B2 difference. The key here is to realize that the interaction effect essentially still comes down to a comparison of two groups: We are comparing the A1 and B2 groups (which have coefficients of +1 in this difference score) against the A2 and B1 groups (which have coefficients of -1). So the two relevant means to use in computing d are the mean of the A1 and B2 means, and the mean of the A2 and B1 means. This gives us $d = \left[(M_{A1} + M_{B2})/2 - (M_{A2} + M_{B1})/2\right]/\sigma$ for the interaction effect, which simplifies to $(M_{A1} - M_{B1})/(2\sigma)$ because the A2 and B2 means are equal here. In other words, the relevant effect size in Study 2 is half of what the relevant effect size in Study 1 was.
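It’s easy to show this symbolically. Let $d_1$ be the effect size in Study 1 and $d_2$ be the effect size in Study 2, with $M$ denoting a cell mean and $\sigma$ the pooled standard deviation. Then, starting with the classical definition of d,
$$d_2 = \frac{\frac{M_{A1} + M_{B2}}{2} - \frac{M_{A2} + M_{B1}}{2}}{\sigma} = \frac{(M_{A1} - M_{B1}) - (M_{A2} - M_{B2})}{2\sigma} = \frac{d_1}{2} - \frac{M_{A2} - M_{B2}}{2\sigma}.$$
In this example we’ve assumed the $M_{A2} - M_{B2}$ difference is 0, so that
$$d_2 = \frac{d_1}{2}.$$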
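The same relationship can be checked numerically. This is just a sketch: the cell means and the SD of 1 are hypothetical values chosen for illustration, and the helper names `cohens_d` and `interaction_d` are my own.

```python
def cohens_d(mean_1, mean_2, sd):
    """Classical Cohen's d: difference between two means in SD units."""
    return (mean_1 - mean_2) / sd

def interaction_d(m_a1, m_b1, m_a2, m_b2, sd):
    """d for the 2x2 interaction, treated as a two-group comparison:
    the +1 cells (A1, B2) against the -1 cells (A2, B1)."""
    plus_mean = (m_a1 + m_b2) / 2
    minus_mean = (m_a2 + m_b1) / 2
    return cohens_d(plus_mean, minus_mean, sd)

# Hypothetical values: A1 - B1 = 0.5 SD, and A2 - B2 = 0
d1 = cohens_d(0.5, 0.0, sd=1.0)
d2 = interaction_d(0.5, 0.0, 0.0, 0.0, sd=1.0)

# Hypothetical crossover: A2 - B2 is equal and opposite to A1 - B1
d2_crossover = interaction_d(0.5, 0.0, 0.0, 0.5, sd=1.0)
print(d1, d2, d2_crossover)
```

When the A2 – B2 difference is 0, the interaction d comes out at half the Study 1 d. In the crossover case, where the A2 – B2 difference fully reverses, the interaction d equals the Study 1 d, which is the situation where no extra subjects are needed.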
* Robert Abelson wrote about a similar phenomenon which he called the 42% rule: If the A1-B1 difference just reaches significance at p=.05 with size d, then the A2-B2 difference has to be at least 42% as large as d in the opposite direction for the interaction in Study 2 to reach p=.05 with the same sample size.
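Here is a rough sketch of where a figure like 42% can come from. It is my own back-of-the-envelope reasoning rather than Abelson's derivation, and it assumes equal cell sizes, a common SD, and that the critical value does not change between the two tests (that is, it ignores the change in degrees of freedom). Under those assumptions the interaction contrast has a standard error √2 times that of the simple A1-B1 difference, so the contrast itself must be √2 times as large to produce the same test statistic.

```python
import math

def interaction_t(t_simple, opposite_fraction):
    """Test statistic for the interaction, given the t for the A1-B1 difference.

    `opposite_fraction` is the size of the A2-B2 difference as a fraction of
    the A1-B1 difference, in the opposite direction. The contrast grows by a
    factor of (1 + fraction), while its standard error is sqrt(2) times larger.
    """
    return t_simple * (1 + opposite_fraction) / math.sqrt(2)

# Fraction at which the interaction t equals the simple-effect t
threshold = math.sqrt(2) - 1
print(round(threshold, 3))
```

The break-even fraction is √2 – 1, a bit over 41%, which presumably becomes 42% when rounded up to a whole percentage that is guaranteed to be enough.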