This post is a quick follow-up to yesterday’s post on sample size rules. Basically I thought it was a little too long to go in the comments section so here it is.

Some people on Twitter (and in my blog comments) remarked that my conclusion appears to fly in the face of some things Uri Simonsohn wrote on a similar topic not too long ago. Briefly, Uri writes of a Study 1 in which there are two independent groups (A1 and B1) and some non-zero effect size, and a Study 2 in which we move to a 2×2 design in which the difference between conditions A1 and B1 is the same size as before, but the difference between the other two conditions A2 and B2 is exactly 0, and we are now testing the interaction effect. Uri concludes: “To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.” Let’s call this Uri’s 2*n* rule.\*

I thought Uri’s post was cool and certainly don’t think my point contradicts the point he made. The important thing to note here is that the effect size that Uri assumes to be constant in both studies is just the A1-B1 difference. But that’s not the effect we’re actually testing in Study 2: we’re testing the interaction effect, i.e., the A1 – B1 – A2 + B2 difference. And there is no assumption that the effect size for the interaction in Study 2 is equal to the effect size for the A1-B1 difference in Study 1. In fact, the 2*n* rule depends on the relevant effect size in Study 2 being *half* the relevant effect size in Study 1. That’s why you must increase the sample size when moving to Study 2; for the situation Uri’s talking about, the relevant effect size gets cut in half in Study 2. In my post I’m talking about cases where the relevant effect size is held constant. If the relevant effect size is held constant, then adding cells to the design has a negligible impact on power.

**Numerical example**

Consider the following table of cell means and standard deviations (the latter in parentheses).

| | A | B |
|---|---|---|
| **1** | 0 (1) | 0.5 (1) |
| **2** | 0 (1) | 0 (1) |

Let’s say there are 20 subjects in each cell. Now if Study 1 involves only groups A1 and B1 (so that there is *N*=40 in total) then the power of the study is 34%. And if Study 2 involves all four groups (so that there is *N*=80 in total), then the power to detect the interaction effect is only 20%. But if we double the sample size (so that there is *n*=40 in each cell and *N*=160 in total), then the power to detect the interaction effect is 34%, just as it was in Study 1. This is the 2*n* rule that Uri wrote about, and I don’t dispute it.
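As a rough check on these numbers, here is a small Python sketch. It assumes a raw A1-B1 difference of 0.5 with a common SD of 1 (the values used in this example), and it uses a normal approximation rather than the noncentral *t* distribution, so the results should land close to, but slightly above, the figures quoted above. The function name `approx_power` is just something I made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(contrast, sigma, n_per_cell, n_cells, alpha=0.05):
    """Approximate two-sided power for a +1/-1 contrast of cell means.

    Uses a normal approximation (an assumption made here for simplicity),
    not the noncentral t distribution.
    """
    z = NormalDist()
    # Each cell mean has variance sigma^2 / n, and the contrast sums n_cells of them.
    se = sigma * sqrt(n_cells / n_per_cell)
    ncp = contrast / se
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


# Study 1: A1 vs. B1 only, n = 20 per cell.
study1 = approx_power(0.5, 1.0, n_per_cell=20, n_cells=2)
# Study 2: interaction contrast A1 - B1 - A2 + B2, n = 20 per cell.
study2 = approx_power(0.5, 1.0, n_per_cell=20, n_cells=4)
# Study 2 again with the per-cell sample size doubled.
study2_doubled = approx_power(0.5, 1.0, n_per_cell=40, n_cells=4)

print(f"Study 1, n=20 per cell: {study1:.3f}")
print(f"Study 2, n=20 per cell: {study2:.3f}")
print(f"Study 2, n=40 per cell: {study2_doubled:.3f}")
```

Under this approximation, Study 1 and the doubled Study 2 have exactly the same noncentrality, which is the 2*n* rule in miniature.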

But now let’s look at the standardized effect sizes for the two studies. We use Cohen’s *d*, defined as $d = (\mu_1 - \mu_2)/\sigma$, where $\mu_1$ and $\mu_2$ are the two means being compared and $\sigma$ is the pooled standard deviation, or equivalently, the root mean squared error (RMSE). Computing this in Study 1 is straightforward since there are only two groups; we have $d = 0.5$.
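For concreteness, here is a minimal sketch of that computation in Python. It assumes the usual pooled SD, i.e., the square root of the degrees-of-freedom-weighted average of the cell variances (which is the square root of MS within in a between-subjects design). The helper names `pooled_sd` and `cohens_d` are my own, and the inputs are the Study 1 values assumed in this example (means of 0.5 and 0, SDs of 1, 20 subjects per group).

```python
from math import sqrt


def pooled_sd(sds, ns):
    """Square root of the df-weighted average of the cell variances."""
    weighted_var = sum((n - 1) * s ** 2 for s, n in zip(sds, ns))
    total_df = sum(n - 1 for n in ns)
    return sqrt(weighted_var / total_df)


def cohens_d(mean_1, mean_2, sd):
    """Classic Cohen's d: a mean difference in pooled-SD units."""
    return (mean_1 - mean_2) / sd


# Study 1: two groups, each with SD 1 and n = 20.
sd_study1 = pooled_sd([1.0, 1.0], [20, 20])
d_study1 = cohens_d(0.5, 0.0, sd_study1)
print(sd_study1, d_study1)
```

The same `pooled_sd` works unchanged with four cells; you just pass four SDs and four sample sizes.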

Computing *d* in Study 2 is a little less clear since in that case we have four groups and not two. We saw above that the relevant mean difference is the A1 – B1 – A2 + B2 difference. The key here is to realize that the interaction effect essentially still comes down to a comparison of two groups: We are comparing the A1 and B2 groups (which have coefficients of +1 in this difference score) against the A2 and B1 groups (which have coefficients of -1). So the two relevant means to use in computing *d* are the mean of the A1 and B2 means, and the mean of the A2 and B1 means. This gives us $d = 0.25$ for the interaction effect. In other words, the relevant effect size in Study 2 is half of what the relevant effect size in Study 1 was.
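Here is a short Python sketch of the two-group view of the interaction. The cell means (B1 = 0.5, everything else 0) and the common SD of 1 are the values assumed in this example, and `two_group_d` is a hypothetical helper name. The signs come out negative simply because of the order of subtraction; the magnitudes are what matter here.

```python
# Cell means assumed for this example; all cells share a pooled SD of 1.
cell_means = {"A1": 0.0, "B1": 0.5, "A2": 0.0, "B2": 0.0}
sigma = 1.0


def two_group_d(means, plus_cells, minus_cells, sd):
    """Cohen's d comparing the average of the +1 cells to the average of the -1 cells."""
    mean_plus = sum(means[c] for c in plus_cells) / len(plus_cells)
    mean_minus = sum(means[c] for c in minus_cells) / len(minus_cells)
    return (mean_plus - mean_minus) / sd


# Study 1: A1 vs. B1.
d_study1 = two_group_d(cell_means, ["A1"], ["B1"], sigma)
# Study 2 interaction: (A1, B2) vs. (A2, B1).
d_study2 = two_group_d(cell_means, ["A1", "B2"], ["A2", "B1"], sigma)

print(d_study1, d_study2)
```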

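It’s easy to show this symbolically. Let $d_1$ be the effect size in Study 1 and $d_2$ be the effect size in Study 2. Then, starting with the classical definition of *d*,

$$d_2 = \frac{\mu_1 - \mu_2}{\sigma} = \frac{\frac{\mu_{A1} + \mu_{B2}}{2} - \frac{\mu_{A2} + \mu_{B1}}{2}}{\sigma} = \frac{(\mu_{A1} - \mu_{B1}) - (\mu_{A2} - \mu_{B2})}{2\sigma}.$$

In this example we’ve assumed the difference $\mu_{A2} - \mu_{B2}$ is 0, so that

$$d_2 = \frac{\mu_{A1} - \mu_{B1}}{2\sigma} = \frac{d_1}{2}.$$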

\* Robert Abelson wrote about a similar phenomenon, which he called the 42% rule: If the A1-B1 difference just reaches significance at *p*=.05 with size *d*, then the A2-B2 difference has to be at least 42% as large as *d* in the *opposite direction* for the interaction in Study 2 to reach *p*=.05 with the same sample size.
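A rough way to see where a number like this comes from, sketched in Python under a normal approximation (my simplification, not necessarily how Abelson derived it): the interaction contrast has a standard error that is $\sqrt{2}$ times that of a single two-group difference, so the required opposite-direction difference works out to $\sqrt{2} - 1 \approx 0.414$ of the original, which is close to the 42% figure. The values of `n` and `sigma` below are arbitrary placeholders.

```python
from math import sqrt
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(0.975)  # two-sided critical value at alpha = .05
n, sigma = 20, 1.0                    # arbitrary per-cell n and common SD

# An A1-B1 difference that *just* reaches p = .05 in a two-group test.
diff_1 = z_crit * sigma * sqrt(2 / n)


def interaction_z(diff_2):
    """z statistic for the interaction (diff_1 - diff_2) with the same per-cell n."""
    return (diff_1 - diff_2) / (sigma * sqrt(4 / n))


# A2-B2 difference in the opposite direction, sqrt(2) - 1 times as large as diff_1.
threshold_fraction = sqrt(2) - 1
diff_2_needed = -threshold_fraction * diff_1

print(f"fraction needed: {threshold_fraction:.3f}")
print(f"interaction z at that fraction: {interaction_z(diff_2_needed):.3f}")
print(f"interaction z if A2-B2 is exactly 0: {interaction_z(0.0):.3f}")
```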

Jake-

Thanks for the response. I am not a master at interaction contrasts so perhaps you can educate me a bit.

It seems that when you compare the mean of two cells to the mean of the other two cells that you are calculating d for a main effect. Essentially, your d of 0.25 is the main effect of Factor X or Factor Z, which makes sense if you think about what the marginal means would be for your data. The mean of Column B (or Row 1) would be 0.25 standard deviations greater than the mean of Column A (or Row 2).

Here is how I think about the interaction effect for this sort of design. When you apply the contrast effect weights to the cell means you get A1-B1-A2+B2, just like what you got. Rearranging the terms gives you (A1-B1)-(A2-B2) or (A1-A2)-(B1-B2). This is the “difference of difference” scores, which is a useful way to conceptualize an interaction effect. Then you standardize this contrast with a pooled standard deviation, which with a completely between-subjects design would be the square root of MS within. This would give you an interaction effect of d=0.5 for these data. This makes sense to me because you can see that the difference between A1 and B1 (0-0.5) is one half of one standard deviation greater than the difference between A2 and B2 (0-0).

I am not sure where you are getting the 2’s in the denominators, but think that may be the source of my confusion. BTW: My calculations came from Kline (2004).

Hi Randy, thanks for commenting. As you note, there are other possible ways we could define an effect size in Study 2. Basically all standardized effect sizes are just made-up quantities that we use because we think they have more sensible and desirable properties for certain purposes than the unstandardized effects. For a given unstandardized effect, there are any number of ways we could “standardize” that effect, and the only real basis we have for choosing among these different effect size definitions is in choosing the one that has the most sensible derivation and the most desirable properties relative to other candidates. I believe that the effect size I use has a more sensible derivation and more desirable properties than the alternative effect size that you mentioned, as I elaborate below.

The effect size that you discuss for the interaction in Study 2 is $d = \frac{(\mu_{A1} - \mu_{B1}) - (\mu_{A2} - \mu_{B2})}{\sigma}$. This seems reasonable enough on its face, but I submit that on closer inspection it doesn’t make the most sense. One issue is that if we adopt such an effect size definition, then we put ourselves in the strange situation where, even though the 3 conventional effects for this design (2 main effects and 1 interaction) all boil down to simple two-group comparisons and are thus naturally comparable, we have now arbitrarily defined a different measure of effect size for the interaction compared to the main effects. A consequence of this is that the effect size for the interaction is on a totally different scale and is thus not directly comparable to the effect sizes for the main effects. It’s hard to see why this would be a desirable property of an effect size definition. Maybe more importantly, it also makes the fact that the Study 2 interaction effect size would equal the Study 1 simple effect size pretty meaningless if they are on different scales.

To see why this puts the interaction and the main effects on different and non-comparable scales, consider the following. We can easily imagine extensions of this definition to higher designs such as 2x2x2, 2x2x2x2, and so on. In these cases the numerator of your effect size will contain more and more group means. Clearly as the number of means grows larger and larger, we expect on average a correspondingly bigger number in the numerator, simply because we’ve put more things in the numerator. Intuitively, we want to correct for this fact by also increasing the denominator by a proportional factor, so that the expected effect size is determined by just the magnitude of the differences and not the number of groups. That’s what my effect size does.

The effect size that I use is an incredibly straightforward extension of the classic definition of Cohen’s *d*. We literally use the classic definition but just with $\mu_1 = (\mu_{A1} + \mu_{B2})/2$ and $\mu_2 = (\mu_{A2} + \mu_{B1})/2$. (That’s technically where the 2 comes from in the 2×2 case.) In addition to the straightforward derivation, the effect size that I use keeps all of the effects in this 2×2 design on the same scale so that they are directly comparable. This is especially important in light of the fact that it is almost totally arbitrary what we choose to consider the main effects and what we choose to consider the interaction in a 2×2 design, as I discuss a little in this previous post of mine (see point #2 about two-thirds of the way into the post). The basic idea behind the effect size that I use is: All the effects in this case are two-group comparisons, and so we should just treat them all as two-group comparisons; that is, in a consistent way that is equivalent to classical definitions of effect size for two-group comparisons.

So anyway, I hope I’ve convinced you that, while it is technically possible to choose other definitions of effect size that would be constant across our hypothetical Studies 1 and 2, these alternative definitions don’t really make a lot of sense, for all the reasons I elaborated above.
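To illustrate the "same scale" point, here is a small Python sketch that treats each of the three effects in the 2×2 design as a two-group comparison. It assumes the cell means from the original example (B1 = 0.5, the rest 0) and a pooled SD of 1; the effect labels and the helper `two_group_d` are just names I picked for illustration.

```python
cell_means = {"A1": 0.0, "B1": 0.5, "A2": 0.0, "B2": 0.0}
sigma = 1.0

# Each effect splits the four cells into a +1 group and a -1 group.
effects = {
    "A vs. B": (["A1", "A2"], ["B1", "B2"]),
    "1 vs. 2": (["A1", "B1"], ["A2", "B2"]),
    "interaction": (["A1", "B2"], ["A2", "B1"]),
}


def two_group_d(means, plus_cells, minus_cells, sd):
    """Difference between the two group averages, in pooled-SD units."""
    mean_plus = sum(means[c] for c in plus_cells) / len(plus_cells)
    mean_minus = sum(means[c] for c in minus_cells) / len(minus_cells)
    return (mean_plus - mean_minus) / sd


d_by_effect = {
    name: two_group_d(cell_means, plus, minus, sigma)
    for name, (plus, minus) in effects.items()
}

for name, d in d_by_effect.items():
    print(f"{name}: d = {d:+.2f}")
```

With this pattern of means, all three effects have the same magnitude, which fits the idea that which one we call "the interaction" is largely a matter of labeling.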

Thanks for taking the time to respond to a complete stranger. I enjoy your ideas. Keep ’em coming.

Jake-I have had more time to think about the differences in how these interaction effects might be computed. And now I have more questions. I apologize for focusing the comments on this topic because I know this was not the main point of your blog (which is intriguing by the way). I hope you interpret my comments as a compliment; you really got me thinking today and I thank you for that.

I believe that an effect size for a 2×2 interaction should express the magnitude that an effect of one variable changes over the levels of the other variable. For this reason, a “difference of difference” scores approach seems sensible to me. The equation that you proposed takes the differences between the means of two cells but does not seem to capture the extent to which an effect of one variable changes over the levels of another variable. By averaging the means of two cells you may be potentially losing the “interaction” component of the interaction effect.

Let’s look at a few concrete examples (where all cells have an SD of 1 for easy computing). In your example of a 2×2 interaction there were 3 cells with a mean of zero and 1 cell with a mean of 0.5. Everybody would agree this pattern of means is an interaction–the effect of A on B changes from level 1 to level 2. If this were plotted the lines would not be parallel. For this pattern of means my interaction effect was d=0.5 and your interaction effect was d=0.25.

Now imagine that cells A1 and B1 both have means of 0.25 (i.e., A1 = 0.25 and B1 = 0.25). And imagine that cells A2 and B2 both have a mean of zero (i.e., A2 = 0 and B2 = 0). I believe that we would agree that there would be no interaction in such a design. If this was plotted the lines would be parallel. In this pattern of means, my interaction effect would be d=0, which makes sense to me because the effect of A is the same across both levels of the other variable (and vice versa). Moreover, this effect of d=0 is the same regardless of which way you decide to compute the “difference of difference” scores (i.e., ((A1-A2)-(B1-B2)) ==((A1-B1)-(A2-B2))). I also believe this was Uri’s approach to computing the interaction effect sizes in his “no-way interactions” blog post.

In comparison, your interaction effect could be d=0.25 (if you took the difference of the average of A1 and B1 and the average of A2 and B2; which would be 0.25-0 = 0.25) even though there would be no interaction present. Or it could be zero (if you took the difference of the average of A1 and A2 and the average of B1 and B2; which would be 0.125-0.125 = 0). In my mind, the interaction effect size in a 2×2 design shouldn’t change depending on which way you decide to compare the cells.

Am I missing something obvious here? Thanks in advance.

Hi Randy, I do take your comments as a compliment and not a challenge, thanks for the discussion ;)

I think all of your comments might stem from a simple misreading of the equation I wrote. Your comments seem to imply that the numerator of my effect size is (A1+B1)-(A2+B2). But it’s not…that would be the main effect of Factor Z from the diagram in my post…instead if you look again you’ll see that my numerator is (A1+B2)-(A2+B1), which you can algebraically rearrange to obtain your numerator, (A1-A2)-(B1-B2). In fact, my effect size is just your effect size divided by a constant; specifically, 2. So in the example you give, my effect size does not return d=0.25 as you suggest, it returns d=0, same as yours. I hope this helps to clear up the confusion.
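A quick Python sketch makes the relationship concrete. It compares the two definitions on both sets of cell means discussed in this thread, assuming a pooled SD of 1 throughout; the function names are mine, chosen only for illustration.

```python
def averaged_d(m, sd):
    """Two-group form: mean of (A1, B2) minus mean of (A2, B1), over the pooled SD."""
    return ((m["A1"] + m["B2"]) / 2 - (m["A2"] + m["B1"]) / 2) / sd


def difference_of_differences_d(m, sd):
    """Difference-of-differences form: (A1 - A2) - (B1 - B2), over the pooled SD."""
    return ((m["A1"] - m["A2"]) - (m["B1"] - m["B2"])) / sd


sigma = 1.0
# The example from the post: one cell at 0.5, the other three at 0.
post_example = {"A1": 0.0, "B1": 0.5, "A2": 0.0, "B2": 0.0}
# The no-interaction example from the comment: A1 = B1 = 0.25, A2 = B2 = 0.
parallel_example = {"A1": 0.25, "B1": 0.25, "A2": 0.0, "B2": 0.0}

for label, means in [("post", post_example), ("parallel", parallel_example)]:
    d_avg = averaged_d(means, sigma)
    d_dod = difference_of_differences_d(means, sigma)
    print(f"{label}: averaged = {d_avg:+.3f}, difference of differences = {d_dod:+.3f}")
```

In both cases the first value is exactly half of the second, and both are zero when the lines are parallel.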

Thank you for the fast response. We are on the same page now. I figured it was an easy math error on my part.

I’ve still got a lot to think about, but this certainly helped. Thanks, Jake.

For posterity: I just found this wonderful page on common misconceptions about factorial experiments which spends a lot of space making a similar argument to what I’ve said here, with some additional thoughts and explanation. Highly recommended!

Hi,

In the equation I see that you divide the means by the standard deviation. But which standard deviation should you use? Because you have four groups with four different standard deviations.

I was also wondering the same thing!