Logistic regression is not fucked

Summary: Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to. 

A few months ago I wrote a blog post on using causal graphs to understand missingness and how to deal with it, which concluded on a rather positive note:

“While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic…”

Well, today I write about another statistical issue with a basically optimistic message. So I decided this could be like the second installment in a “some things are not fucked” series, in which I look at a few issues/methods that some people have claimed are fucked, but which are in fact not fucked.1

In this installment we consider logistic regression — or, more generally, Binomial regression (including logit and probit regression).2

Wait, who said logistic regression was fucked?

The two most widely-read papers sounding the alarm bells seem to be Allison (1999) and Mood (2010). The alleged problems are stated most starkly by Mood (2010, pp. 67-68):

  1. “It is problematic to interpret [logistic regression coefficients] as substantive effects, because they also reflect unobserved heterogeneity.
  2. It is problematic to compare [logistic regression coefficients] across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.
  3. It is problematic to compare [logistic regression coefficients] across samples, across groups within samples, or over time—even when we use models with the same independent variables—because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.”

These are pretty serious allegations.3  These concerns have convinced some people to abandon logistic regression in favor of the so-called linear probability model, which just means using classical regression directly on the binary outcome, although usually with heteroskedasticity-robust standard errors. To the extent that’s a bad idea—which, to be fair, is a matter of debate, but is probably ill-advised as a default method at the very least—it’s important that we set the record straight.

The allegations refer to “unobserved heterogeneity.” What exactly is this unobserved heterogeneity and where does it come from? There are basically two underlying lines of argument here that we must address. Both arguments lead to a similar conclusion—and previous sources have sometimes been a bit unclear by drawing from both arguments more or less interchangeably in order to reach this conclusion—but they rely on fundamentally distinct premises, so we must clearly distinguish these arguments and address them separately. The counterarguments I describe below are very much in the same spirit as those of Kuha and Mills (2017).

First argument: Heteroskedasticity in the latent outcome

A standard way of motivating the probit model for binary outcomes (e.g., from Wikipedia) is the following. We have an unobserved/latent outcome variable Y^* that is normally distributed, conditional on the predictor X. Specifically, the model for the ith observation is

Y_i^* = \beta_0^* + \beta_1^*X_i + \varepsilon_i\text{, with }\varepsilon_i \sim \text{Normal}(0, \sigma^2).

The latent variable Y^* is subjected to a thresholding process, so that the discrete outcome we actually observe is

Y_i = \begin{cases} 1, & \text{if}\ Y_i^* \ge T \\ 0, & \text{if}\ Y_i^* < T\end{cases}.

This leads the probability of Y=1 given X=x to take the form of a Normal CDF in x, with location and scale determined by \beta_0^*, \beta_1^*, \sigma, and the deterministic threshold T. So the probit model is basically motivated as a way of estimating \beta_1^* from this latent regression of Y^* on X, although on a different scale. This latent variable interpretation is illustrated in the plot below, from Thissen & Orlando (2001). These authors are technically discussing the normal ogive model from item response theory, which looks pretty much like probit regression for our purposes.

Note the differences in notation: these authors use \theta in place of X, \gamma in place of T, u in place of Y, \mu in place of Y^*, and they write probabilities with T() instead of the usual P() or Pr().

We can interpret logistic regression in pretty much exactly the same way. The only difference is that now the unobserved continuous Y^* follows not a normal distribution, but a similarly bell-shaped logistic distribution given X. A theoretical argument for why Y^* might follow a logistic distribution rather than a normal distribution is not so clear, but since the resulting logistic curve looks essentially the same as the normal CDF for practical purposes (after some rescaling), it won’t tend to matter much in practice which model you use. The point is that both models have a fairly straightforward interpretation involving a continuous latent variable Y^* and an unobserved, deterministic threshold T.
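To make this latent-variable story concrete, here's a quick simulation sketch (mine, in Python, with arbitrary parameter values) of the threshold mechanism, checking that the implied probit probabilities match the simulated binary outcomes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
beta0_star, beta1_star, sigma, T = 0.0, 1.0, 1.0, 0.5  # arbitrary illustrative values

x = rng.normal(size=n)
y_star = beta0_star + beta1_star * x + rng.normal(scale=sigma, size=n)  # latent Y*
y = (y_star >= T).astype(int)                                           # observed binary Y

# The model implies Pr(Y=1 | X=x) = Phi((beta0* - T + beta1*x) / sigma),
# i.e., probit regression with intercept (beta0* - T)/sigma and slope beta1*/sigma.
implied = norm.cdf((beta0_star - T + beta1_star * 1.0) / sigma)
mask = np.abs(x - 1.0) < 0.05          # observations with X near 1
print(implied, y[mask].mean())         # model-implied vs. empirical probability at X = 1
```

Note that the binary data only identify the ratios (\beta_0^*-T)/\sigma and \beta_1^*/\sigma, never the latent parameters separately, and that identification gap is exactly the door the heteroskedasticity argument walks through.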

So the first argument for logistic regression being fucked due to “unobserved heterogeneity” comes from asking: What if the residual variance \sigma^2 is not constant, as assumed in the model above, but instead is different at different values of X? Well, it turns out that, as far as our observable binary outcome Y is concerned, this heteroskedasticity can be totally indistinguishable from the typically assumed situation where \sigma^2 is constant and Y^* is increasing (or decreasing) with X. This is illustrated in Figure 2 below.

Figure 2.

Suppose we are comparing the proportions of positive responses (i.e., Y=1) between two groups of observations, Group 1 and Group 2. To give this at least a little bit of context, maybe Group 2 are human subjects of some social intervention that attempts to increase participation in local elections, Group 1 is a control group, the observed binary Y is whether the person voted in a recent election, and the latent continuous Y^* is some underlying “propensity to vote.” Now we observe a voting rate of 10% in the control group (Group 1) and 25% in the experimental group (Group 2). In terms of our latent variable model, we’d typically assume this puts us in Scenario A from Figure 2: The intervention increased people’s “propensity to vote” (Y^*) on average, which pushed more people over the threshold T in Group 2 than in Group 1, which led to a greater proportion of voters in Group 2.

The problem is that these observed voting proportions can be explained equally well by assuming that the intervention had 0 effect on the mean propensity to vote, but instead just led to greater variance in the propensities to vote. As illustrated in Scenario B of Figure 2, this could just as well have led the voting proportion to increase from 10% in the control group to 25% in the experimental group, and it’s a drastically different (and probably less appealing) conceptual interpretation of the results.

Another possibility that would fit the data equally well (but isn’t discussed as often) is that the intervention had no effect at all on the distribution of Y^*, but instead just lowered the threshold for Group 2, so that even people with a lower “propensity to vote” were able to drag themselves to the polls. This is illustrated in Scenario C of Figure 2.
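The three scenarios are easy to verify numerically. Here's a small sketch (mine; the 10%/25% voting rates come from the example above) showing that a mean shift, a variance inflation, or a lowered threshold can each reproduce exactly the same pair of observed rates:

```python
from scipy.stats import norm

T = norm.ppf(0.90)             # threshold putting 10% of the control group above it
# Scenario A: shift the mean of Y* so that 25% now exceed T
shift_a = T - norm.ppf(0.75)
# Scenario B: keep the mean at 0 but inflate the SD so that 25% exceed T
sd_b = T / norm.ppf(0.75)
# Scenario C: leave Y* alone and lower the threshold to the 75th percentile
T_c = norm.ppf(0.75)

rate_a = 1 - norm.cdf(T, loc=shift_a, scale=1)
rate_b = 1 - norm.cdf(T, loc=0, scale=sd_b)
rate_c = 1 - norm.cdf(T_c, loc=0, scale=1)
print(round(shift_a, 3), round(sd_b, 3))  # 0.607 and 1.9
print(rate_a, rate_b, rate_c)  # all 0.25: the binary data cannot tell the scenarios apart
```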

So you can probably see how this supports the three allegations cited earlier. When we observe a non-zero estimate for a logistic regression coefficient, we can’t be sure this actually reflects a shift in the mean of the underlying continuous latent variable (e.g., increased propensity to vote), because it also reflects latent heteroskedasticity, and we can’t tell these two explanations apart. And because the degree of heteroskedasticity could easily differ between models, between samples, or over time, even comparing logistic regression coefficients to one another is problematic…if shifts in the underlying mean are what we care about.

Are shifts in an underlying mean what we care about?

There’s the crux of the matter. This entire first line of argument presupposes that all of the following are true for the case at hand:

  1. It makes any conceptual sense to think of the observed binary Y as arising from a latent continuous Y^* and a deterministic threshold.
  2. We actually care how much of the observed effect on Y is due to mean shifts in Y^* vs. changes in the variance of Y^*.
  3. We’ve observed only a single binary indicator of Y^*, so that Scenarios A and B from Figure 2 are empirically indistinguishable.

In my experience, usually at least one of these is false. For example, if the observed binary Y indicates survival of patients in a medical trial, what exactly would an underlying Y^* represent? It could make sense for the patients who survived—maybe it represents their general health or something—but surely all patients with Y=0 are equally dead! Returning to the voting example, we can probably grant that #1 is true: it probably does make conceptual sense to think about an underlying, continuous “propensity to vote.” But #2 is probably false: I couldn’t care less if the social intervention increased voting by increasing propensity to vote, spreading out the distribution of voting propensities, or just altering the threshold that turns the propensity into voting behavior… I just want people to vote!

Finally, when #1 and #2 are true, so that the investigator is primarily interested not in the observed Y but rather in some underlying latent Y^*, in my experience the investigator will usually have taken care to collect data on multiple binary indicators of Y^*—in other words, #3 will be false. For example, if I were interested in studying an abstract Y^* like “political engagement,” I would certainly view voting as a binary indicator of that, but I would also try to use data on things like whether that person donated money to political campaigns, whether they attended any political conventions, and so on. And when there are multiple binary indicators of Y^*, it then becomes possible to empirically distinguish Scenario A from Scenario B in Figure 2, using, for example, statistical methods from item response theory.

These counterarguments are not to say that this first line of argument is invalid or irrelevant. The premises do lead to the conclusion, and there are certainly situations where those premises are true. If you find yourself in one of those situations, where #1-#3 are all true, then you do need to heed the warnings of Allison (1999) and Mood (2010). The point of these counterarguments is to say that, far more often than not, at least one of the premises listed above will be false. And in those cases, logistic regression is not fucked.

Okay, great. But we’re not out of the woods yet. As I mentioned earlier, there’s a second line of argument that leads us to essentially the same conclusions, but that makes no reference whatsoever to a continuous latent Y^*.

Second argument: Omitted non-confounders in logistic regression

To frame the second argument, first think back to the classical regression model with two predictors, a focal predictor X and some covariate C:

Y_i = \beta^A_0 + \beta^A_1X_i + \beta^A_2C_i + \varepsilon^A_i.

Now suppose we haven’t observed the covariate C, so that the regression model we actually estimate is

Y_i = \beta^B_0 + \beta^B_1X_i + \varepsilon^B_i.

In the special case where C is uncorrelated with X, we know that \beta^B_1 =\beta^A_1, so that our estimate of the slope for X will, on average, be the same either way. The technical name for this property is collapsibility: classical regression coefficients are said to be collapsible over uncorrelated covariates.

It turns out that logistic regression coefficients do not have this collapsibility property. If a covariate that’s correlated with the binary outcome is omitted from the logistic regression equation, then the slopes for the remaining observed predictors will be affected, even if the omitted covariate is uncorrelated with the observed predictors. Specifically, in the case of omitting an uncorrelated covariate, the observed slopes will be driven toward 0 to some extent.

This is all illustrated below in Figure 3, where the covariate C is shown as a binary variable (color = red vs. blue) for the sake of simplicity.

Figure 3. In the classical regression case, the simple/unconditional regression line (black) has the same slope as the group-specific regression lines (red and blue). In the logistic regression case, they are not equal: the simple/unconditional regression line is much more shallow than the group-specific regression lines. In both cases there is no confounding: the predictor X and the grouping factor (color = red vs. blue) have zero correlation, that is, X has the same mean among the red points and the blue points.
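Non-collapsibility isn't a sampling artifact; it falls straight out of the arithmetic of averaging probabilities. Here's a small sketch (mine, with made-up coefficient values, and with X restricted to 0/1 to keep everything closed-form) in the spirit of Figure 3, using a binary covariate C independent of X:

```python
import math

def expit(z): return 1 / (1 + math.exp(-z))
def logit(p): return math.log(p / (1 - p))

b0, bX, bC = -1.0, 2.0, 2.0   # conditional model: logit Pr(Y=1|X,C) = b0 + bX*X + bC*C
# C ~ Bernoulli(0.5), independent of X: no confounding whatsoever
def p_marginal(x):
    return 0.5 * expit(b0 + bX * x) + 0.5 * expit(b0 + bX * x + bC)

conditional_slope = bX  # same slope at C=0 and C=1 by construction
marginal_slope = logit(p_marginal(1)) - logit(p_marginal(0))
print(conditional_slope, round(marginal_slope, 3))
```

Even with identical within-group slopes of 2 and zero confounding, the marginal slope works out to about 1.67: attenuated toward 0, exactly as in the right-hand panel of the figure.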

So now we can lay out the second line of argument against logistic regression. Actually, the most impactful way to communicate the argument is not to list out the premises, but instead to use a sort of statistical intuition pump. Consider the data in the right-hand panel of Figure 3. The slope (logistic regression coefficient) of X on Y is, let’s say, \beta=2 for both the red group and the blue group. But suppose the color grouping factor is not observed, so that we can only fit the simple/unconditional logistic regression that ignores the color groups. Because of the non-collapsibility of logistic regression coefficients, the slope from this regression (shown in black in Figure 3) is shallower, say, \beta=1. But if the slope is \beta=2 among both the red and the blue points, and if every point is either red or blue, then who exactly does this \beta=1 slope apply to? What is the substantive interpretation of this slope?

For virtually every logistic regression model that we estimate in the real world, there will be some uncorrelated covariates that are statistically associated with the binary outcome, but that we couldn’t observe to include in the model. In other words, there’s always unobserved heterogeneity in our data on covariates we couldn’t measure. But then—the argument goes—how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?

These are rhetorical questions. The implication is that no meaningful interpretation is possible—or, as Mood (2010, p. 67) puts it, “it is problematic to interpret [logistic regression coefficients] as substantive effects.” I beg to differ. As I argue next, we can interpret logistic regression coefficients perfectly well even in the face of non-collapsibility.

Logistic regression coefficients are about conditional probabilities

Specifically, we can write logistic regression models directly analogous to models A and B from above as:

f\big(\text{Pr}(Y_i=1|X_i=x,C_i=c)\big) = \beta^A_0 + \beta^A_1x + \beta^A_2c,

f\big(\text{Pr}(Y_i=1|X_i=x)\big) = \beta^B_0 + \beta^B_1x,

where f(\cdot) is the logit link function. As the left-hand-sides of these regression equations make clear, \beta^A_1 tells us about differences in the probability of Y as X increases conditional on the covariate C being fixed at some value c, while \beta^B_1 tells us about differences in the probability of Y as X increases marginal over C. There is no reason to expect these two things to coincide in general unless \text{Pr}(Y|X,C)=\text{Pr}(Y|X), which we know from probability theory is only true when Y and C are conditionally independent given X—in terms of our model, when \beta^A_2=0.

So now let’s return to the red vs. blue example of Figure 3. We supposed, for illustration’s sake, a slope of \beta^B_1=1 overall, ignoring the red vs. blue grouping. Then the first rhetorical question from before asked, “who exactly does this \beta^B_1=1 slope apply to?” The answer is that it applies to a population in which we know the X values but we don’t know the C values, that is, we don’t know the color of any of the data points. There’s an intuition that if \beta^A_1=2 among both the red and blue points, then for any new point whose color we don’t know, we ought to guess that the slope that applies to them is also 2. But that presupposes that we were able to estimate slopes among both the red and blue groups, which would imply that we did observe the colors of at least some of the points. On the contrary, let me repeat: the \beta^B_1=1 slope applies to a population in which we know the X values but we don’t know any of the C values. Put more formally, the \beta^B_1=1 slope refers to changes in \text{Pr}(Y|X) = \text{E}_C\big[\text{Pr}(Y|X,C)\big]; there is an intuition that these probabilities ought to equal \text{Pr}\big(Y|X,C=\text{E}[C]\big), but they are not the same: averaging probabilities over C is not the same as plugging the average C into the model, because the inverse link function is nonlinear.
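That last point deserves a one-line check. Averaging \text{Pr}(Y|X,C) over C and plugging \text{E}[C] into the model give genuinely different numbers (a quick sketch of mine, using made-up coefficients):

```python
import math

def expit(z): return 1 / (1 + math.exp(-z))

# logit Pr(Y=1 | X=1, C=c) = 1 + 2c, with C ~ Bernoulli(0.5) independent of X
marginal = 0.5 * expit(1) + 0.5 * expit(3)   # Pr(Y=1|X=1) = E_C[ Pr(Y=1|X=1,C) ]
plug_in  = expit(1 + 2 * 0.5)                # Pr(Y=1|X=1, C = E[C])
print(round(marginal, 3), round(plug_in, 3))  # 0.842 vs. 0.881: not the same thing
```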

The second rhetorical question from above asked, “how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?” The answer is that we interpret them conditional on all and only the covariates that were included in the model. Again, conceptually speaking, the coefficients refer to a population in which we know the values of the covariates represented in the model and nothing more. There’s no problem with comparing these coefficients between samples or over time as long as these coefficients refer to the same population, that is, populations where the same sets of covariates are observed.

As for comparing coefficients between models with different covariates? Here we must agree with Mood and Allison that, in most cases, these comparisons are probably not informative. But this is not because of “unobserved heterogeneity.” It’s because these coefficients refer to different populations of units. In terms of models A and B from above, \beta^A_1 and \beta^B_1 represent completely different conceptual quantities and it’s a mistake to view estimates of \beta^B_1 as somehow being deficient estimates of \beta^A_1. As a more general rule, parameters from different models usually mean different things—compare them at your peril. In the logistic regression case, there may be situations where it makes sense to compare estimates of \beta^A_1 with estimates of \beta^B_1, but not because one thinks they ought to be estimating the same quantity.

Footnotes and References

1 Or which, at least, are not fucked for the given reasons, although they could still be fucked for unrelated reasons.
2 This stuff is also true for some survival analysis models, notably Cox regression.
3 At least, I think they are… a definition of “substantive effects” is never given (are they like causal effects?), but presumably they’re something we want in an interpretation.

Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2), 186-208.

Kuha, J., & Mills, C. (2017). On group comparisons with logistic regression models. Sociological Methods & Research, 0049124117747306.

Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.

Pang, M., Kaufman, J. S., & Platt, R. W. (2013). Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Statistical Methods in Medical Research, 25(5), 1925-1937.

Rohwer, G. (2012). Estimating effects with logit models. NEPS Working Paper 10, German National Educational Panel Study, University of Bamberg.

Thissen, D. & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & Wainer, H. (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

What is the yearly risk of another Harvey-level flood in Houston?

You’ve likely heard that the flooding in Houston following Hurricane Harvey is reportedly a 500-year or even 1000-year flood event. You’ve perhaps also heard that this is the third 500-year flood that Houston has experienced in a three-year span, which calls into serious question the usefulness or accuracy of the “500-year flood” designation. This made me wonder: what’s our actual best estimate of the yearly risk for a Harvey-level flood, according to the data? That is the question I will attempt to answer here.

As a spoiler, I estimate that Harvey caused roughly 50 – 120 year flood events at most sites in Harris county (the Texas county where Houston is located) where we have data—but the estimates depend strongly on the specific site, with estimates at some sites being unremarkably small, and estimates at one site (Vince Bayou at Pasadena, TX) indeed hovering near the 1000 year neighborhood. In the rest of this post I will describe how I arrived at these estimates. I’ve posted an R Markdown HTML document which gives all the code for reproducing this analysis, plus some additional tidbits.

Disclaimer: I am not a hydrologist. In fact, I know almost nothing about hydrology. What I am is an out-of-work data scientist. I don’t claim that the analysis I present below is particularly savvy or sophisticated — it’s possible I’ve overlooked some issues that would be more obvious to the hydrologically initiated — but I do think my analysis at least addresses all or most of the major methodological issues that need to be handled in such an analysis.

The N-year flood concept

As the sources I linked above explain very clearly, the N-year flood is a probabilistic notion based on the estimated chance that flood waters will reach a certain height in any given year. In a nutshell, an N-year flood is defined as the flood water height that a given area has a 1/N probability of reaching in a particular year. For example, a 500-year flood has, by definition, a 1/500 (i.e., 0.2%) probability of occurring in a particular year. Now, it does not automatically follow that an N-year flood is expected to occur about every N years on average, but that conclusion does follow if we add the somewhat reasonable assumption that the flood events are independent from one year to the next1. That assumption is probably not exactly true, due to factors like El Niño / La Niña cycles (or Atlantic Niño in Harvey’s case), but probably a good enough approximation that the intuitive interpretation “500-year floods are expected every 500 years” is reasonably true.
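Under that independence assumption, the "expected every N years" reading checks out numerically, and three 500-year floods in three years really is fantastically unlikely. A quick sketch (mine, in Python; the post's own analysis is in R):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500                                   # "500-year flood": yearly probability 1/N
# Under independence, the waiting time between such floods is Geometric(1/N)
waits = rng.geometric(1 / N, size=200_000)
print(waits.mean())                       # ≈ N: expected recurrence every N years

# Probability that each of three given years has a "true" 500-year flood:
three_in_three = (1 / N) ** 3
print(three_in_three)                     # 8e-09: fantastically unlikely
```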

Now, it would be fantastically unlikely for Houston to experience three 500-year floods in three years if these were “truly” 500-year floods. A much more likely explanation is that the underlying risk estimate — 0.2% chance of occurring in a year — is based on a bad statistical model. We can use the historical stream gage2 height data from the U.S. Geological Survey (USGS; see also Harris county, TX’s very nice interface) to come up with a better estimate.

Here’s a very high-level outline of what the analysis looks like:

  1. Query the USGS web server to retrieve the last 20 years of gage height data from all stream gage sites in Harris County, TX.
  2. Detrend the data using a one-year rolling median.
  3. Estimate the cumulative distribution function (CDF) for each site’s data using kernel methods. Then locate Harvey in the CDFs to obtain the daily risk estimates.
  4. Convert the daily risk estimates to N-year flood estimates by making some assumptions about the form of temporal dependence.
  5. Bootstrap to obtain estimates of uncertainty.

Retrieving the data is not too exciting, so I won’t say much about it in this summary post other than to point out the USGS Daily Values Web Service, which is a really nice system the USGS has set up for retrieving nicely formatted data via HTTP requests. (For those interested in this sort of thing, all the details and syntax are documented in the R Markdown document.) Instead we’ll jump ahead to the part where we have a nice, clean dataset.

Detrending the data

The data are daily mean gage heights for 21 stream gages spread through Harris County, TX. There are more than 21 gage sites in the county, but these are the ones that had mean gage height data available3. Here’s what the data look like at each site:

Figure 1: The raw data.

It mostly looks pretty well-behaved, with the obvious exception being some huge, discontinuous jumps in gage height that appear at some of the sites. The top 6 sites show the biggest jumps; as you can see, these are all (and the only ones) from Cypress Creek. I have no idea what went down at Cypress Creek to cause these discontinuities, or why the jumps don’t at least all happen at or around the same time.

In principle what we care about in measuring these flood levels are the deviations from the typical stream heights, whatever those typical heights might be at that time, even if they have changed over the years (whether that’s through a slow drift or a big change all at once). In other words, we are interested in the detrended data. Plus, more practically speaking, we want to be able to still use the data prior to those discontinuous jumps in stream height that we saw above. One way or another, that requires accounting for the jumps by detrending them out.

A rolling median is a good candidate method for a nonparametric estimator of the trend line here because (a) it won’t be affected much by big, sudden shocks to the gage height (e.g., from floods), thereby leaving that variability to go into the CDF, which we want; and (b) in theory, a rolling median can model the discontinuous jumps much better than, say, a rolling mean because the median is not a “smooth” statistic like the mean is.

There is the question of the optimal window width for the rolling median (i.e., how many observations to take the median over in each direction). One idea would be to regard this as a hyperparameter to be tuned via cross-validation, but note that in this case minimizing the expected prediction error is not really our goal. Rather, our goal is to remove the biggest, coarsest trends in the data, while leaving all of the variability around that trend line in the data so that it will go into the estimated CDF. The trend line that minimizes the expected prediction error would actually follow the data a bit too closely for our purposes, getting pulled upward by large flood events and thereby eating up some of the variability that we want to keep. In other words, we want the trend line to underfit the data to some extent.

With a little experimentation I found that a one-year window worked pretty well. Here’s an example of what that looks like at a single representative site:

Figure 2: Black lines = data, Red line = rolling median trend line.

Looks great! The trend line remains relatively flat even in the presence of large local floods, but still perfectly captures the weird discontinuous jump in the time series as well as any very slow drifts over time.
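The actual analysis lives in the linked R Markdown document; purely as illustration, here's roughly what the one-year rolling-median detrending looks like in Python/pandas, on fake data with a Cypress-Creek-style jump and a short flood spike (all numbers made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 4 * 365
height = 5 + rng.normal(scale=0.3, size=n)   # baseline gage height plus noise
height[n // 2:] += 3                         # a discontinuous jump, as at Cypress Creek
height[1000:1005] += 4                       # a short flood spike

s = pd.Series(height)
# One-year centered rolling median; min_periods lets the series ends be estimated too
trend = s.rolling(window=365, center=True, min_periods=180).median()
detrended = s - trend
print(round(detrended.iloc[1000:1005].mean(), 2))  # the flood spike survives detrending (≈ 4)
```

The median-based trend steps through the jump but barely budges for the five-day spike, which is exactly the underfitting behavior we want: the flood variability stays in the detrended data and ends up in the CDF.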

Let’s fit one-year moving medians to all of the sites, detrend the data around those median lines, and then examine the resulting detrended data:

Figure 3: The detrended data.

It looks pretty good! There are a handful of odd downward spikes that are probably the result of the median line not capturing the discontinuous jumps as gracefully as possible, but these seem small enough not to worry about.

Estimating the CDFs

This step is pretty straightforward, at least thanks to the sROC package, which has nice functionality for kernel estimation of CDFs. Here’s an example of what the probability density function (PDF) and CDF look like for one representative site, and where the Harvey floods fall in this distribution.

Figure 4: Estimated PDF and CDF for one site.

Okay, so now we can look up the Harvey flood in the CDFs for each site, which will tell us the proportion of the distribution that is less than the Harvey flood. Let one minus this proportion be p_{day}, the probability of a flood at least as great as Harvey on a particular day. Note that it’s for a particular day, not a particular year, because our data are daily gage heights. Now we just need to convert this to p_{year}, the probability for a particular year, and then take 1/p_{year} to obtain the N-year flood estimate. So how do we do the conversion?
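sROC is an R package; as a sketch of what this step amounts to, here's a rough Python equivalent of a Gaussian-kernel CDF estimate (the bandwidth rule of thumb, the stand-in data, and the "Harvey" height are all placeholders I made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
detrended = rng.normal(scale=1.0, size=5000)   # stand-in for one site's detrended heights

def kernel_cdf(x, data, h=None):
    """Gaussian-kernel CDF estimate: average of Normal CDFs centered at the data points."""
    h = h or 1.06 * data.std() * len(data) ** -0.2   # rough bandwidth rule of thumb
    return norm.cdf((x - data) / h).mean()

harvey_height = 3.0                             # hypothetical Harvey gage height at this site
p_day = 1 - kernel_cdf(harvey_height, detrended)  # daily probability of a Harvey-level flood
print(p_day)
```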

Modeling the temporal dependence

If the daily flood events were independent then we could use

p_{year} = 1 - (1 - p_{day})^{365},

which is the probability of at least one “success” from a Binomial distribution with 365 trials. But the daily flood events aren’t really independent: if there’s a flood event on one day, that substantially raises the probability that there will still be a flood event on the next day, because it takes a few days for the water levels to recede to below flood level. For example, here’s what the data look like from the big floods in Houston last year, summer 2016, again from our favorite representative gage site:

Figure 5: Gage data from Summer 2016 flooding near Houston. There are five N-year flood days in a row.

I drew a hypothetical flood height level to represent a particular N-year flood, just for illustrative purposes (I have no idea what flood level this line would actually correspond to). We can see that in this particular case, we had a string of five N-year flood days in a row. More generally, this should make it clear that we expect the daily flood events to be dependent in that they tend to come in clusters. So to convert p_{day} to p_{year} for Harvey, we need a model of this dependence.

A simple model is that the daily flood events show Markov or one-day-back dependence: the probability of a flood event today depends on the presence of a flood event yesterday, but is conditionally independent of flooding 2 days ago given the flood status yesterday. For this model we add a parameter p_1: the conditional probability of a flood event, given that there was a flood event the previous day. Then the conversion formula from above becomes4

p_{year} = 1 - (1 - p_{day})\big[1 - p_{day}(1-p_1)/(1-p_{day})\big]^{364},

which reduces to the previous formula when p_1 =p_{day}, meaning the events are independent.
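Both conversion formulas are two-liners. As a sanity check (my sketch), the Markov version collapses to the independence version when p_1 = p_{day}, and clustering (here p_1 = 0.75) pulls p_{year} down:

```python
def p_year_independent(p_day):
    return 1 - (1 - p_day) ** 365

def p_year_markov(p_day, p1):
    """p1 = Pr(flood today | flood yesterday), with stationary daily rate p_day."""
    q = p_day * (1 - p1) / (1 - p_day)   # Pr(flood today | no flood yesterday)
    return 1 - (1 - p_day) * (1 - q) ** 364

p_day = 0.001
print(p_year_independent(p_day))          # the independence formula...
print(p_year_markov(p_day, p_day))        # ...recovered exactly when p1 = p_day
print(p_year_markov(p_day, 0.75))         # clustering lowers the yearly probability
```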

To get a feel for how this new p_1 parameter affects the model estimates, we can plot the N-year estimates from all 21 sites as a function of the assumed value of p_1 (with p_{day} set to the previously estimated values):

Figure 6: N-year flood estimates each of the 21 sites, as a function of the dependence (p_1) assumptions.

As you can see, the assumed value of p_1 doesn’t have a huge impact until it gets up to around, say, 0.6 or so, but then it starts to matter quite a lot. To think about what plausible values of p_1 might be, it helps to notice that, according to our dependence model, flood durations follow a Geometric distribution with parameter 1-p_1, implying that the expected flood duration is 1/(1-p_1) days. So, for example, when p_1=0.75, then we expect it to take an average of 4 days before water levels recede to below the relevant N-year flood level (but not necessarily back to normal levels).

We’ve already developed a good estimate of p_{day} using the kernel CDFs. To estimate p_1, we can basically (a) choose a particular flood height, F; (b) categorize the gage heights at any given time point as being either above or below F, giving us H^*_t=1 or H^*_t=0, respectively; and (c) create a contingency table of H^*_t vs. H^*_{t-1} (the H^* value on the previous day), which gives us a table like:

Figure 7

from which we can compute an estimate of p_1 as d/(b+d). Now this estimate, of course, depends on the flood height F that we chose, and it’s certainly possible that there could really be different values of p_1 for different flood heights. So here we plot the estimated values of p_1 for all 21 sites, for F heights ranging from the 90th percentile to whatever percentile corresponds to the Harvey floods at each site (in other words, 1-p_{day} for that site):

Figure 8: Dependence/clustering (p_1) estimates for each of the 21 sites, as a function of the flood height (F) in quantile units. The estimates are predictions from a weighted GAM fit to the simple estimates (in logit-logit space) that result from the procedure illustrated by Figure 7, with weights proportional to the number of data points below F. The right end-points of each curve are the flood quantiles at which Harvey is located for that site; note that these are equal to 1-p_{day}.

Immediately we see that there is considerable variation across gages in their level of dependence, and that the degree of dependence depends on the flood height (in an interesting, non-monotonic way). Because the p_1 estimates depend fairly strongly on the site and on the height of flooding, it wouldn’t be a good idea to use an aggregate p_1 estimate that averages across all sites and/or all flood heights. Instead, for our N-year flood estimates we will use each site’s individual p_1 estimate at the value F corresponding to the Harvey flood height at that site.
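The simple per-height estimate that feeds the GAM can be sketched as follows (a hypothetical Python helper; the actual analysis is in R, and I'm assuming the cell labeled d counts days where both yesterday and today exceed F, while b counts days where only yesterday did):

```python
def estimate_p1(above):
    """Simple estimate of p1 = P(flood today | flood yesterday) from a
    binary series `above` (1 if the gage height exceeds the chosen
    flood height F on that day). This is the d/(b+d) estimate from the
    lagged 2x2 table; the post then smooths many such estimates with a
    weighted GAM."""
    b = d = 0
    for prev, cur in zip(above[:-1], above[1:]):   # (yesterday, today)
        if prev == 1:
            d += cur        # above F yesterday and today
            b += 1 - cur    # above F yesterday, below today
    return d / (b + d) if (b + d) else float("nan")
```

For example, the series [0, 1, 1, 1, 0, 0, 1, 0] has four flood days followed by another day, two of which stay flooded, giving an estimate of 0.5.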

We now have point estimates of p_{day} and p_1 for all sites, which means we could technically now compute our N-year flood estimates for each site. But to use only the point estimates would be to ignore the substantial uncertainty associated with these estimates! Furthermore, because we obtain the N-year estimates from a convex transformation of p_{day} and p_1, using only the point estimates would lead us to systematically underestimate the N-year flood levels (due to Jensen’s inequality). So we will use bootstrapping to preserve the variability of the estimates.
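Here is a tiny numeric illustration of the Jensen's inequality point, using made-up uncertainty in p_{day} and the simpler independent-days conversion (the numbers are purely illustrative, not the actual estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical spread of bootstrap estimates of p_day (illustrative only)
p_day = rng.beta(2, 2000, size=50_000)

def n_year(p, days=365):
    """N-year return level implied by a daily probability (independence
    version); this is a convex, decreasing function of p."""
    return 1 / (1 - (1 - p) ** days)

plug_in = n_year(p_day.mean())     # N computed from the point estimate
averaged = n_year(p_day).mean()    # mean N over the uncertainty
# Jensen: averaged > plug_in, i.e., the plug-in N is an underestimate
```

Because n_year is convex over the whole range, the inequality holds for any sample, not just this seed.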

Bootstrapping to account for uncertainty

Because our data are time series, a simple nonparametric bootstrap is not a good idea because the temporal structure of the data (such as the autocorrelation) would be destroyed in each bootstrap sample, which would ruin our p_1 estimates (although the p_{day} estimates would probably still be okay). So here we will apply a more specialized bootstrapping method designed to work with time series data, called the stationary bootstrap. Then we will basically treat each bootstrap-resampled distribution as a poor man’s posterior distribution and go from there. Here’s what we get when we run the stationary bootstrap on a single, representative site:

Figure 9
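For reference, the resampling scheme itself is simple to sketch. Here is a from-scratch Python version of the stationary bootstrap of Politis & Romano (1994); the actual analysis presumably used an existing R implementation:

```python
import random

def stationary_bootstrap(x, mean_block=30, rng=None):
    """Draw one stationary-bootstrap resample: concatenate blocks that
    start at uniformly random positions and have Geometric lengths with
    mean `mean_block`, wrapping around the end of the series. Keeping
    contiguous blocks preserves the local temporal dependence that an
    i.i.d. resample would destroy."""
    rng = rng or random.Random()
    n = len(x)
    out = []
    while len(out) < n:
        start = rng.randrange(n)
        length = 1
        while rng.random() < 1 - 1 / mean_block:   # geometric block length
            length += 1
        out.extend(x[(start + i) % n] for i in range(length))
    return out[:n]
```

Each resample has the same length as the original series, so p_{day} and p_1 can be re-estimated on it exactly as on the real data.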

…Huh. The bootstrap results are not totally crazy, but there are clearly a couple of issues. First and most obviously, the bootstrap estimates of p_{day} have come out sort of discretized for some reason—there are only 6 unique values—and with a substantial fraction of the estimates equal to exactly zero (which makes them unusable). Do we really think the posterior distribution for p_{day} truly has this kind of shape? Well, no, probably not. So, if you’ll indulge me in some admittedly ad-hoc adjustment, I’d like to apply a correction to “smooth out” the marginal distribution of the p_{day} estimates. You can think of this as applying a bit of regularization “by hand.”

Now, I’m willing to trust that the mean and standard deviation of this distribution are more or less right; it’s just the shape that I don’t trust. So I’ll apply a two-stage5 transformation that reshapes the empirical distribution into the Beta distribution with the same mean and variance. Note that we “reshape” the existing distribution of values, rather than simply taking new draws from the corresponding Beta distribution, so that we preserve the form of the dependence between the p_{day} and p_1 estimates. So here’s a graphical comparison of the old and new marginal distributions:

Figure 10
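The reshaping step can be sketched like this (a hypothetical Python re-implementation of the two-stage transformation; see footnote 5 for the rank transform):

```python
import numpy as np
from scipy import stats

def reshape_to_beta(x, rng=None):
    """Reshape a sample to the Beta distribution with the same mean and
    variance, preserving each value's rank (ties broken at random, so a
    discretized sample becomes smooth). A sketch, not the original code."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    m, v = x.mean(), x.var()
    t = m * (1 - m) / v - 1               # method-of-moments Beta fit
    a, b = m * t, (1 - m) * t
    # stage 1: ranks (random tie-breaking) -> approximately Uniform(0, 1)
    perm = rng.permutation(n)
    ranks = np.empty(n)
    ranks[perm] = stats.rankdata(x[perm], method="ordinal")
    u = ranks / (n + 1)
    # stage 2: Uniform -> Beta via the quantile function
    return stats.beta.ppf(u, a, b)
```

Because each value keeps its rank, the dependence between the p_{day} and p_1 estimates survives the transformation, which is the whole point of reshaping rather than redrawing.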

Now if we turn our attention to the bootstrap estimates of p_1, we see a less obvious but still somewhat strange issue, which is that the conditional p_1 distribution at p_{day}=0 has a weird bimodal shape—in particular, there is an odd cluster of estimates that are effectively 0. Closer inspection of the p_1 estimates as a function of the flood height F sheds some light on what’s going on here:

Figure 11. Left panel: Dependence/clustering (p_1) estimates as a function of the flood height (F) in quantile units. Green line = Estimates from the actual dataset, Black/gray lines = Estimates from 1000 bootstrap-resampled datasets. Right panel: Density of bootstrap estimates at the uppermost flood quantile (the quantile where the Harvey floods are). Blue/red lines = Estimates from Gaussian mixture model separating the “true” bootstrap distribution from contaminant/divergent estimates.

Recall from the Figure 8 caption that the p_1 estimates are predicted values from a weighted GAM fit (on the logit-logit scale) to the simple estimates resulting from the procedure illustrated in Figure 7, because the simple estimates are rough and unstable, particularly at extreme values of the flood height F. Basically what is happening here is that, for a nontrivial minority of the bootstrap-resampled datasets, the simple estimates become empirically undefined (i.e., they involve division by zero) above a certain value of F, because by chance the resampled dataset ended up with no data points near or beyond that value of F. So when that happens, the GAM simply does a linear extrapolation from the highest non-missing F to the F value corresponding to Harvey, and uses the extrapolated value as the bootstrap estimate. That works fine in a lot of cases, but in other cases it can yield wacky/divergent estimates, depending on where along the F sequence the simple estimates become missing.

It seems like a good idea to discard the divergent bootstrap estimates. One way to do this is to fit a mixture model to the bootstrap estimates, use the model to classify each estimate as either probably coming from the “true” bootstrap distribution or probably coming from a (much wider) contaminant distribution, and then to discard the estimates classified as probably being contaminants. That process is illustrated in the right panel of Figure 11, which shows that it works pretty well, at least in the sense that the mixture model seems pretty plausible.
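A sketch of that classify-and-discard step, using scikit-learn's GaussianMixture (treating the lower-variance component as the "true" bootstrap distribution is my assumption about how the components were identified):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def drop_contaminants(estimates, seed=0):
    """Fit a two-component Gaussian mixture to 1-D bootstrap estimates
    and keep only those classified into the narrower component,
    discarding the (much wider) contaminant component."""
    x = np.asarray(estimates, dtype=float).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=seed).fit(x)
    narrow = int(np.argmin(gm.covariances_.ravel()))  # narrower = "true"
    keep = gm.predict(x) == narrow
    return x.ravel()[keep]
```

With a tight cluster of genuine estimates and a diffuse cloud of divergent ones, the hard classification from the mixture does the pruning automatically.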

After applying this correction, along with the previously described adjustment to the p_{day} estimates, we end up with an adjusted joint distribution of bootstrap estimates that looks like this:

Figure 12. Compare to Figure 9.

So if we compute the adjusted bootstrap distributions for all 21 sites, then after much ado, that gives us the following final estimates for each site:

Figure 13. Final bootstrap distributions for all sites. Lines = medians, shaded regions = 50% highest posterior density (HPD) intervals.

Conclusion and limitations

As you can see, most of the N-year flood estimates have a lot of uncertainty around them. Estimating the risk of extremely rare events is an inherently hard statistical problem, so it’s not surprising that it’s hard to pin down precise estimates.

Of course, the method I’ve applied here is just one way to approach the problem. Basically it’s the first thing I thought of. I initially intended this to be a rough-and-ready analysis to satisfy my curiosity, but after following the rabbit hole of Markov dependence and bootstrapping complications a bit further than I anticipated, I’m not so sure the “rough-and-ready” description still applies.

There is a temptation to try to aggregate in some way the individual N-year flood estimates at each site, in an attempt to estimate the N-year flood level of the Harvey floods as a single, atomic event. After all, media reports of the risk/severity of the Harvey floods just present a single figure like “500 year flood” or “1000 year flood,” not broken down by site. This kind of aggregation is problematic in at least a couple ways. First, on a more conceptual level, it’s not clear which sites should be aggregated over. A simple approach might be to simply aggregate over all the sites in Harris county, which is sort of what I did (I mean, I avoided aggregation, but I did simply analyze all the sites in Harris county). According to my girlfriend—who, unlike me, actually does know something about hydrology—what would probably make more sense would be to aggregate over all the sites within a particular watershed, ignoring geologically arbitrary county lines. Second, on a more methodological level, the statistical problem of estimating the probability of a rare event across the entire constellation of 21 sites is quite formidable. For example, a simple extension of my method would involve first estimating a 21-dimensional joint density of the flood heights at all gage sites… the thought of which is enough to make one cringe. (I actually spent quite a while trying to get this to work, but ultimately just didn’t trust the answers I was getting, so I decided not to post any of that here.)

The method I used here does not take into account the fairly obvious possibility that the yearly risk of Harvey-level flood events may not be constant, but instead may be increasing over time. Including such time-dependence in the model would make this already difficult statistical problem even more challenging. The increasing-risk hypothesis is certainly plausible, but it is quite complicated to think about because it depends on a lot of factors, such as the rate and nature of current/future urban development in the Houston area, preparatory measures that Houston may undertake, the possibly increasing frequency of large hurricanes due to climate change, etc. Regarding this last point, it seems that the evidence for an increasing frequency of Category 4/5 hurricanes is rather mixed.

Finally, there is some reason to think that almost any method applied immediately in the wake of Hurricane Harvey will lead to some degree of overestimation of the probability of similar events in the future (or, equivalently, N-year flood estimates that are too small). This point is actually quite subtle, but bear with me. Hypothetically, across many possible timelines, if we always waited for a particular event (such as a flood of a certain size) to occur and then estimated the probability of that event immediately after it occurred, our probability estimates would end up inflated to some extent (on average across the timelines) due to a form of selection bias that seems related to the recently renewed discussion of hot hand effects. For a simplified model of how this selection bias would work, consider a situation where we are observing a sequence of tosses of an unfair coin that comes up heads (H) with probability h, and we construct an estimate \hat{h} by waiting until we observe the first H and then computing the proportion of H outcomes in the sequence. Then

\text{E}[\hat{h}] = \sum_{k=1}^{\infty}G(k,h)/k,

where G(\cdot) is the probability mass function of the Geometric distribution (the number of tosses up to and including the first H). This sum has the closed form -h\ln(h)/(1-h), which exceeds h for every h strictly between 0 and 1, and by quite a lot when heads are rare. For example, if h=.01, then \text{E}[\hat{h}] \approx .046, so that the estimates overstate the probability of heads by almost a factor of 5 on average. This is not exactly what’s going on in our Harvey analysis, but it’s quite similar. It’s hard to know what to do (if anything) about this kind of selection bias, but it’s interesting to think about.
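As a sanity check on the formula, we can sum the series numerically (a quick Python sketch; `expected_hhat` is a made-up name):

```python
def expected_hhat(h, kmax=20_000):
    """E[h-hat] when we wait for the first heads and then report the
    proportion of heads so far: if the first H occurs on toss k (a
    Geometric outcome with trials-pmf h*(1-h)^(k-1)), the reported
    proportion is 1/k. Truncating at kmax leaves negligible error for
    the h values considered here."""
    return sum(h * (1 - h) ** (k - 1) / k for k in range(1, kmax + 1))
```

At h = .01 the sum comes out near .046, matching the factor-of-almost-5 inflation mentioned in the text.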


1 More specifically, that floods follow a Bernoulli process over years or a Poisson process in continuous time.

2 The USGS consistently uses the spelling “gage” instead of the more familiar (to me) “gauge.” Your guess as to why is as good as mine. Anyway, I’ll follow the USGS’s spelling convention.

3 Note that I use the daily means here, even though arguably it might make more sense to use the daily maximums. But unfortunately the daily maximums are only available for 3 of the gages in Harris county.

4 See the R Markdown document for the derivation of this.

5 The first step of the transformation—transforming the observed distribution to a Uniform distribution—is achieved in this case by taking X' = \text{rank}(X)/(n+1), where n is the number of X values, and breaking ties randomly so that the 6 unique values become a smooth distribution.

Using causal graphs to understand missingness and how to deal with it

If you’re reading this, you probably know that missing data can cause a lot of problems in a data analysis, from reduced efficiency at best to seriously mistaken conclusions at worst. You may even be able to recite the technical definitions of Rubin’s three types of missing data. Those definitions are all well and good, but if you’re like me, you don’t necessarily have such an easy time applying that knowledge to concrete data situations and determining how to proceed.

Well, I just discovered a new tool for reasoning about missingness that I’m pretty excited about, and I’d like to share it with you. Basically the idea is to use causal graphs to graphically represent our assumptions about the patterns of causality among the observed variables and their missingness mechanisms. And then we can visually follow simple path-tracing algorithms along the graph to answer questions like

  • What parameters (such as regression coefficients or population means) can I recover—meaning construct estimates that are consistent with what we would get with no missing data—by just analyzing the non-missing cases (i.e., listwise deletion)? This is called ignorable missingness.
  • For missingness that is ignorable, which auxiliary variables should I condition on (e.g., statistically adjust for or include in an imputation model), and which should I actually avoid conditioning on?

This approach works so well because causal graphs are all about deducing which observed variables should be conditionally independent given which other variables—and it so happens that Rubin’s theory of missing data is naturally phrased in terms of conditional independencies.

Directed Acyclic Graphs (DAGs)

There’s a lot to say about DAGs… frankly too much to fit in one tutorial-style blog post. For a crash course in what they are and how they work, I’d recommend the resources listed here. In the interest of keeping this post as concise as possible, I’ll assume you either have some prior familiarity with DAGs or that you’ve perused some of the crash course material I just referenced, but I will at least provide brief reminders about key concepts as they are needed.

The big idea

The big idea is to augment the usual DAGs by adding nodes that represent the missingness in any variables that contain missing values, creating a so-called m-graph. For example, here’s a DAG for a situation where we have a predictor of interest X, an outcome Y, and two concomitant variables W and Z:

Figure 1: DAG without missingness

The causal model underlying this DAG assumes that

  • X directly causes Y and Z
  • Z and Y are associated through an unobserved confounder1
  • W causes Y independently of X.

Now we suppose that Y and X contain missing values, so we add corresponding missingness nodes R_Y and R_X. Specifically, we create the following m-graph:

Figure 2: An m-graph

What are we saying about the missingness in Y and X when we posit this causal model? We are saying that

  • The missingness in X is directly caused by the (partially observed) values of X itself. An example of this would be if high-income respondents to a survey were less likely to disclose their incomes specifically because they didn’t want their high income to be known. In other words, high values are missing because they are high values. In Rubin’s terminology, X is missing not at random (MNAR).
  • The missingness in Y is statistically associated with the values of Y, but this association is entirely due to the common cause W. An example of this would be if a high level of education (W) both causes income (Y) and causes respondents to (for some reason) be less likely to report their income (R_Y). In Rubin’s terminology, Y is missing at random (MAR) given W.2

Now that we’ve laid out our assumptions about how missingness is related to the relevant observed and unobserved variables, we can apply relatively simple graphical criteria to help answer the kinds of questions I laid out near the beginning of this post. These criteria are due to the amazing work of Karthika Mohan and colleagues (see References). Here I’m only going to focus on the conditions for recovering regression coefficients, but this research also lays out criteria for recovering the full joint, conditional, or marginal distributions of all variables in the graph.

Conditions for recovering regression coefficients

We have two kinds of conditions for recovering the coefficients from a regression predicting Y: one necessary, the others sufficient. The necessary condition is:

  1. Y (the outcome) cannot directly cause R_Y (or vice versa, although that would be strange).

We can see in the m-graph that Condition 1 is satisfied. Although Y and R_Y are statistically related (through their common cause W), Y does not directly cause R_Y. The predictor X does directly cause its missingness node R_X, but this doesn’t matter for estimating a regression of Y on X. To give an intuition about why we can have MNAR in X but not Y, consider this figure from Daniel et al. (2012):

Figure 3: We can have MNAR in the predictor A, but not the outcome Y. (From Daniel et al., 2012)

For simplicity, the figure considers a crude missing data mechanism where the observation is simply removed if it exceeds some fixed value (i.e., truncated). In the top panel (a), the missingness depends directly on the value of the predictor A. Despite this, we can still recover the slope simply by analyzing the complete cases. In the bottom panel (b), the missingness depends directly on the value of the outcome Y. As the figure shows, this has a distorting influence on the slope which precludes the consistent recovery of the regression coefficient.

Meeting the necessary Condition 1 alone does not guarantee that we can recover the regression coefficient of interest; it just says that it might be possible. A sufficient condition for recoverability of the regression of Y on p predictors X_1, X_2, \ldots, X_p is:

  2. Y is d-separated from the missingness nodes R_Y, R_{X1}, R_{X2}, \ldots, R_{Xp} by the predictors X_1, X_2, \ldots, X_p.

Recall that two nodes A and B are d-separated by C if conditioning on C blocks all open paths between A and B (keeping in mind that colliders act “in reverse”: colliders block open paths unless they are conditioned on). If Condition 2 is satisfied, then we can recover the regression of Y on X_1, X_2, \ldots, X_p simply by analyzing the non-missing observations at hand.

Glancing at the m-graph in Figure 2 we can see that the regression of Y on X does not meet Condition 2, because Y and R_Y remain d-connected through W, which is not in the set of predictors.

So basically we have two options for how to proceed. The first and easier option is to consider instead a regression of Y on both X and W. Essentially, we decide that rather than seeking \beta_{YX}, the simple regression coefficient of Y on X, we will settle instead for \beta_{YX.W}, the partial regression coefficient that adjusts for W. In that case, Condition 2 is satisfied because this new predictor W d-separates Y from R_Y. So we could recover the coefficients from this new multiple regression simply by analyzing the non-missing observations at hand.
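If you want to check d-separation claims like these mechanically, the moralization criterion is easy to implement. Below is a self-contained Python sketch (node labels like "RY", and the explicit confounder "L" standing in for the double-headed arrow per footnote 1, are my notation; tools like dagitty or networkx do this for real):

```python
def d_separated(edges, xs, ys, zs):
    """Test whether xs and ys are d-separated by zs in the DAG given by
    `edges` (a list of (parent, child) pairs), via moralization:
    restrict to ancestors of the relevant nodes, undirect the edges,
    marry co-parents, delete the conditioning set, and check that no
    path remains."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    # ancestors (inclusive) of all relevant nodes
    anc, stack = set(), list(set(xs) | set(ys) | set(zs))
    while stack:
        node = stack.pop()
        if node not in anc:
            anc.add(node)
            stack.extend(parents.get(node, ()))
    # moral graph on the ancestral set
    und = {node: set() for node in anc}
    for p, c in edges:
        if p in anc and c in anc:
            und[p].add(c); und[c].add(p)
    for c in anc:
        ps = [p for p in parents.get(c, ()) if p in anc]
        for i, a in enumerate(ps):
            for b in ps[i + 1:]:
                und[a].add(b); und[b].add(a)
    # remove conditioned nodes and test reachability from xs to ys
    live = anc - set(zs)
    seen, stack = set(), [x for x in xs if x in live]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(nb for nb in und[node] if nb in live)
    return not (seen & set(ys))
```

Running it on the Figure 2 m-graph confirms that Y and R_Y are d-connected given X alone but d-separated given {X, W}.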

The second option is for when we really do want \beta_{YX} and not \beta_{YX.W}. In the graph from Figures 1 and 2, there’s not really any good reason for preferring one over the other, since the graph implies that X and W are independent in the population and thus \beta_{YX} = \beta_{YX.W}.3 So in that case we may as well just estimate \beta_{YX.W}, since that would satisfy Condition 2. But what if the situation were slightly different, say, with X being a cause of W?

Figure 4: W is an effect of X.

In this new graph, the path X \rightarrow W \rightarrow Y is part of the total causal effect of X on Y, so we don’t want to condition on W. In this case, instead of Condition 2 we consider a new condition that allows for a set of q auxiliary variables A_1, A_2, \ldots, A_q that help d-separate Y from the relevant missingness nodes, but that are not conditioned on in the regression of Y on the predictors X_1, X_2, \ldots, X_p. The (sufficient) condition is:

  3. Y is d-separated from the missingness nodes R_Y, R_{X1}, R_{X2}, \ldots, R_{Xp}, R_{A1}, R_{A2}, \ldots, R_{Aq} by the set of predictors and auxiliary variables X_1, X_2, \ldots, X_p, A_1, A_2, \ldots, A_q. Furthermore, for any auxiliary variable A_i that contains missing values, we can recover the coefficient from regressing A_i on the predictors X_1, X_2, \ldots, X_p by Condition 2 or by a recursive application of Condition 3.

If Condition 3 is satisfied, then \beta_{YX} is recoverable by constructing an estimate from (a) the regression of Y on the predictors and auxiliary variables and (b) the regressions of each auxiliary variable on the predictors. For example, in the simple case where the structural equation for Y is linear and additive in X and W, we can use the well-known fact that

\underbrace{\beta_{YX}}_{\text{total effect}} = \underbrace{\beta_{YX.W}}_{\text{direct effect}} + \underbrace{\beta_{WX}\beta_{YW.X}}_{\text{indirect effect}}

In the present case, conditioning on the auxiliary variable W satisfies Condition 3 because (a) W d-separates Y from R_Y, and (b) W contains no missing values.
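To see the decomposition in action, here is a small simulation sketch of the Figure 4 graph with linear, additive structural equations (the coefficients are arbitrary choices of mine, and the identity holds exactly for OLS, not just asymptotically):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# X -> W -> Y and X -> Y, all linear and additive
X = rng.normal(size=n)
W = 0.5 * X + rng.normal(size=n)
Y = 1.0 * X + 2.0 * W + rng.normal(size=n)

def slopes(y, *preds):
    """OLS slopes of y on the given predictors (intercept included)."""
    Z = np.column_stack([np.ones(len(y))] + list(preds))
    return np.linalg.lstsq(Z, y, rcond=None)[0][1:]

b_yx = slopes(Y, X)[0]             # total effect
b_yx_w, b_yw_x = slopes(Y, X, W)   # direct effect, and W's coefficient
b_wx = slopes(W, X)[0]
# total = direct + indirect: b_yx == b_yx_w + b_wx * b_yw_x
```

Here the total effect is 1.0 + 0.5 × 2.0 = 2.0, and the in-sample decomposition matches to machine precision.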

Necessity, sufficiency, and functional forms

So now let’s return to the m-graph in Figure 2, where X does not cause W. We see that if we do not condition on W by adjusting for it in the regression model, then we still meet the necessary condition for recoverability (Condition 1), but we fail to meet the sufficient conditions for recoverability (Conditions 2 and 3). So we fall in a sort of in-between case… is the estimate of \beta_{YX} recoverable or not? (Note that we saw above that it is recoverable if we apply the procedure dictated by Condition 3, but here we are asking if we can “directly” recover the estimate just from the simple regression of Y on X among the non-missing cases.)

It’s impossible to have complete necessary-and-sufficient conditions in this situation because \beta_{YX} may or may not be directly recoverable, depending on the functional forms of the structural equations. Remember that a DAG is totally nonparametric in that the structural equations it specifies can have any functional form and the variables can have any distributions (within reason). So for the DAG in Figures 1 and 2, the regression of Y on X and W could be a linear, additive function of X and W, or an interactive/multiplicative function, or any crazy nonlinear function. It turns out that, in this particular case, \beta_{YX} is directly recoverable if the effects of X and W on Y are linear and additive, but is not directly recoverable if the regression involves an X \times W interaction. This figure should help to give some intuition about why that’s the case:

Figure 5: The additive and interactive models have the same DAG, but \beta_{YX} is only directly recoverable in the additive model. The W=0 group (black) has no missing data, while the W=1 group (red) has 50% missing data. Open circles represent missing data points.

For simplicity, Figure 5 considers an example where the auxiliary variable W is a binary dummy variable. In the additive example, simply analyzing the non-missing cases yields the same estimate (on average) as if all the data were fully observed—whether or not we condition on W. In the interactive example, the group with more missing data has a greater slope, so analyzing the non-missing cases skews the total estimate toward the group with the smaller slope. But by conditioning on W, we can either (a) recover the conditional effect of X on Y given W (through Condition 2) simply by analyzing the non-missing cases, or (b) recover the marginal effect of X on Y (through Condition 3) by taking a weighted mean of the regression lines at W=0 and W=1, with the weights proportional to the number of cases in each group prior to missingness (which we know, despite the missingness, since the missing values still have partial records on the other observed variables).

Concluding remarks

While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic. It’s often assumed that having data missing not at random (MNAR) basically trashes your analysis if a non-trivial fraction of data are missing, which is frequently true in observational data. But the necessary and sufficient conditions laid out here suggest that, in fact, the simple strategy of listwise deletion—simply “ignoring” the missingness and analyzing the non-missing observations that are at hand—yields robust estimates under a pretty wide range of missingness mechanisms. And even in a lot of cases where you can’t just ignore the missing data problem, you can often still construct a consistent estimate of your parameter of interest (as in Condition 3) without needing to use fancy procedures like multiple imputation or distributional modeling of the predictors. Even further, you may still be able to recover the parameter estimates of interest even if you can’t satisfy the sufficient conditions given here (as in the final example above). Finally, and more anecdotally, simulations I played around with while preparing this post suggest that, even when the regression coefficient of interest is not technically recoverable, the magnitude of bias is often small under many realistic conditions (see, for example, the magnitude of bias in the simulations of Thoemmes et al., 2015).

Footnotes and References

1 Recall that in DAGs, doubled-headed arrows like A \leftrightarrow B are just shorthand notation for A \leftarrow L \rightarrow B, indicating an unobserved common cause L.

2 Rubin’s third type of missingness, missing completely at random (MCAR), would be represented graphically by the relevant missingness node R being completely disconnected from all other nodes in the graph. An example of this would be if we flipped a coin for each observation and deleted the corresponding value when the coin landed heads. Although commonly assumed for convenience, MCAR is not actually very common in practice unless it is deliberately built into the study design.

3 Actually, while this is true in the special case of classical regression, it’s not true in general. See this later blog post of mine that discusses the fact that this property doesn’t hold in logistic regression.

Daniel, R. M., Kenward, M. G., Cousens, S. N., & De Stavola, B. L. (2012). Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21(3), 243-256.

Mohan, K., Pearl, J., & Tian, J. (2013). Graphical models for inference with missing data. In Advances in Neural Information Processing Systems (pp. 1277-1285).

Mohan, K., & Pearl, J. (2014). Graphical models for recovering probabilistic and causal queries from missing data. In Advances in Neural Information Processing Systems (pp. 1520-1528).

Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in missing data problems. Multivariate Behavioral Research, 49(5), 443-459.

Thoemmes, F., & Mohan, K. (2015). Graphical representation of missing data problems. Structural Equation Modeling: A Multidisciplinary Journal, 22(4), 631-642.

Designing multi-lab replication projects: Number of labs matters more than number of participants

In a multi-lab replication project, multiple teams of investigators collaborate to run the same study (or studies) concurrently at different research sites. The best examples of this in psychology are the various Many Labs projects. There are lots of reasons why multi-lab replication projects are great. For example, they allow us to estimate and potentially model any between-site variability in the effect size, so we learn more about the generality of the effect. Another reason is that they can have greater statistical power than single-lab studies — as long as they involve a large enough sample size of labs. The point of this blog post is to underscore this point about the number of labs needed to ensure high statistical power.

The verbal (intuitive?) explanation

We’re used to thinking about the number of participants as being, apart from the effect size, the chief factor in determining the statistical power of a study. And most of the time, in the kinds of single-lab studies we tend to run, this is basically true. Other factors matter as well — for example, the number of times we observe each participant, and the proportion of observations in each cell of the design — but as long as these other factors are within a typically sane range, they tend not to matter as much as the number of participants.

So it is perhaps natural that we apply this same heuristic to multi-lab replication projects. We reason that even if each lab can only recruit, say, 100 participants, and even if we can only recruit, say, 5 labs, this will give us a total of 500 participants, so statistical power will still be quite high! Right?

But here’s the thing. The reason the number of participants has such a big impact on power in the single-lab case is that, in those cases, the participants are the highest-level units or clusters in the design. That is to say, we can potentially have multiple observations of each participant — for example, we collect 50 reaction times from each participant — but these observations are clustered within participants, which are the high-level units. It turns out that the proper generalization of the “number of participants is important” heuristic is, in fact, “the number of highest-level units is important — the lower-level units, not as much.”

So now consider a multi-lab replication project. Here, the participants are clustered in labs. So the labs are the highest-level units. Remember the earlier example about having a study with 5 labs, each with 100 participants? In terms of its statistical power, this would be about like running a single-lab study with 5 participants, each of whom contributes 100 reaction times. In other words, it wouldn’t be great.

The quantitative view

Let’s look at some actual power results. We consider a simple multi-lab design where we have m labs, each of which recruits n participants that are divided into two conditions (n/2 participants per condition), and we observe each participant only a single time. In other words, we have a simple two-group between-subject experiment that is being replicated at m different labs, and the labs have random effects. The key quantity for determining the power of the study is \delta, the noncentrality parameter (for a non-central t distribution). It looks like this:

\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}

where d is the effect size (Cohen’s d), E is the proportion of random variation due to error variance (i.e., the ratio of the Error variance over the [weighted] sum of all the variance components), and L is the proportion of random variation due to Lab variance (actually, it’s the proportion of Lab-by-Condition interaction variance, but I’m calling it Lab variance for short). Statistical power is pretty much determined by the noncentrality parameter — there’s technically also some influence of the degrees of freedom, but that tends not to matter much as long as it is above, say, 20 or so. So basically, we can understand the power of this multi-lab replication design by considering this noncentrality parameter expression.
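Given the stated caveat about degrees of freedom, we can turn \delta into power with a couple of lines of Python (using scipy's noncentral t; taking df = m - 1 is my simplifying assumption, and `power_multilab` is a made-up name):

```python
import numpy as np
from scipy import stats

def power_multilab(m, n, d=0.4, E=0.5, L=0.05, alpha=0.05):
    """Approximate power of the two-condition multi-lab design from the
    noncentrality parameter in the text: m labs, n participants per
    lab, effect size d, Error-variance proportion E, Lab-variance
    proportion L."""
    delta = d / (2 * np.sqrt(E / (m * n) + L / m))
    df = m - 1                                # assumed df for the t-test
    tcrit = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical value
    return stats.nct.sf(tcrit, df, delta)     # P(exceed it | noncentral t)
```

Comparing 20 labs of 25 participants with 5 labs of 100 participants (the same 500 people in total) shows the more-labs design winning decisively, which previews the main point below.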

First, let’s just plug a range of plausible values into the variables comprising \delta and see what values of statistical power they imply. Here’s a series of contour plots where we vary the variables within plausible ranges.

Statistical power of the multi-lab replication design as a function of m, n, d, and L. The ranges of values for m and n probably don’t need any additional justification. For the range of Cohen’s d effect sizes, see this earlier blog post. The proportion of Error variance is always fixed at E = 50%, which in my informed opinion is a plausible value, but basically E doesn’t usually have much impact on power anyway, so the exact value is not too important. The range of values for L, the proportion of Lab variance, is much more interesting — as you can see, this actually has a big impact on power, so it’s important that our assumed values of L are reasonable. I have assumed that a plausible range is about from 1% to 10%, with the most plausible value around 5% or so. The justification for this is rather involved, so I wrote up a separate little document about it HERE. R code to reproduce this figure can be found HERE.

The middle panel represents what I think are the most plausible values. There are a couple of interesting things to point out about these power results. The first is that…

Increasing the number of labs usually raises power more quickly than increasing the number of participants per lab

The way to see this in the graphs is to consider the angle of the contours in each plot. More specifically, for any given point (i.e., pair of sample sizes) in any of the plots, consider the direction in which we would want to step to increase power fastest. For most parameter combinations, the path of steepest ascent up the power surface goes up along the y-axis (number of labs) more than it goes sideways along the x-axis (participants per lab). This is especially true when there is a lot of Lab variance (L=10\%, in the right-hand column), but is still usually true when there is little Lab variance (L=1\%, in the left-hand column).

There is another way of visualizing this that makes this point more clear. It uses the idea of indifference curves — technically speaking, curves where the rate of change in the noncentrality parameter w.r.t. the number of labs is equal to the rate of change w.r.t. participants per lab. The way these indifference curves are plotted below, studies that are below the relevant indifference curve would get a greater power benefit from increasing the number of labs, and studies that are above the relevant indifference curve would get a greater power benefit from increasing the number of participants per lab. For studies that lie exactly on the indifference curve, power would increase just as fast by increasing the number of labs as by increasing the number of participants per lab.
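Where that indifference condition falls can be worked out directly from \delta. Setting the partial derivative with respect to m equal to the partial derivative with respect to n and simplifying gives m = n(1 + nL/E). This is my own derivation, not code from the original analysis, but it matches the claim that the curves don't depend on d (the effect size cancels out).

```r
# Indifference condition from equating d(delta)/dm and d(delta)/dn:
# m = n * (1 + n*L/E). Below this m, adding labs raises power faster;
# above it, adding participants per lab does. (My derivation; hedged.)
indiff_m <- function(n, E = 0.5, L = 0.05) n * (1 + n * L / E)
indiff_m(n = 20)  # with 20 participants/lab, labs win until you have ~60 labs
```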

Proportion of Error variance is fixed at E = 50%. The indifference curves do not depend on the effect size d… yay!

As you can see, most of the time there is a greater benefit (i.e., statistical power will increase faster) to increasing the number of labs. This is especially true if there is a lot of Lab variance. But it tends to be true even when there is little Lab variance. The cases where it makes more sense to increase the number of participants per lab are when you already have a large number of labs but a small number of participants per lab. And let’s face it, your multi-lab replication project is probably not in this part of the space.

Increasing the number of participants per lab — but holding constant the number of labs — will not, in general, cause statistical power to approach 100%

This one is important. We tend to assume that as long as we continue to recruit participants, eventually we will have high statistical power. So even if we didn’t recruit as many labs for our project as we had hoped, we should be able to compensate for this by just recruiting more participants per lab — right? Unfortunately it isn’t so. The truth is that if we hold constant the number of labs, statistical power does not approach 100% as the number of participants per lab approaches infinity. Instead, power approaches some maximum attainable power value that can possibly be quite small, depending on the effect size, number of labs, and the Lab variance.

This can actually be seen pretty easily by considering again the expression of the noncentrality parameter:

\delta = \frac{d}{2\sqrt{\frac{E}{mn} + \frac{L}{m}}}

In the limit as n approaches infinity, the term in the denominator involving E disappears, but the term involving L does not, so the whole noncentrality parameter converges to a finite and possibly small value. Here’s what the situation looks like graphically for a handful of representative values of the variables:

Effect size d = 0.4. Proportion of Error variance E = 50%.  In the left panel, L is 1%, 5%, or 10%, and n is 16, 32, or 64. In the right panel, L is 1%, 5%, or 10%, and m is 4, 8, or 16.

The curves in each panel are unlabeled because, honestly, they don’t really matter. (If you want to know the power values for specific parameter combinations, the first graph is better for that anyway.) The point is just to show that when we increase the number of labs, power does eventually approach 100%, and the values of the other variables simply affect how quickly this happens. But when we increase the number of participants per lab — but, crucially, hold constant the number of labs — power sometimes approaches a value close to 100%, but often does not. The maximum attainable power will tend to be low when (a) the effect size is small, (b) the number of labs is small, (c) the Lab variance is high.
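We can also compute that ceiling directly: in the limit, \delta_{max} = \frac{d}{2\sqrt{L/m}}, and the maximum attainable power follows. Here's a sketch, again under my simplifying assumption that df ≈ m − 1:

```r
# Maximum attainable power as n -> infinity: delta_max = d / (2*sqrt(L/m)).
# df = m - 1 is my simplifying assumption for this design.
max_power <- function(d, m, L, alpha = 0.05) {
  delta_max <- d / (2 * sqrt(L / m))
  df <- m - 1
  crit <- qt(1 - alpha / 2, df)
  pt(-crit, df, ncp = delta_max) + 1 - pt(crit, df, ncp = delta_max)
}
max_power(d = 0.4, m = 8, L = 0.05)  # even infinite n cannot push power past this
```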


The main conclusion is pretty simple. Multi-lab replication projects are great, and you should do them… but when you’re designing them, you should really try hard to recruit as many labs to collaborate in the project as you possibly can. You actually don’t need that many participants from each lab — the first figure shows that unless the Lab variance is tiny, you get quite diminished returns by recruiting more than 100 or so participants per lab — so perhaps this is a selling point that you can use to convince more labs to join your project (“we would only need you to run a few dozen participants!”).

If you want to do your own power analyses for multi-lab designs like this, or for other more complicated designs, you can use my PANGEA app.

Five different “Cohen’s d” statistics for within-subject designs

Jeff Rouder poses an “effect size puzzler” where the puzzle is simply to compute a standardized effect size for a simulated dataset where subjects make 50 responses in each of 2 conditions. He offers a sweatshirt prize (???) to anyone who can do so “correctly.”

(Update: Jeff posted his answer to the effect size puzzler here.)

As it happens, I’ve been meaning for a while to write a blog post on issues with computing d-like effect sizes (or other standardized effect sizes) for within-subject designs, so now seems like a good time to finally hammer out the post.

Jeff didn’t actually say anything to restrict us to standardized mean difference type measures (as opposed to, say, variance explained type measures), and we can only guess whether the “correct” effect size he has in mind is a d-like measure or an R^2-like measure or what. But here I’ll focus on d-like measures, which are perhaps the most popular for studies with categorical predictors, and offer plenty of complications that I think are often under-appreciated (and it seems Jeff would agree).

What I’ll show here is that there are at least 5 different and non-equivalent ways that people might compute a d-like effect size (which they would invariably simply call “Cohen’s d”) for Jeff’s dataset, and the resulting effect sizes range from about 0.25 to 1.91. I’ll compare and contrast these procedures and ultimately choose one that I think is the least crazy, if you must compute a standardized effect size (more on that later). When I read a paper that reports “Cohen’s d” for a within-subject design, I usually have no idea which of these 5 procedures the authors actually applied unless I try to reproduce the effect size myself from some of the descriptives in the paper, which is often not possible.

Jeff’s dataset can be downloaded here. Here’s some code to load it into R, examine it, print the two condition means, and save the mean difference as md.
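The original setup code isn't shown here, so below is a minimal stand-in that simulates data with the same structure as Jeff's dataset (25 subjects, 50 RTs per condition, columns id, cond, rt). All the generating numbers are my own assumptions, so the outputs below won't exactly reproduce the values from Jeff's actual file.

```r
# Stand-in for the omitted setup code: simulated data, NOT Jeff's dataset.
set.seed(123)
n_sub <- 25; n_rep <- 50
dat <- expand.grid(id = factor(1:n_sub), cond = factor(1:2), rep = 1:n_rep)
sub_int <- rnorm(n_sub, 0, 0.05)               # random subject intercepts
dat$rt <- 0.75 + 0.05 * (dat$cond == "2") +    # condition effect (assumed)
  sub_int[dat$id] + rnorm(nrow(dat), 0, 0.2)   # residual noise
str(dat)
(cond_means <- with(dat, tapply(rt, cond, mean)))
md <- unname(diff(cond_means))                 # mean difference, used below
```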

d: Classical Cohen’s d

This is essentially a straightforward application of the formula provided by Cohen himself:

d = \frac{M_1 - M_2}{\sigma}
The numerator is the mean difference. Cohen never actually gives a clear, unambiguous definition of \sigma in the denominator, but it is usually taken to be the pooled standard deviation, that is, the square root of the average variance in each condition (assuming equal sample sizes in both conditions). In R code:

(d <- md / with(dat, sqrt(mean(tapply(rt, cond, var)))))
# 0.2497971

The crucial thing to recognize about applying the classical Cohen’s d is that it deliberately ignores information about the design of the study. That is, you compute d the same way whether you are dealing with a between-subjects, within-subjects, or mixed design. Basically, in computing d, you always treat the data as if it came from a simple two-independent-groups design. I don’t want to get bogged down by a discussion of why that is a good thing at this point in the post—I’ll get into that later on. For now I just note that, with this effect size, within-subject designs tend to be powerful not because they lead to larger effect sizes—if anything, the reverse is probably true, in that people elect to use within-subject designs when Cohen’s d is particularly small, for example in many reaction time studies—but rather because they allow us to efficiently detect smaller effect sizes due to removing irrelevant sources of variation from the denominator of the test statistic.

da: Cohen’s d after averaging over replicates

For this next way of computing a d-like effect size, the section heading pretty much says it all. In Jeff’s dataset, each participant makes a total of 100 responses, 50 in each condition. The computation of d_a proceeds by first averaging over the 50 responses in each subject-by-condition cell of the experiment, so that each subject’s data is reduced to 2 means, and then applying the classical d formula to this aggregated data. In R code:

sub_means <- with(dat, tapply(rt, list(id, cond), mean))
(d_a <- md / sqrt(mean(diag(var(sub_means)))))
# 0.8357347

Richard Morey blogged about some weird things that can happen when you compute a standardized effect size this way. The basic issue is that, unlike classical Cohen’s d, d_a does not ignore all the design information: d_a will tend to be larger when there are more replicates, that is, when each subject responds more frequently in each condition.

dz: Standardized difference scores

A third way to compute a d-like effect size is to reduce each subject’s data to a single difference score—the mean difference between their responses in each condition—and then use the standard deviation of these difference scores as the denominator of d. Cohen actually discusses this statistic in his power analysis textbook (Cohen, 1988, p. 48), where he carefully distinguishes it from the classical Cohen’s d by calling it d_z. In R, we can compute this as:

(d_z <- md / sd(sub_means[,2] - sub_means[,1]))
# 1.353713

There is a straightforward relationship between d_z and the test statistic: t_w = d_z\sqrt{n}, where t_w is the paired-samples t-statistic from a within-subjects design and n is the number of subjects. One might regard this as a virtue of d_z. I will argue below that I don’t think it’s a good idea to use d_z.

dt: Naive conversion from t-statistic

In a simple two-independent groups design, one can compute the classical Cohen’s d from the t-statistic using

d=t_b\sqrt{\frac{2}{n}} ,

where t_b is the independent-samples t-statistic for a between-subjects design and n is the number of subjects per group. Many, many authors over the years have incorrectly assumed that this same conversion formula will yield sensible results for other designs as well, such as in the present within-subjects case. Dunlap et al. (1996) wrote a whole paper about this issue. One might suppose that applying this conversion formula to t_w will yield d_z, but we can see that this is not the case by solving the equation given in the previous section for d_z, which yields d_z=\frac{t_w}{\sqrt{n}}. In other words, naive application of the between-subjects conversion formula yields an effect size that is off by a factor of \sqrt{2}.

To compute d_t in R:

t_stat <- t.test(sub_means[,2] - sub_means[,1])$statistic
(d_t <- t_stat*sqrt(2/nrow(sub_means)))
# 1.914439
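The relationship between these quantities is easy to verify numerically. Here's a self-contained check (the numbers are arbitrary, chosen only to illustrate the algebra):

```r
# d_z = t_w / sqrt(n), while the naive conversion gives d_t = t_w * sqrt(2/n),
# so d_t overstates d_z by a factor of sqrt(2), i.e. about 41%.
n <- 25
t_w <- 6.77                # an arbitrary paired-samples t-statistic
d_z <- t_w / sqrt(n)
d_t <- t_w * sqrt(2 / n)
d_t / d_z                  # = sqrt(2)
```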

dr: Residual standard deviation in the denominator

Finally, one can compute a d-like effect size for this within-subject design by assuming that the \sigma in the classical Cohen’s d formula refers to the standard deviation of the residuals. This is the approach taken in Rouder et al. (2012) on Bayes factors for ANOVA designs. In between-subjects designs where each subject contributes a single response, this is equivalent to classical Cohen’s d. But it differs from classical Cohen’s d in designs where subjects contribute multiple responses.

To compute d_r in R, we need an estimate of the residual standard deviation. Two ways to obtain this are from a full ANOVA decomposition of the data or by fitting a linear mixed model to the data. Here I do it the mixed model way:

library(lme4)  # assuming lme4 for the mixed model
mod <- lmer(rt ~ cond + (cond||id), data=dat)
summary(mod)
# Random effects:
#  Groups   Name        Variance  Std.Dev.
#  id       (Intercept) 0.0026469 0.05145 
#  id.1     cond        0.0001887 0.01374 
#  Residual             0.0407839 0.20195 
# Number of obs: 2500, groups:  id, 25
# Fixed effects:
#             Estimate Std. Error t value
# (Intercept) 0.779958   0.016402   47.55
# cond        0.052298   0.008532    6.13

The residual standard deviation is estimated as 0.20195, which gives us d_r = 0.0523/0.20195 \approx 0.259. It turns out that, for this dataset, this is quite close to the classical Cohen’s d, which was 0.25. Basically, classical Cohen’s d is equivalent to using the square root of the sum of all the variance components in the denominator1,2, rather than just the square root of the residual variance as d_r uses. For this simulated dataset, the two additional variance components (intercepts and slopes varying randomly across subjects) are quite small compared to the residual variance, so adding them to the denominator of the effect size does not change it much. But the important thing to note is that for other datasets, d and d_r could differ dramatically.
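To see the near-equivalence numerically, we can rebuild classical d from the variance components in the lmer output above. I'm treating the fixed-effect estimate as the condition mean difference and simply summing the variance components, which per the footnotes is valid for the contrast coding used here:

```r
# Classical d rebuilt from the mixed-model variance components (cf. footnotes 1-2).
md    <- 0.052298                                 # condition mean difference (fixed effect)
denom <- sqrt(0.0026469 + 0.0001887 + 0.0407839)  # sqrt of summed variance components
md / denom                                        # ~0.2504, vs. 0.2498 computed earlier
```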

So which one should I compute?

Contrary to Jeff, I don’t really think there’s a “correct” answer here. (Well, maybe we can say that it’s hard to see any justification for computing d_t.) As I put it in the comments to an earlier blog post:

Basically all standardized effect sizes are just made-up quantities that we use because we think they have more sensible and desirable properties for certain purposes than the unstandardized effects. For a given unstandardized effect, there are any number of ways we could “standardize” that effect, and the only real basis we have for choosing among these different effect size definitions is in choosing the one that has the most sensible derivation and the most desirable properties relative to other candidates.

I believe that classical Cohen’s d is the option that makes the most sense among these candidates. Indeed, in my dissertation I proposed a general definition of d that I claim is the most natural generalization of Cohen’s d to the class of general ANOVA designs, and I considered it very important that it reduce to the classical Cohen’s d for datasets like Jeff’s. My reasoning is this. One of the primary motivations for using standardized effect sizes at all is so that we can try to meaningfully compare effects from different studies, including studies that might use different designs. But all of the effect size candidates other than classical Cohen’s d are affected by the experimental design; that is, the “same” effect will have a larger or smaller effect size based on whether we used a between- or within-subjects design, how many responses we required each subject to make, and so on. Precisely because of this, we cannot meaningfully compare these effect sizes across different experimental designs. Because classical Cohen’s d deliberately ignores design information, it is at least in-principle possible to compare effect sizes across different designs. Morris and DeShon (2002) is a nice paper that talks about these issues. Bakeman (2005) also has some great discussion of essentially this same issue, focused instead on “variance explained”-type effect sizes.

Although I don’t really want to say that there’s a “correct” answer about which effect size to use, I will say that if you choose to compute d_z, d_r, or anything other than classical Cohen’s d, just please do not call it Cohen’s d. If you think these other effect sizes are useful, fine, but they are not the d statistic defined by Cohen! This kind of mislabeling is how we’ve ended up with 5 different ways of computing “Cohen’s d” for within-subjects designs.

Finally, there is a serious discussion to be had about whether it is a good idea to routinely summarize results in terms of Cohen’s d or other standardized effect sizes at all, even in less complicated cases such as simple between-subjects designs. Thom Baguley has a nice paper with some thoughtful criticism of standardized effect sizes, and Jan Vanhove has written a couple of nice blog posts about it. Even Tukey seemed dissatisfied with the enterprise. In my opinion, standardized effect sizes are generally a bad idea for data summary and meta-analytic purposes. It’s hard to imagine a cumulative science built on standardized effect sizes, rather than on effects expressed in terms of psychologically meaningful units. With that said, I do think standardized effect sizes can be useful for doing power analysis or for defining reasonably informative priors when you don’t have previous experimental data.


1 In general, this is really a weighted sum where the variance components due to random slopes must be multiplied by a term that depends on the contrast codes that were used. Because I used contrast codes of -1 and +1, it works out to simply be the sum of the variance components here, which is precisely why I changed the default contrasts before fitting the mixed model. But be aware that for other contrast codes, it won’t simply be the sum of the variance components. For more info, see pp. 20-21 of my dissertation.

2 If you actually compute classical Cohen’s d using the square root of the sum of the estimated variance components, you will find that there is a very slight numerical difference between this and the way we computed d in the earlier section (0.2498 vs. 0.2504). These two computations are technically slightly different, although they estimate the same quantity and should be asymptotically equivalent. In practice the numerical differences are negligible, and it is usually easier to compute d the previous way, that is, without having to fit a mixed model.

Reading list: Introduction to linear mixed models for cognitive scientists

(Note: This is one of three posts that I wrote some time ago that have just been languishing under the “Misc.” tab of my website for a while, because for whatever reason I didn’t feel that they were a good fit for my blog. Well, I’ve decided to go ahead and move them to the blog, so here you go!)

List last updated: June 20, 2015

Baayen, R. H. (2008). Mixed models. Chapter 7 in Analyzing linguistic data: A practical introduction to statistics using R. Cambridge University Press.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of memory and language, 59(4), 390-412.

Brysbaert, M. (2007). The language-as-fixed-effect-fallacy: Some simple SPSS solutions to a complex problem. London: Royal Holloway, University of London.

Carson, R. J., & Beeson, C. M. (2013). Crossing language barriers: Using crossed random effects modelling in psycholinguistics research. Tutorials Quant Meth Psych, 9(1), 25-41.

Hoffman, L., & Rovine, M. J. (2007). Multilevel models for the experimental psychologist: Foundations and illustrative examples. Behavior Research Methods, 39(1), 101-117.

Janssen, D. P. (2012). Twice random, once mixed: Applying mixed models to simultaneously analyze random effects of language and participants. Behavior Research Methods, 44(1), 232-247.

Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem. Journal of Personality and Social Psychology, 103(1), 54-69.

Locker, L., Hoffman, L., & Bovaird, J. A. (2007). On the use of multilevel modeling as an alternative to items analysis in psycholinguistic research. Behavior Research Methods, 39(4), 723-730.

Quené, H., & Van den Bergh, H. (2004). On multi-level modeling of data from repeated measures designs: A tutorial. Speech Communication, 43(1), 103-121.

Quené, H., & Van den Bergh, H. (2008). Examples of mixed-effects modeling with crossed random effects and with binomial data. Journal of Memory and Language, 59(4), 413-425.

Richter, T. (2006). What is wrong with ANOVA and multiple regression? Analyzing sentence reading times with hierarchical linear models. Discourse Processes, 41(3), 221-250.

Sorensen, T., & Vasishth, S. (2015). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists.

Winter, B. (2014). A very basic tutorial for performing linear mixed effects analyses. arXiv preprint arXiv:1308.5499.

Don’t fight the power (analysis)

Public Enemy are famously skeptical of power analysis.

Researchers often feel uneasy about using power analysis to design their actual experiments because of uncertainty about the effect size in the study to be run. A common sentiment that one hears goes something like:

“I can’t do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn’t have to run the study in the first place!”

The implication of this view is that, unless one has actually done experiments in the past that are pretty similar to the one being considered, there is otherwise no justifiable basis for making any particular assumptions about the effect size in the present study. In order to have a good idea about the effect size, the argument goes, we have to actually run the study, at which point obviously the power analysis is no longer needed. Convinced by this reasoning, many researchers throw up their hands, decide that power analysis will not be useful here or perhaps ever, and just plan instead on collecting some loosely conventional sample size that depends on their research area, but is usually something like 20-30 observations per cell of the design. In other words, they fight the power.

I’m here to convince you that fighting the power is a self-defeating research habit.

You know more than you think before the study

The first premise of the argument against power analysis is that we know little or nothing about the effect size before the study has been run. On the contrary. In the year 2015 we can benefit from decades of meta-analyses that have summarized the typical effect sizes found in almost any imaginable corner of the research literature. We even have meta-meta-analyses of those meta-analyses. The effect size in your future study is likely to resemble the effect sizes of the past, and luckily for us, the meta-analytic data on typical effect sizes are vast.

I want to illustrate just how good our situation really is by considering what is probably our worst case scenario in terms of study design: The case where we know absolutely nothing whatsoever about the study to be run except that the topic matter could broadly be classified as “social psychology” or some related field. In that case, we can use the data from Richard, Bond, and Stokes-Zoota (2003), who conducted a meta-analysis of meta-analyses in the field of social psychology to determine the range of typical effect sizes across the entire field, involving some 25,000 individual studies published over 100 years in diverse research areas. While the focus of this meta-meta-analysis was the field of social psychology, I believe there is little reason to expect the distribution of typical effect sizes to be appreciably different in other areas of psychology, such as cognitive psychology (if you are aware of large-scale meta-analytic data to the contrary, please let me know). Anyway, the figure below summarizes the distribution of effect sizes that they found.


Their meta-analysis actually examined the effects on the Pearson’s r (correlation) scale, and the bumpy density curve in the left panel shows their aggregated data (copied/pasted from their Figure 1). The smooth curve overlaying that data is the best-fitting beta distribution1, on which the percentiles and other statistics are based, and the curve in the right panel is based on applying a standard conversion formula to the smooth curve in the left panel2.

What this shows is that, in the absence of any other information about the study you are about to run, a pretty reasonable assumption about the effect size is that it is equal to the historical average: r = 0.21 or d = 0.45. Or you could use the median, or be conservative and go with the 30th percentile, or whatever you want. The point is, we have enough information to make a pretty well-informed decision even if we have no specific information at all about the actual study.

Of course, in most cases in the real world, you probably do know something about the study you are about to run. In almost all cases, that knowledge will allow you to make an even more refined estimate of the effect size, either by finding a meta-analysis that looks more specifically at effects that are conceptually similar to yours (you could even start with Richard et al., who helpfully break down the average effect size in social psychology by broad research area), or just by starting with the aggregate historical estimate and adjusting from there based on how you think your study differs from the average study of the past.

You know less than you think after the study

The argument that opened this post pointed out that we don’t know the effect size before the study has been run. That’s true, but of course, we don’t know the effect size after the study has been run either. Instead what we have is some data from which we can construct an estimate of the effect size. Realizing this allows us to ask the quantitative question: Just how good of an effect size estimate do we have at the end of a typically-sized experiment? If our estimate of the effect size after an initial study is not much better than what we could already surmise based on the historical, meta-analytic data, then it doesn’t make a lot of sense to trust the former a lot more than the latter.

Consider a typical study in which we compare two independent groups with n=30 participants per group. Below I’ve simulated some data in which the standardized mean difference between the two groups is exactly equal to the historical average of d = 0.45. The figure below shows a bootstrap sampling distribution of the effect size in this situation3. If we ignore all prior information that we have about typical values of the effect size, as many researchers routinely do, then this sampling distribution summarizes everything we know about the effect size after running a pretty typical study.


Compare this distribution to the right panel of the first Figure from above, which showed our prior knowledge about the likely values of d. In terms of how much information they carry about d, the two distributions are really not that different. The sampling distribution is slightly less variable—it has a standard deviation of 0.27 rather than 0.37—but this difference in variability is quite hard to see from visual inspection.
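The bootstrap isn't strictly necessary here; a standard large-sample approximation to the standard error of sample d gives nearly the same answer:

```r
# Approximate SE of Cohen's d for two independent groups of size n each
# (the standard large-sample formula used in meta-analysis).
se_d <- function(d, n) sqrt(2 / n + d^2 / (4 * n))
se_d(d = 0.45, n = 30)  # ~0.26, close to the bootstrap SD of 0.27
```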

Many researchers are hesitant to rely on power analyses based on historical information about d, but feel much more comfortable relying on a power analysis based on the sample d from an initial study like this. I submit that this doesn’t make a lot of sense because we don’t really have much more information about the likely values of d after running the initial study than we already had about d before running a single participant. In fact, for small pilot studies, the sampling distribution of d might actually be more variable than our prior distribution for d based on historical, meta-analytic data. Below is a figure that compares the variability of sample d (as a function of total sample size N) to the variability of our prior beliefs about d. We can see that the two distributions have the same standard deviation at approximately N=30. One way to view this is that, in the absence of any other information about the study to be run, we have about as much prior information about d as if we had run a pilot study with N=30 (and ignored all prior information rather than updating our beliefs about d in a Bayesian fashion).


Living with uncertainty

Whether we use historical data or data from previous experiments we have run, there will always be some uncertainty about the effect size. So there are a range of plausible assumptions we could make about the effect size when doing a power analysis, and these different assumptions imply different sample sizes to collect in the study. In many cases, the uncertainty will be pretty high, so that the range of recommended sample sizes will be quite wide, a fact which many researchers find disconcerting.

Uncertainty is a fact of scientific life and should be no cause for dismay. We have all (hopefully) learned to be comfortable with uncertainty in other aspects of the research process. Unfortunately, many researchers seem oddly unwilling to accept even modest uncertainty in the planning phase of the research. In responding to such a view, it’s hard to put it better than @gung did in this answer on Cross Validated:

“Regarding the broader claim that power analyses (a-priori or otherwise) rely on assumptions, it is not clear what to make of that argument. Of course they do. So does everything else. Not running a power analysis, but just gathering an amount of data based on a number you picked out of a hat, and then analyzing your data, will not improve the situation.”

Uncertainty is there whether we like it or not. We should try to make the best design decisions possible in light of that uncertainty. Power analysis is our best tool for doing so. Before I close the post, let me clarify: In my opinion, there is nothing wrong with planning experiments based on rules of thumb. I acknowledge that much of the time it won’t make sense to do a formal power analysis for each and every experiment, because often we won’t have a lot of specific information about the particular study we’re about to run beyond the kind of general information we have about the typical experiments we tend to run. My point is that we should apply statistically well-informed rules of thumb that are based on historical, meta-analytic data, and are calibrated to work pretty well in a range of realistic research situations—not dubious heuristics like an n=30 rule. One of the most important functions of power analysis is to help us construct such good rules of thumb.

Power analysis: Good for President Obama, good for you
Power analysis: Thanks, Obama!4

1 For those interested, the parameters of this beta distribution are about [math]\alpha=1.34, \beta=5.03[/math].

2 The correct conversion from Pearson’s r to Cohen’s d depends on the assumed proportion of participants in the two groups. The statistics that I present in the figure are based on the standard formula that assumes the group sizes are equal. I experimented with various ways of relaxing that assumption in a realistic manner, but ultimately found that the difference was negligible unless one assumes the group sizes tend to be markedly and unrealistically unequal.

3 The mean shown in the figure is the mean of the bootstrap distribution. This mean is slightly higher than the assumed value of 0.45 because the sampling distribution of d is slightly positively skewed, reflecting the fact that sample d is a slightly positively biased estimate of population d.

4 Thanks to Katie Wolsiefer for this figure caption, which is way better than my original.

Follow-up: What about Uri’s 2n rule?

This post is a quick follow-up to yesterday’s post on sample size rules. Basically I thought it was a little too long to go in the comments section so here it is.

Some people on twitter (and in my blog comments) remarked that my conclusion appears to fly in the face of some things Uri Simonsohn wrote on a similar topic not too long ago. Briefly, Uri writes of a Study 1 in which there are two independent groups (A1 and B1) and some non-zero effect size, and a Study 2 in which we move to a 2×2 design in which the difference between conditions A1 and B1 is the same size as before, but the difference between the other two conditions A2 and B2 is exactly 0, and we are now testing the interaction effect. Uri concludes: “To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.” Let’s call this Uri’s 2n rule*.

I thought Uri’s post was cool and certainly don’t think my point contradicts the point he made. The important thing to note here is that the effect size that Uri assumes to be constant in both studies is just the A1-B1 difference. But that’s not the effect we’re actually testing in Study 2: we’re testing the interaction effect, i.e., the A1 – B1 – A2 + B2 difference. And there is no assumption that the effect size for the interaction in Study 2 is equal to the effect size for the A1-B1 difference in Study 1. In fact, the 2n rule depends on the relevant effect size in Study 2 being half the relevant effect size in Study 1. That’s why you must increase the sample size when moving to Study 2; for the situation Uri’s talking about, the relevant effect size gets cut in half in Study 2. In my post I’m talking about cases where the relevant effect size is held constant. If the relevant effect size is held constant, then adding cells to the design has a negligible impact on power.

Numerical example

Consider the following table of cell means and standard deviations (the latter in parentheses).

|   | Condition 1 | Condition 2 |
|---|-------------|-------------|
| A | 0.5 (1)     | 0 (1)       |
| B | 0 (1)       | 0 (1)       |

Let’s say there are 20 subjects in each cell. Now if Study 1 involves only groups A1 and B1 (so that there is N=40 in total) then the power of the study is 34%. And if Study 2 involves all four groups (so that there is N=80 in total), then the power to detect the interaction effect is only 20%. But if we double the sample size (so that there is n=40 in each cell and N=160 in total), then the power to detect the interaction effect is 34%, just as it was in Study 1. This is the 2n rule that Uri wrote about, and I don’t dispute it.
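If you want to verify these numbers yourself, here's a sketch in Python (using scipy). I treat each test as a simple two-group comparison: the interaction test pools A1 with B2 on one side and A2 with B1 on the other, which reproduces the power values above up to a trivial difference in degrees of freedom:

```python
import numpy as np
from scipy import stats

def power_two_group(d, n):
    """Power of a two-sided, alpha = .05 two-sample t-test with n per group."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)
    t_crit = stats.t.ppf(0.975, df)
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

p_study1 = power_two_group(0.50, 20)   # Study 1: d = 0.5, n = 20 per group
p_study2 = power_two_group(0.25, 40)   # Study 2: interaction d = 0.25, 40 obs per side
p_doubled = power_two_group(0.25, 80)  # Study 2 with n doubled: 80 obs per side

print(p_study1, p_study2, p_doubled)   # doubling n restores Study 1's power
```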

But now let’s look at the standardized effect sizes for the two studies. We use Cohen’s d, defined as d=(\mu_1-\mu_2)/\sigma, where \mu_1 and \mu_2 are the two means being compared and \sigma is the pooled standard deviation, or equivalently, the root mean squared error (RMSE). Computing this in Study 1 is straightforward since there are only two groups; we have d=\frac{0.5-0}{1} = 0.5.

Computing d in Study 2 is a little less clear since in that case we have four groups and not two. We saw above that the relevant mean difference is the A1 – B1 – A2 + B2 difference. The key here is to realize that the interaction effect essentially still comes down to a comparison of two groups: We are comparing the A1 and B2 groups (which have coefficients of +1 in this difference score) against the A2 and B1 groups (which have coefficients of -1). So the two relevant means to use in computing d are the mean of the A1 and B2 means, and the mean of the A2 and B1 means. This gives us d=(\frac{0.5+0}{2}-\frac{0+0}{2})/1=0.25 for the interaction effect. In other words, the relevant effect size in Study 2 is half of what the relevant effect size in Study 1 was.

It’s easy to show this symbolically. Let d_1 be the effect size in Study 1 and d_2 be the effect size in Study 2. Then, starting with the classical definition of d,

d_2=\frac{\mu_1-\mu_2}{\sigma}=\frac{\frac{\mu_{A1}+\mu_{B2}}{2}-\frac{\mu_{A2}+\mu_{B1}}{2}}{\sigma}=\frac{(\mu_{A1}-\mu_{B1})-(\mu_{A2}-\mu_{B2})}{2\sigma} .

In this example we’ve assumed the \mu_{A2}-\mu_{B2} difference is 0, so that

d_2=\frac{\mu_{A1}-\mu_{B1}}{2\sigma}=\frac{d_1}{2} .

* Robert Abelson wrote about a similar phenomenon which he called the 42% rule: If the A1-B1 difference just reaches significance at p=.05 with size d, then the A2-B2 difference has to be at least 42% as large as d in the opposite direction for the interaction in Study 2 to reach p=.05 with the same sample size.

Think about total N, not n per cell

The bottom line of this post is simple. There are lots of rules of thumb out there for minimum sample sizes to use in between-subjects factorial experiments, but they are virtually always formulated in terms of the sample size per cell, denoted as small n. For example, one ubiquitous informal rule suggests using about n=30 or so. The problem is that a cursory power analysis shows that rules based on small n don’t really make sense in general, because what is more directly relevant for power is the total sample size, denoted as big N. So if you must rely on rules of thumb (which I actually don’t have any big problem with), try to use sample size rules based on big N, not small n.

The example

The idea of writing about this came from a recent interaction with a colleague, which I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of N=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s d = 0.5.

This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about n=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only n=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not decrease. So their power calculation must be flawed, right?”

The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small n.

Big N vs. small n

Mathematically, it’s a little tricky to say that power is a function of N rather than n. After all, one can write the relevant equations in terms of N or in terms of n, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of N than it is a function of n, so that rules of thumb based on N are far more useful than rules of thumb based on n.

One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on N and one based on n, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.

[Table: statistical power under a rule based on big N vs. a rule based on small n, for several between-subjects factorial designs and effect sizes.]

Here I’ve chosen an N=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on N, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on n, power remains a more complicated, joint function of the effect size and the factorial structure.
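Here's a sketch of the kind of comparison in question (Python/scipy). I approximate each factorial contrast as a comparison of two pooled halves of the sample, and I use an n=30-per-cell rule as the small-n competitor; both choices are illustrative:

```python
import numpy as np
from scipy import stats

def contrast_power(d, n_half):
    """Power (alpha = .05) for a contrast comparing two halves of the sample."""
    df = 2 * n_half - 2
    ncp = d * np.sqrt(n_half / 2)
    t_crit = stats.t.ppf(0.975, df)
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

d = 0.5
powers_N, powers_n = [], []
for cells in (2, 4, 8):                                  # 2-group, 2x2, 2x2x2 designs
    powers_N.append(contrast_power(d, 128 // 2))         # big-N rule: N = 128 always
    powers_n.append(contrast_power(d, 30 * cells // 2))  # small-n rule: n = 30 per cell
    print(f"{cells} cells: N-rule power = {powers_N[-1]:.2f}, "
          f"n-rule power = {powers_n[-1]:.2f}")
```

Under the N-rule the power is identical (about .80) in every design, while under the n-rule it swings from under .50 in the two-group design to over .95 in the 2×2×2 design.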

Final caveat / technical footnote

Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called 2^k factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast (which we might call N’) works pretty well as a simple approximation in most cases. The more technically correct (but more complicated) procedure depends on the product of N and the variance of the contrast; see this working paper of mine for more details.

The hierarchical ordering principle

(Background reading: Jeff Rouder’s blog post on the “dominance principle”, which this post is mainly a response to.)

How many subjects should we routinely aim to recruit when we conduct experiments using within-subjects designs? It’s well known that within-subjects experiments tend to be more sensitive than between-subjects experiments for detecting mean differences of a comparable size, so one reasonable albeit vague answer is: “fewer than we would recruit if it were a between-subjects experiment.”

Interestingly, there is no logically or mathematically necessary reason why this truism of experimental design must be true in general. That is, it’s theoretically possible for the within-subjects version of an experiment to be no more powerful than the between-subjects version, or even for it to be less powerful. But, as a matter of empirical fact, the within-subjects experiment typically is more powerful, often substantially so. This blog post is about different possible explanations for why this is usually true.

Distinct Sources of Random Variation

Before we can explain the observation described above, we need to clarify in more precise terms what it is we’re trying to explain. To do so, we must clearly distinguish the estimable sources of random variation that perturb subjects’ responses in a simple within-subjects experiment. To illustrate, below is a small, made-up batch of data from a simple design involving 10 subjects measured both before (“pre-test”) and after (“post-test”) undergoing some Treatment, with each subject’s two responses tied together.

[Figure: pre-test and post-test scores for 10 subjects, with each subject’s two responses connected by a line.]

Subject variance: Some subjects are “high responders” who tend to have high scores on average, other subjects are “low responders” who tend to have low scores. Subject variance is captured by the variance of the subject means. In terms of the plot above, this corresponds to the variance of the mid-points of the 10 lines.

Subject-by-Treatment (SxT) interaction variance: Some subjects tend to show large Treatment effects (big increases from pre-test to post-test) while other subjects tend to show small or even negative Treatment effects. In terms of the plot above, imagine that we characterized the slope of each line with a number giving the difference from pre-test to post-test. The Subject-by-Treatment interaction variance is captured by the variance of those slopes or differences.

Error variance: Imagine that rather than measuring each subject only once at pre-test and once at post-test, we had measured each subject multiple times before and after treatment. Naturally we would still expect some random fluctuation in a particular subject’s pre-test scores and some fluctuation in that subject’s post-test scores, due to measurement error or whatever else. This variation is error variance—it cannot be accounted for by any other variable measured in the experiment. In the present example, where we supposed that subjects were measured only a single time at pre-test and post-test, we still imagine that the responses are perturbed by some amount of error variance. However, without multiple measurements at each time-point, we cannot statistically distinguish the error variance from the SxT interaction variance; the design confounds them.

The Hierarchical Ordering Principle

Now we’re in a position to more precisely describe the empirical pattern discussed at the beginning of this post. Ready? Here we go.

In a between-subjects experiment, the relevant sources of variation—that is, the sources of variation that end up in the standard error of the overall Treatment effect—are Error variance and Subject variance. (The Subject-by-Treatment variance can’t be estimated in that case because we only observed each subject at pre-test or post-test, but not both.) But in a within-subjects experiment, the relevant sources of variation are Error variance and Subject-by-Treatment variance. (The variance in the subject means can be estimated in this case, but it is no longer relevant, since intuitively all that matters is each subject’s pre-test to post-test difference.)

Now, we assume the size of the effect being studied is the same regardless of whether we used a between- or within-subjects design. So if the within-subjects experiment is more powerful than the between-subjects experiment, this implies that the Subject-by-Treatment interaction variance is smaller than the Subject variance. This is exactly what the “hierarchical ordering principle” says:

“Hierarchical ordering […] is a term denoting the observation that main effects tend to be larger on average than two-factor interactions, two-factor interactions tend to be larger on average than three-factor interactions, and so on” (Li, Sudarsanam, & Frey, 2006, p. 34).

Empirically speaking, this seems to be true much of the time. But why? Below I examine three arguments for why we might expect to observe hierarchical ordering more often than not. As you read through the arguments below, remember what we’re about here: We’re trying to explain hierarchical ordering because this would in turn be an explanation for why within-subjects designs are typically more powerful than between-subjects designs. The arguments below are far from decisive, but they are interesting and plausible.
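As a sanity check on the phenomenon we're trying to explain, here's a quick simulation (Python/scipy) of the same experiment run within- vs. between-subjects. The variance components are made up, but they respect hierarchical ordering, with the Subject variance four times the Subject-by-Treatment variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
var_S, var_SxT, var_E = 1.0, 0.25, 0.25   # assumed components (SxT < S)
delta, n, sims = 0.5, 20, 2000            # true effect; subjects per group/design

hits_within = hits_between = 0
for _ in range(sims):
    # Within-subjects: n subjects, each measured at pre-test and post-test
    s = rng.normal(0, np.sqrt(var_S), n)
    pre  = s + rng.normal(0, np.sqrt(var_SxT), n) + rng.normal(0, np.sqrt(var_E), n)
    post = s + delta + rng.normal(0, np.sqrt(var_SxT), n) + rng.normal(0, np.sqrt(var_E), n)
    hits_within += stats.ttest_rel(post, pre).pvalue < .05

    # Between-subjects: n fresh subjects per group, each measured once
    g_pre  = rng.normal(0, np.sqrt(var_S), n) + rng.normal(0, np.sqrt(var_SxT), n) \
             + rng.normal(0, np.sqrt(var_E), n)
    g_post = delta + rng.normal(0, np.sqrt(var_S), n) + rng.normal(0, np.sqrt(var_SxT), n) \
             + rng.normal(0, np.sqrt(var_E), n)
    hits_between += stats.ttest_ind(g_post, g_pre).pvalue < .05

print(hits_within / sims, hits_between / sims)  # within-subjects power is much higher
```

Set var_S to zero and the between-subjects design catches up (with its extra degrees of freedom it can even edge ahead), which illustrates the point made earlier that there is no logically necessary within-subjects advantage.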

  1. With small effects, linearity dominates

One possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly due to the range over which experimenters typically explore factors. In the limit that experimenters explore small changes in factors and to the degree that systems exhibit continuity of responses and their derivatives, linear effects of factors tend to dominate. Therefore, to the extent that hierarchical ordering is common in experimentation, it is due to the fact that many experiments are conducted for the purpose of minor refinement rather than broad-scale exploration” (p. 34)

Huh? Actually this is not too hard to understand if we consider the situation graphically. The basic idea is illustrated in the figure below.

[Figure: the response surface Y = XZ, zooming in on a single point so that the surface comes to resemble its tangent plane.]

Imagine that we’re studying some dependent variable Y as a function of two independent variables X and Z. The effects that X and Z have on Y might have any functional form: they could have a joint, multiplicative (interactive) effect on Y, they could individually have diminishing or asymptoting effects on Y, or any other crazy, nonlinear sort of effects we might imagine. I set up the example above so that the true relationship is actually an ideal interaction: Y = XZ.

The purpose of the plot is to illustrate that for any particular point in the predictor space (any pair of X and Z values), in a sufficiently small neighborhood around that point, the function is approximately linear. In the plot, we pick a point and zoom in on it closer and closer so that the local neighborhood we’re considering becomes smaller and smaller. As we do so, we can see that the function immediately around that point comes to resemble more and more closely a tangent plane to the response function.

In an experiment, our goal is to push around X and Z to some extent and observe the corresponding changes in Y. But if we’re studying small effects—so that we push around predictor values X and Z only a small amount from condition to condition—then we are staying within a small neighborhood, in which case any higher-order, nonlinear effects of X and Z (such as interactions) are going to account for little of the observed variance in Y compared to the linear effects of X and Z, even if the “true” response curve is highly nonlinear across a broader range of predictor values. Of course, in psychological research, small effects are extremely common, maybe even the norm. To the extent that this is true, the argument goes, we should expect higher-order terms like Subject-by-Treatment interaction variance to account for relatively little variance in the outcome, and lower-order terms like Subject variance to account for relatively more variance.
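The argument is easy to demonstrate numerically. Below is a sketch (Python) using the Y = XZ response surface from the example above: as the range over which we push X and Z around a point shrinks, the share of variance attributable to the interaction term vanishes relative to the linear terms. The point (1, 1) and the uniform perturbations are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def interaction_share(width):
    """Share of variance in Y = XZ due to the interaction term, when X and Z
    vary uniformly within +/- width of the point (1, 1)."""
    dx = rng.uniform(-width, width, 100_000)   # perturbations of X around 1
    dz = rng.uniform(-width, width, 100_000)   # perturbations of Z around 1
    # Exact expansion around (1, 1): Y = (1+dx)(1+dz) = 1 + dx + dz + dx*dz
    linear = dx + dz        # the tangent-plane (linear) part
    inter = dx * dz         # the higher-order (interactive) part
    return np.var(inter) / (np.var(linear) + np.var(inter))

print(interaction_share(1.0))    # wide range: interaction carries real variance
print(interaction_share(0.01))   # tiny range: interaction variance is negligible
```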

  2. Experimenters transform variables so that simple effects are predicted

Another possible explanation given by Li et al. (2006) is that hierarchical ordering is

“partly determined by the ability of experimenters to transform the inputs and outputs of the system to obtain a parsimonious description of system behavior […] For example, it is well known to aeronautical engineers that the lift and drag of wings is more simply described as a function of wing area and aspect ratio than by wing span and chord. Therefore, when conducting experiments to guide wing design, engineers are likely to use the product of span and chord (wing area) and the ratio of span and chord (the aspect ratio) as the independent variables” (p. 34).

This process described by Li et al. (2006) certainly happens in psychology as well. For example, in priming studies in which participants respond to prime-target stimulus pairs, it is common in my experience for researchers to code the “prime type” and “target type” factors in such an experiment so that the classic priming effect is represented as a main effect of prime-target congruency vs. incongruency, rather than as a prime type × target type interaction. And in social psychology, there are many studies that involve a my-group-membership × your-group-membership interaction effect, which is often better characterized and coded as a main effect of ingroup (group congruency) vs. outgroup (group incongruency). It seems natural to expect individual differences in these more robust effects to have greater variance than individual differences in the incidental effects, which are now coded as higher-order interactions, and this would give rise to hierarchical ordering in the random variance components. Other transformations that could have similar effects are things like analyzing the logarithm of response times, the square-root of count variables, etc.

  3. Nested designs confound higher-order effects into lower-order effects

A final argument is that lower-order effects, like the Subject variance in our simple within-subjects design, often implicitly contain more confounded sources of variation than higher-order effects. For example, we saw above that in the between-subjects version of our experiment, the Subject-by-Treatment interaction variance could not be estimated. It turns out that this variance is implicitly confounded with the Subject variance (actually, all three sources of variance are confounded if we observe each subject only once) so that the Subject variance we observe—let’s call it var(S’)—is really the sum of the Subject variance, var(S), and the Subject-by-Treatment variance, var(SxT), so that var(S’) = var(S) + var(SxT).

This confounding may not seem obvious at first, but it’s pretty easy to see graphically. Consider the totally contrived dataset below. These data are generated under a within-subjects design where the subjects have exactly 0 variance in their mean responses. But imagine that instead of observing the complete data in a within-subjects design, we observed only the pre-test scores for half of the subjects and only the post-test scores for the other half of the subjects, in a between-subjects design. It would appear in that case that some subjects are high responders and others low responders—that is, it would appear that there is stable Subject variance var(S)—but obviously this is totally due to unobserved variance in the subject slopes, var(SxT).

[Figure: a dataset with zero variance in the subject means but large variance in the subject slopes; observing only one time-point per subject would make the slope variance look like stable Subject variance.]

The same idea extends to higher-order interaction effects. If we have a random two-way interaction that we can conceive of as being part of an incompletely observed three-way interaction, then we can suppose that the variance in that two-way interaction implicitly contains the variance of that unobserved three-way interaction. If we then assume that there’s a limit to how high up the design hierarchy we can plausibly take this process, this implies that, in designs in which there is some nesting of factors (including, notably, all between-subjects designs), lower-order effects will tend to have relatively greater variance by virtue of being implicit sums of a greater number of confounded higher-order interaction effects.
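The confounding claim is also easy to check by simulation. In the sketch below (Python), I generate data with zero true Subject variance but substantial Subject-by-Treatment variance, observe each subject only once (as a between-subjects design would), and recover between-subject variance approximately equal to var(S) + var(SxT) plus error, the implicit sum described above. The specific variance values are arbitrary, and the SxT deviations are parameterized per cell; any parameterization leaks into the single observation in the same way:

```python
import numpy as np

rng = np.random.default_rng(2)

var_S, var_SxT, var_E = 0.0, 1.0, 0.1   # NO true subject variance at all
n = 200_000

s  = rng.normal(0, np.sqrt(var_S), n)    # subject means: all exactly 0 here
st = rng.normal(0, np.sqrt(var_SxT), n)  # subject-specific deviations in Treatment response
e  = rng.normal(0, np.sqrt(var_E), n)

# A between-subjects design sees only one observation per subject, e.g. pre-test:
pre = s + st + e

print(np.var(pre))  # close to var_S + var_SxT + var_E = 1.1, NOT var_S = 0
```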


Li, X., Sudarsanam, N., & Frey, D. D. (2006). Regularities in data from factorial experiments. Complexity, 11(5), 32–45.