As a spoiler, I estimate that Harvey caused ** roughly 50 – 120 year flood events** at most sites in Harris county (the Texas county where Houston is located) where we have data—

Disclaimer: I am not a hydrologist. In fact, I know almost nothing about hydrology. What I am is an out-of-work data scientist. I don’t claim that the analysis I present below is particularly savvy or sophisticated — it’s possible I’ve overlooked some issues that would be more obvious to the hydrologically initiated — but I do think my analysis at least addresses all or most of the major methodological issues that need to be handled in such an analysis.

As the sources I linked above explain very clearly, the *N*-year flood is a probabilistic notion based on the estimated chance that flood waters will reach a certain height in any given year. In a nutshell, an *N*-year flood is defined as the flood water height that a given area has a 1/*N* probability of reaching in a particular year. For example, a 500-year flood has a probability 1/500 or 0.2% chance of occurring in a particular year, *by definition*. Now, it does not automatically follow that an *N*-year flood is expected to occur about every *N* years on average, but that conclusion *does* follow if we add the somewhat reasonable assumption that the flood events are independent from one year to the next^{1}. That assumption is probably not exactly true, due to factors like El Niño / La Niña cycles (or Atlantic Niño in Harvey’s case), but probably a good enough approximation that the intuitive interpretation “500-year floods are expected every 500 years” is reasonably true.

Now, it would be fantastically unlikely for Houston to experience three 500-year floods in three years if these were “truly” 500-year floods. A much more likely explanation is that the underlying risk estimate — 0.2% chance of occurring in a year — is based on a bad statistical model. We can use the historical stream gage^{2} height data from the U.S. Geological Survey (USGS; see also Harris county, TX’s very nice interface) to come up with a better estimate.

Here’s a very high-level outline of what the analysis looks like:

- Query the USGS web server to retrieve the last 20 years of gage height data from all stream gage sites in Harris County, TX.
- Detrend the data using a one-year rolling median.
- Estimate the cumulative distributive function (CDF) for each site’s data using kernel methods. Then locate Harvey in the CDFs to obtain the daily risk estimates
- Convert the daily risk estimates to
*N*-year flood estimates by making some assumptions about the form of temporal dependence. - Bootstrap to obtain estimates of uncertainty.

Retrieving the data is not too exciting, so I won’t say much about it in this summary post other than to point out the USGS Daily Values Web Service, which is a really nice system the USGS has set up for retrieving nicely formatted data via HTTP requests. (For those interested in this sort of thing, all the details and syntax are documented in the R Markdown document.) Instead we’ll jump ahead to the part where we have a nice, clean dataset.

The data are daily mean gage heights for 21 stream gages spread through Harris County, TX. There are more than 21 gage sites in the county, but these are the ones that had mean gage height data available^{3}. Here’s what the data look like at each site:

It mostly looks pretty well-behaved, with the obvious exception being some huge, discontinuous jumps in gage height that appear at some of the sites. The top 6 sites show the biggest jumps ; as you can see, these are all (and the only ones) from Cypress Creek. I have no idea what went down at Cypress Creek to cause these discontinuities, or why the jumps don’t at least all happen at or around the same time.

In principle what we care about in measuring these flood levels are the deviations from the *typical* stream heights, whatever those typical heights might be at that time, even if they have changed over the years (whether that’s through a slow drift or a big change all at once). In other words, we are interested in the detrended data. Plus, more practically speaking, we want to be able to still use the data prior to those discontinuous jumps in stream height that we saw above. One way or another, that requires accounting for the jumps by detrending them out.

A rolling median is a good candidate method for a nonparametric estimator of the trend line here because (a) it won’t be affected much by big, sudden shocks to the gage height (e.g., from floods), thereby leaving that variability to go into the CDF, which we want; and (b) in theory, a rolling median can model the discontinuous jumps much better than, say, a rolling mean because the median is not a “smooth” statistic like the mean is.

There is the question of the optimal window width for the rolling median (i.e., how many observations to take the median over in each direction). One idea would be to regard this as a hyperparameter to be tuned via cross-validation, * but* note that in this case minimizing the expected prediction error is not really our goal. Rather, our goal is to remove the biggest, coarsest trends in the data, while leaving all of the variability around that trend line in the data so that it will go into the estimated CDF. The trend line that minimizes the expected prediction error would actually follow the data a bit too closely for our purposes, getting pulled upward by large flood events and thereby eating up some of the variability that we want to keep. In other words, we want the trend line to underfit the data to some extent.

With a little experimentation I found that a one-year window worked pretty well. Here’s an example of what that looks like at a single representative site:

Looks great! The trend line remains relatively flat even in the presence of large local floods, but still perfectly captures the weird discontinuous jump in the time series as well as any very slow drifts over time.

Let’s fit one-year moving medians to all of the sites, detrend the data around those median lines, and then examine the resulting detrended data:

It looks pretty good! There are a handful of odd downward spikes that are probably the result of the median line not capturing the discontinuous jumps as gracefully as possible, but these seem small enough not to worry about.

This step is pretty straightforward, at least thanks to the sROC package, which has nice functionality for kernel estimation of CDFs. Here’s an example of what the probability density function (PDF) and CDF look like for one representative site, and where the Harvey floods fall in this distribution.

Okay, so now we can look up the Harvey flood in the CDFs for each site, which will tell us the proportion of the distribution that is less than the Harvey flood. Let one minus this proportion be , the probability of a flood at least as great as Harvey on a particular day. Note that it’s for a particular day, not a particular year, because our data are *daily* gage heights. Now we just need to convert this to , the probability for a particular year, and then take to obtain the *N*-year flood estimate. So how do we do the conversion?

If the daily flood events were independent then we could use

,

which is the probability of at least one “success” from a Binomial distribution with 365 trials. But the daily flood events aren’t really independent: if there’s a flood event on one day, that substantially raises the probability that there will still be a flood event on the next day, because it takes a few days for the water levels to recede to below flood level. For example, here’s what the data look like from the big floods in Houston last year, summer 2016, again from our favorite representative gage site:

I drew a hypothetical flood height level to represent a particular *N*-year flood, just for illustrative purposes (I have no idea what flood level this line would actually correspond to). We can see that in this particular case, we had a string of five *N*-year flood days in a row. More generally, this should make it clear that we expect the daily flood events to be dependent in that they tend to come in clusters. So to convert to for Harvey, we need a model of this dependence.

A simple model is that the daily flood events show Markov or one-day-back dependence: the probability of a flood event today depends on the presence of a flood event yesterday, but is conditionally independent of flooding 2 days ago given the flood status yesterday. For this model we add a parameter : the conditional probability of a flood event, given that there was a flood event the previous day. Then the conversion formula from above becomes^{4}

,

which reduces to the previous formula when , meaning the events are independent.

To get a feel for how this new parameter affects the model estimates, we can plot the *N*-year estimates from all 21 sites as a function of the assumed value of (with set to the previously estimated values):

As you can see, the assumed value of doesn’t have a huge impact until it gets up to around, say, 0.6 or so, but then it starts to matter quite a lot. To think about what plausible values of might be, it helps to notice that, according to our dependence model, flood durations follow a Geometric distribution with parameter , implying that the expected flood duration is days. So, for example, when , then we expect it to take an average of 4 days before water levels recede to below the relevant *N*-year flood level (but not necessarily back to normal levels).

We’ve already developed a good estimate of using the kernel CDFs. To estimate , basically we can (a) choose a particular flood height, ; (b) categorize the gage heights at any given time point as being either above or below , giving us or , respectively; and (c) create a contingency table of vs. (the value on the previous day), which gives us a table like:

from which we can compute an estimate of as . Now this estimate, of course, depends on the flood height that we chose, and it’s certainly possible that there could really be different values of for different flood heights. So here we plot the estimated values of for all 21 sites, for heights ranging from the 90th percentile to whatever percentile corresponds to the Harvey floods at each site (in other words, for that site):

Immediately we see that there is considerable variation across gages in their level of dependence, and that the degree of dependence depends on the flood height (in an interesting, non-monotonic way). Because the estimates depend fairly strongly on the site and on the height of flooding, it wouldn’t be a good idea to use an aggregate estimate that averages across all sites and/or all flood heights. Instead, for our *N*-year flood estimates we will use each site’s individual estimate at the value corresponding to the Harvey flood height at that site.

We now have point estimates of and for all sites, which means we could technically now compute our *N*-year flood estimates for each site. But to use only the point estimates would be to ignore the substantial uncertainty associated with these estimates! Furthermore, because we obtain the *N*-year estimates from a convex transformation of and , using only the point estimates would lead us to systematically underestimate the *N*-year flood levels (due to Jensen’s inequality). So we will use bootstrapping to preserve the variability of the estimates.

Because our data are time series, a simple nonparametric bootstrap is not a good idea because the temporal structure of the data (such as the autocorrelation) would be destroyed in each bootstrap sample, which would ruin our estimates (although the estimates would probably still be okay). So here we will apply a more specialized bootstrapping method designed to work with time series data, called the stationary bootstrap. Then we will basically treat each bootstrap-resampled distribution as a poor man’s posterior distribution and go from there. Here’s what we get when we run the stationary bootstrap on a single, representative site:

…Huh. The bootstrap results are not totally crazy, but there are clearly a couple of issues. First and most obviously, the bootstrap estimates of have come out sort of discretized for some reason—there are only 6 unique values—and with a substantial fraction of the estimates equal to exactly zero (which makes them unusable). Do we really think the posterior distribution for truly has this kind of shape? Well, no, probably not. So, if you’ll indulge me in some admittedly ad-hoc adjustment, I’d like to apply a correction to “smooth out” the marginal distribution of the estimates. You can think of this as applying a bit of regularization “by hand.”

Now, I’m willing to trust that the mean and standard deviation of this distribution are more or less right; it’s just the shape that I don’t trust. So I’ll apply a two-stage^{5} transformation that reshapes the empirical distribution into the Beta distribution with the same mean and variance. Note that we “reshape” the existing distribution of values, rather than simply taking new draws from the corresponding Beta distribution, so that we preserve the form of the dependence between the and estimates. So here’s a graphical comparison of the old and new marginal distributions:

Now if we turn our attention to the bootstrap estimates of , we see a less obvious but still somewhat strange issue, which is that the conditional distribution at has a weird bimodal shape—in particular, there is an odd cluster of estimates that are effectively 0. Closer inspection of the estimates as a function of the flood height sheds some light on what’s going on here:

Recall from the Figure 8 caption that the estimates are predicted values from a weighted GAM fit (on the logit-logit scale) to the simple estimates resulting from the procedure illustrated Figure 7, because the simple estimates are rough and unstable, particularly at extreme values of the flood height . Basically what is happening here is that, for a nontrivial minority of the bootstrap-resampled datasets, the simple estimates become empirically undefined (i.e., they involve division by zero) above a certain value of , because by chance the resampled dataset ended up with no data points near or beyond that value of . So when that happens, the GAM simply does a linear extrapolation from the highest non-missing to the value corresponding to Harvey, and uses the extrapolated value as the bootstrap estimate. That works fine in a lot of cases, but in other cases it can yield wacky/divergent estimates, depending on where along the sequence the simple estimates become missing.

It seems like a good idea to discard the divergent bootstrap estimates. One way to do this is to fit a mixture model to the bootstrap estimates, use the model to classify each estimate as either probably coming from the “true” bootstrap distribution or probably coming from a (much wider) contaminant distribution, and then to discard the estimates classified as probably being contaminants. That process is illustrated in the right panel of Figure 11, which shows that it works pretty well, at least in the sense that the mixture model seems pretty plausible.

After applying this correction, along with the previously described adjustment to the estimates, we end up with an adjusted joint distribution of bootstrap estimates that looks like this:

So if we compute the adjusted bootstrap distributions for all 21 sites, then after much ado, that gives us the following final estimates for each site:

As you can see, most of the *N*-year flood estimates have a *lot* of uncertainty around them. Estimating the risk of extremely rare events is an inherently hard statistical problem, so it’s not surprising that it’s hard to pin down precise estimates.

Of course, the method I’ve applied here is just one way to approach the problem. Basically it’s the first thing I thought of. I initially intended this to be a rough-and-ready analysis to satisfy my curiosity, but after following the rabbit hole of Markov dependence and bootstrapping complications a bit further than I anticipated, I’m not so sure if the “rough-and-ready” description is still apt.

There is a temptation to try to aggregate in some way the individual *N*-year flood estimates at each site, in an attempt to estimate the *N*-year flood level of the Harvey floods as a single, atomic event. After all, media reports of the risk/severity of the Harvey floods just present a single figure like “500 year flood” or “1000 year flood,” not broken down by site. This kind of aggregation is problematic in at least a couple ways. First, on a more conceptual level, it’s not clear which sites should be aggregated over. A simple approach might be to simply aggregate over all the sites in Harris county, which is sort of what I did (I mean, I avoided aggregation, but I did simply analyze all the sites in Harris county). According to my girlfriend—who actually does know something about hydrology—what would probably make more sense would be to aggregate over all the sites within a particular watershed, ignoring geologically arbitrary county lines. Second, on a more methodological level, the statistical problem of estimating the probability of a rare event across the entire constellation of 21 sites is quite formidable. For example, a simple extension of my method would involve first estimating a 21-dimensional joint density of the flood heights at all gage sites… the thought of which is enough to make one cringe. (I actually spent quite a while trying to get this to work, but ultimately just didn’t trust the answers I was getting, so I decided not to post any of that here.)

This method I used here does not take into account the fairly obvious possibility that the yearly risk of Harvey-level flood events may not be constant, but instead may be increasing over time. Including such time-dependence in the model would make this already difficult statistical problem even more challenging. This increasing-risk hypothesis is certainly plausible, but is quite complicated to think about because it depends on a lot of factors, such as the rate and nature of current/future urban development in the Houston area, preparatory measures that may be taken up by Houston, the possibly increasing frequency of large hurricanes due to climate change, etc. Regarding this last point, it seems that the evidence for an increasing frequency of Category 4/5 hurricanes is rather mixed.

Finally, there is some reason to think that almost any method applied immediately in the wake of Hurricane Harvey will lead to some degree of overestimation of the probability of similar events in the future (or, equivalently, *N*-year flood estimates that are too small). This point is actually quite subtle, but bear with me. Hypothetically, across many possible timelines, if we always waited for a particular event (such as a flood of a certain size) to occur and then estimated the probability of that event immediately after it occurred, our probability estimates would end up inflated to some extent (on average across the timelines) due to a form of selection bias that seems related to the recently renewed discussion of hot hand effects. For a simplified model of how this selection bias would work, consider a situation where we are observing a sequence of tosses of an unfair coin that comes up heads (H) with probability , and we construct an estimate by waiting until we observe the first H and then computing the proportion of H outcomes in the sequence. Then

,

where is the probability mass function for a Geometric distribution. On average, this exceeds when is less than about 0.35 or so, and by quite a lot when H is rare. For example, if , then , so that the estimates overstate the probability of heads by almost a factor of 5 on average. This is not exactly what’s going on in our Harvey analysis, but it’s quite similar. It’s hard to know what to do (if anything) about this kind of selection bias, but it’s interesting to think about.

^{1} More specifically, that floods follow a Bernoulli process over years or a Poisson process in continuous time

^{2} The USGS consistently uses the spelling “gage” instead of the more familiar (to me) “gauge.” Your guess as to why is as good as mine. Anyway, I’ll follow the USGS’s spelling convention.

^{3} Note that I use the daily *means* here, even though arguably it might make more sense to use the daily *maximums*. But unfortunately the daily maximums are only available for 3 of the gages in Harris county.

^{4} See the R Markdown document for the derivation of this.

^{5} The first step of the transformation—transforming the observed distribution to a Uniform distribution—is achieved in this case by taking , where is the number of values, and *breaking ties randomly* so that the 6 unique values become a smooth distribution.

Link to original TalkStats thread, September 17, 2013.

Today a labmate asked the following question: if we have three random variables x, y, z, and we know the correlations and , what constraints if any does this place on the correlation ?

At the time I reflexively answered that the remaining correlation must be the product of the two known correlations. Which of course is totally wrong. I think I was getting some mental interference from some of the equations for simple mediation floating around in my head. Anyway, after thinking about it for a while I have come up with a convincing geometric argument for what the constraints actually are. I have also verified that my answer agrees with a more complicated-looking answer to this question that I found elsewhere online.

Because I ending up spending a lot of time on this and I thought some of you would find the results interesting, I thought I would share my work here. Comments are welcome!

Okay. So imagine our variables x, y, z as vectors in n-dimensional space. The Pearson correlation coefficient between any two of these variables can be interpreted as the cosine of the angle between the corresponding vectors. This is an interesting and well-known geometric fact about correlation coefficients.

So now imagine that the x and y vectors are fixed (and hence so is their correlation), but that the vector z is free to vary so long as is constant. This constraint on means that the set of possible z vectors will form a sort of “cone” around the y vector, as in the following image:

Now it is intuitively obvious (I know this is a sneaky phrase, but that’s why I call this just an “argument” and not a “proof”) that the two possible z vectors that will lead to the minimum/maximum values of are the z vectors that lie on the same plane as the x and y vectors. This leads to the following expression for the minimum/maximum values of given and :

.

One notable result following from this is that if x is orthogonal to y and y is orthogonal to z, then there is no constraint on , it can be anywhere from -1 to +1. But under any other circumstances, fixing and will place some constraint on the range of .

Okay, now for the verification part, which requires a bit of math.

So in this stats.stackexchange.com thread it is stated that the three correlations must satisfy

,

the reasoning here being because this is the determinant of the correlation matrix and it cannot be negative. Anyway, this can be viewed as a quadratic inequality in , already in standard form:

.

So if we apply the quadratic formula and simplify the result, we get the following for the minimum/maximum values of :

.

Now taking my answer and applying the trig identity we get

.

Now applying the identity we get

,

which is the answer we got from the stackexchange post. So our simpler, geometrically based answer agrees with the more conventional answer that is harder to understand.

]]>Link to original TalkStats thread, June 24, 2012.

Following the discussion in this recent thread and spunky’s request therein, here is a neat trick one can use to test arbitrary contrasts for ANOVA models, and even multiple degree of freedom tests, using only some basic summary statistics of the kind that would be reported in a manuscript — *without* needing access to the raw data.

We need the following three pieces of information to do these tricks:

- the cell means
- the sample sizes per cell
- at least one F-ratio corresponding to any of the possible contrasts from the ANOVA model

So let’s say we have a manuscript on our desk in which the authors conducted a 2*2 factorial ANOVA, with factors (having levels and ) and (having levels and ). They give a table of means and sample sizes by cell, but for whatever reason only report the test of the interaction. So we have the following information:

```
# cell means
round(tapply(dat$y, list(A = dat$A,B = dat$B), mean), 2)
# B
# A -1 1
# -1 103.54 99.28
# 1 99.59 102.61
# cell sample sizes
table(A = dat$A,B = dat$B)
# B
# A -1 1
# -1 18 23
# 1 23 16
# t statistic for interaction contrast
summary(lm(y ~ A + B + AB, data=dat))$coef["AB",]
# Estimate Std. Error t value Pr(>|t|)
# 1.819051097 0.570385136 3.189162871 0.002073302
```

Perhaps as curious readers we are interested in knowing whether the cell differs from the other three cells. In other words we want to test the contrast below labeled “new”:

```
cbind(contr, new = c(3,-1,-1,-1))
# A B AB new
# [1,] 1 1 1 3
# [2,] 1 -1 -1 -1
# [3,] -1 1 -1 -1
# [4,] -1 -1 1 -1
```

Following the formula here, the F-ratio can be computed as

where and are the numbers of parameters in the full model and the nested model, respectively; and is the total sample size.

So the only two missing quantities here are SSR and SSE. If we can get those we can compute the desired F-ratio.

Given a particular contrast ,

SSR for =

where is the contrast weight for group , is the mean for group , is the number of groups, and is the number of observations in group .

So in this data we have

.

Now we need to get SSE. To do this, we can use the same formula to compute SSR for a contrast for which we already know F, and then rearrange the F-ratio formula to solve for SSE.

So for the known interaction contrast we have

.

Solving the F formula for SSE gives

.

Since , we can now just plug in the numbers to get

,

so that finally we have

.

And we can check our work by running anova() on the dataset after recoding the contrasts:

```
anova(lm(y ~ other1 + other2 + new, data=dat))
# Analysis of Variance Table
#
# Response: y
# Df Sum Sq Mean Sq F value Pr(>F)
# other1 1 38.06 38.056 1.4988 0.224639
# other2 1 191.60 191.604 7.5462 0.007505 **
# new 1 41.50 41.496 1.6343 0.205002
# Residuals 76 1929.70 25.391
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Aside from some minimal rounding error, we have it.

Given the same data as above, suppose now that for some reason we wish to see the 2 degree of freedom test comparing the full ANOVA model (factors A, B, and their interaction) to a model that only includes factor A.

We already saw above how to solve for SSE (and in fact we already computed it–we are using the same Full model so SSE will be the same). However, here we will get SSR in a slightly different way, using

,

where is the predicted value for observation under the smaller or reduced model, is the predicted value for observation under the larger or full model, and is the number of observations. Essentially, we are treating the predicted values from the more complex model as the data to be predicted and then computing the sum of squared errors in the normal fashion.

In the ANOVA case, this formula can be written more simply as

,

where is the number of observations in group , is the predicted value for group under the smaller or reduced model, is the predicted value for group under the larger or full model, and is the number of groups.

The predicted values from the Large model are straightforward: they are the group means. For the Small model, we have two sets of predicted values, those for and , and in both cases these predicted values are weighted averages of the two cell means at each level (i.e., collapsing across the factor), weighted by cell size.

For :

For :

So using the simplified SSR formula, we have

,

which makes our F-ratio

.

Checking our work:

```
anova(lm(y ~ A, data=dat),
lm(y ~ A + B + AB, data=dat))
# Analysis of Variance Table
#
# Model 1: y ~ A
# Model 2: y ~ A + B + AB
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 78 2198.8
# 2 76 1929.7 2 269.06 5.2983 0.007013 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

And again we have it, save for minimal rounding error.

]]>**List last updated: June 20, 2015**

Baayen, R. H. (2008). __Mixed models.__ Chapter 7 in *Analyzing linguistic data: A practical introduction to statistics using R*. Cambridge University Press.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). __Mixed-effects modeling with crossed random effects for subjects and items.__ Journal of memory and language, 59(4), 390-412.

Brysbaert, M. (2007). __The language-as-fixed-effect-fallacy: Some simple SPSS solutions to a complex problem.__ London: Royal Holloway, University of London.

Carson, R. J., & Beeson, C. M. (2013). __Crossing language barriers: Using crossed random effects modelling in psycholinguistics research.__ Tutorials Quant Meth Psych, 9(1), 25-41.

Hoffman, L., & Rovine, M. J. (2007). __Multilevel models for the experimental psychologist: Foundations and illustrative examples.__ Behavior Research Methods, 39(1), 101-117.

Janssen, D. P. (2012). __Twice random, once mixed: Applying mixed models to simultaneously analyze random effects of language and participants.__Behavior Research Methods, 44(1), 232-247.

Judd, C. M., Westfall, J., & Kenny, D. A. (2012). __Treating stimuli as a random factor in social psychology: A new and comprehensive solution to a pervasive but largely ignored problem__. *Journal of Personality and Social Psychology, 103*(1), 54-69.

Locker, L., Hoffman, L., & Bovaird, J. A. (2007). __On the use of multilevel modeling as an alternative to items analysis in psycholinguistic research.__Behavior Research Methods, 39(4), 723-730.

Quené, H., & Van den Bergh, H. (2004). __On multi-level modeling of data from repeated measures designs: A tutorial.__ Speech Communication, 43(1), 103-121.

Quené, H., & Van den Bergh, H. (2008). __Examples of mixed-effects modeling with crossed random effects and with binomial data.__ Journal of Memory and Language, 59(4), 413-425.

Richter, T. (2006). __What is wrong with ANOVA and multiple regression? Analyzing sentence reading times with hierarchical linear models.__ Discourse Processes, 41(3), 221-250.

Sorensen, T., & Vasishth, S. (2015). __Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists.__

Winter, B. (2014). __A very basic tutorial for performing linear mixed effects analyses.__ arXiv preprint arXiv:1308.5499.

Well, I just discovered a new tool for reasoning about missingness that I’m pretty excited about, and I’d like to share it with you. Basically the idea is to use causal graphs to graphically represent our assumptions about the patterns of causality among the observed variables *and their missingness mechanisms*. And then we can visually follow simple path-tracing algorithms along the graph to answer questions like

- What parameters (such as regression coefficients or population means) can I
—meaning construct estimates that are consistent with what we would get with no missing data—by just analyzing the non-missing cases (i.e., listwise deletion)? This is called ignorable missingness.*recover* - For missingness that is ignorable, which auxiliary variables should I condition on (e.g., statistically adjust for or include in an imputation model), and which should I actually
*avoid*conditioning on?

This approach works so well because causal graphs are all about deducing which observed variables should be conditionally independent given which other variables—and it so happens that Rubin’s theory of missing data is naturally phrased in terms of conditional independencies.

There’s a lot to say about DAGs… frankly too much to fit in one tutorial-style blog post. For a crash course in what they are and how the work, I’d recommend the resources listed here. In the interest of keeping this post as concise as possible, I’ll assume you either have some prior familiarity with DAGs or that you’ve perused some of the crash course material I just referenced, but I will at least provide brief reminders about key concepts as they are needed.

The big idea is to augment the usual DAGs by adding nodes that represent the missingness in any variables that contain missing values, creating a so-called *m*-graph. For example, here’s a DAG for a situation where we have a predictor of interest *X*, an outcome *Y*, and two concomitant variables *W* and *Z:*

The causal model underlying this DAG assumes that

*X*directly causes*Y*and*Z**Z*and*Y*are associated through an unobserved confounder^{1}*W*causes*Y*independently of*X*.

Now we suppose that *Y* and *X* contain missing values, so we add corresponding missingness nodes and . Specifically, we create the following *m*-graph:

What are we saying about the missingness in *Y* and *X* when we posit this causal model? We are saying that

- The missingness in
*X*is directly caused by the (partially observed) values of*X*itself. An example of this would be if high-income respondents to a survey were less likely to disclose their incomes specifically because they didn’t want their high income to be known. In other words, high values are missing*because they are high values*. In Rubin’s terminology,*X*is*missing not at random*(**MNAR**). - The missingness in
*Y*is statistically associated with the values of*Y*, but this association is entirely due to the common cause*W*. An example of this would be if a high level of education (W) both causes income (Y) and causes respondents to (for some reason) be less likely report their income (). In Rubin’s terminology,*Y*is missing at random (**MAR**) given*W*.^{2}

Now that we’ve laid out our assumptions about how missingness is related to the relevant observed and unobserved variables, we can apply relatively simple graphical criteria to help answer the kinds of questions I laid out near the beginning of this post. These criteria are due to the amazing work of Karthika Mohan and colleagues (see References). Here I’m only going to focus on the conditions for recovering regression coefficients, but this research also lays out criteria for recovering the full joint, conditional, or marginal distributions of all variables in the graph.

We have separate *necessary* or *sufficient* conditions for recovering the coefficients from a regression predicting *Y*. A *necessary* condition is:

*Y*(the outcome) cannot directly cause (or vice versa, although that would be strange).

We can see in the *m*-graph that Condition 1 is satisfied. Although *Y* and are statistically related (through their common cause *W*), *Y* does not directly cause . The predictor *X* *does* directly cause its missingness node , but this doesn’t matter for estimating a regression of *Y* on *X*. To give an intuition about why we can have MNAR in *X* but not *Y*, consider this figure from Daniel et al. (2012):

For simplicity, the figure considers a crude missing data mechanism where the observation is simply removed if it exceeds some fixed value (i.e., truncated). In the top panel (a), the missingness depends directly on the value of the predictor *A*. Despite this, we can still recover the slope simply by analyzing the complete cases. In the bottom panel (b), the missingness depends directly on the value of the outcome *Y*. As the figure shows, this has a distorting influence on the slope which precludes the consistent recovery of the regression coefficient.

Meeting the necessary Condition 1 alone does not guarantee that we can recover the regression coefficient of interest; it just says that it might be possible. A *sufficient* condition for recoverability of the regression of *Y* on *p* predictors is:

*Y*is*d*-separated from the missingness nodes by the predictors .

Recall that two nodes *A* and *B* are *d*-separated by *C* if conditioning on *C* blocks all open paths between *A* and *B* (keeping in mind that colliders act “in reverse”: colliders block open paths unless they are conditioned on). If Condition 2 is satisfied, then we can recover the regression of *Y* on simply by analyzing the non-missing observations at hand.

Glancing at the *m*-graph in Figure 2 we can see that the regression of *Y* on *X* does not meet Condition 2, because *Y* and remain d-connected through *W*, which is not in the set of predictors.

So basically we have two options for how to proceed. The first and easier option is to consider instead a regression of *Y* on both *X* and *W*. Essentially, we decide that rather than seeking , the simple regression coefficient of *Y* on *X*, we will settle instead for , the partial regression coefficient that adjusts for *W*. In that case, Condition 2 is satisfied because this new predictor *W* *d*-separates *Y* from . So we could recover the coefficients from this new multiple regression simply by analyzing the non-missing observations at hand.

The second option is for when we really do want and not . In the graph from Figures 1 and 2, there’s not really any good reason for preferring one over the other, since the graph implies that *X* and *W* are independent in the population and thus . So in that case we may as well just estimate , since that would satisfy Condition 2. But what if the situation were slightly different, say, with *X* being a cause of *W*?

In this new graph, the path is part of the total causal effect of *X* on *Y*, so we don’t want to condition on *W*. In this case, instead of Condition 2 we consider a new condition that allows for a set of *q* auxiliary variables that help *d*-separate *Y* from the relevant missingness nodes, but that are not conditioned on in the regression of *Y* on the predictors . The (*sufficient*) condition is:

*Y*is*d*-separated from the missingness nodes by the set of predictors and auxiliary variables .**Furthermore**, for any auxiliary variable that contains missing values, we can recover the coefficient from regressing on the predictors by Condition 2 or by a recursive application of Condition 3.

If Condition 3 is satisfied, then is recoverable by constructing an estimate from (a) the regression of *Y* on the predictors and auxiliary variables and (b) the regressions of each auxiliary variable on the predictors. For example, in the simple case where the structural equation for *Y* is linear and additive in *X* and *W*, we can use the well-known fact that

In the present case, conditioning on the auxiliary variable *W* satisfies Condition 3 because (a) *W* *d*-separates *Y* from , and (b) *W* contains no missing values.

So now let’s return to the *m*-graph in Figure 2, where *X* does *not* cause *W*. We see that if we do *not* condition on *W* by adjusting for it in the regression model, then we still meet the necessary condition for recoverability (Condition 1), but we fail to meet the sufficient conditions for recoverability (Conditions 2 and 3). So we fall in a sort of in-between case… is the estimate of recoverable or not? (Note that we saw above that it is recoverable if we apply the procedure dictated by Condition 3, but here we are asking if we can “*directly*” recover the estimate just from the simple regression of *Y* on *X* among the non-missing cases.)

It’s impossible to have complete necessary-and-sufficient conditions in this situation because may or may not be directly recoverable, *depending on the functional forms of the structural equations*. Remember that a DAG is totally nonparametric in that the structural equations it specifies can have any functional form and the variables can have any distributions (within reason). So for the DAG in Figures 1 and 2, the regression of *Y* on *X* and *W* could be a linear, additive function of *X* and *W*, or an interactive/multiplicative function, or any crazy nonlinear function. It turns out that, in this particular case, **is** directly recoverable if the effects of *X* and *W* on Y are linear and additive, but **is not** directly recoverable if the regression involves an interaction. This figure should help to give some intuition about why that’s the case:

For simplicity, Figure 5 considers an example where the auxiliary variable *W* is a binary dummy variable. In the additive example, simply analyzing the non-missing cases yields the same estimate (on average) as if all the data were fully observed—whether or not we condition on *W*. In the interactive example, the group with more missing data has a greater slope, so analyzing the non-missing cases skews the total estimate toward the group with the smaller slope. But by conditioning on *W*, we can either (a) recover the conditional effect of *X* on *Y* given *W* (through Condition 2) simply by analyzing the non-missing cases, or (b) recover the marginal effect of *X* on *Y* (through Condition 3) by taking a weighted mean of the regression lines at *W*=0 and *W*=1, with the weights proportional to the number of cases in each group prior to missingness (which we know, despite the missingness, since the missing values still have partial records on the other observed variables).

While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic. It’s often assumed that having data missing not at random (MNAR) basically trashes your analysis if a non-trivial fraction of data are missing, which is frequently true in observational data. But the necessary and sufficient conditions laid out here suggest that, in fact, the simple strategy of listwise deletion—simply “ignoring” the missingness and analyzing the non-missing observations that are at hand—yields robust estimates under a pretty wide range of missingness mechanisms. And even in a lot of cases where you can’t just ignore the missing data problem, you can often still construct a consistent estimate of your parameter of interest (as in Condition 3) without needing to use fancy procedures like multiple imputation or distributional modeling of the predictors. Even further, you may still be able to recover the parameter estimates of interest even if you can’t satisfy the sufficient conditions given here (as in the final example above). Finally, and more anecdotally, simulations I played around with while preparing this post suggest that, even when the regression coefficient of interest is not technically recoverable, the magnitude of bias is often small under many realistic conditions (see, for example, the magnitude of bias in the simulations of Thoemmes et al., 2015).

Daniel, R. M., Kenward, M. G., Cousens, S. N., & De Stavola, B. L. (2012). Using causal diagrams to guide analysis in missing data problems. *Statistical methods in medical research*, *21*(3), 243-256.

Mohan, K., Pearl, J., & Tian, J. (2013). Graphical models for inference with missing data. In *Advances in neural information processing systems* (pp. 1277-1285).

Mohan, K., & Pearl, J. (2014). Graphical models for recovering probabilistic and causal queries from missing data. In *Advances in Neural Information Processing Systems* (pp. 1520-1528).

Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in missing data problems. *Multivariate behavioral research*, *49*(5), 443-459.

Thoemmes, F., & Mohan, K. (2015). Graphical representation of missing data problems. *Structural Equation Modeling: A Multidisciplinary Journal*, *22*(4), 631-642.

^{1} Recall that in DAGs, doubled-headed arrows like are just shorthand notation for , indicating an unobserved common cause *L*.

^{2} Rubin’s third type of missingness, *missing completely at random* (MCAR), would be represented graphically by the relevant missingness node *R* being completely disconnected from all other nodes in the graph. An example of this would be if we flipped a coin for each observation and deleted the corresponding value when the coin landed heads. Although commonly assumed for convenience, MCAR is not actually very common in practice unless it is deliberately built into the study design.

We’re used to thinking about the number of participants as being, apart from the effect size, the chief factor in determining the statistical power of a study. And most of the time, in the kinds of single-lab studies we tend to run, this is basically true. Other factors matter as well — for example, the number of times we observe each participant, and the proportion of observations in each cell of the design — but as long as these other factors are within a typically sane range, they tend not to matter as much as the number of participants.

So it is perhaps natural that we apply this same heuristic to multi-lab replication projects. We reason that even if each lab can only recruit, say, 100 participants, and even if we can only recruit, say, 5 labs, this will give us a total of 500 participants, so statistical power will still be quite high! Right?

But here’s the thing. The reason the number of participants has such a big impact on power in the single-lab case is that, in those cases, the participants are the highest-level units or clusters in the design. That is to say, we can potentially have multiple observations of each participant — for example, we collect 50 reaction times from each participant — but these observations are clustered within participants, which are the high-level units. It turns out that the proper generalization of the “number of participants is important” heuristic is, in fact, “the number of *highest-level units* is important — the lower-level units, not as much.”

So now consider a multi-lab replication project. Here, the participants are clustered in labs. So the *labs* are the highest-level units. Remember the earlier example about having a study with 5 labs, each with 100 participants? In terms of its statistical power, this would be about like running a single-lab study with 5 participants, each of whom contributes 100 reaction times. In other words, it wouldn’t be great.

Let’s look at some actual power results. We consider a simple multi-lab design where we have labs, each of which recruits participants that are divided into two conditions ( participants per condition), and we observe each participant only a single time. In other words, we have a simple two-group between-subject experiment that is being replicated at different labs, and the labs have random effects. The key quantity for determining the power of the study is , the noncentrality parameter (for a non-central t distribution). It looks like this:

where is the effect size (Cohen’s d), is the proportion of random variation due to error variance (i.e., the ratio of the Error variance over the [weighted] sum of all the variance components), and is the proportion of random variation due to Lab variance (actually, it’s the proportion of Lab-by-Condition interaction variance, but I’m calling it Lab variance for short). Statistical power is pretty much determined by the noncentrality parameter — there’s technically also some influence of the degrees of freedom, but that tends not to matter much as long as it is above, say, 20 or so. So basically, we can understand the power of this multi-lab replication design by considering this noncentrality parameter expression.

First, let’s just plug a range of plausible values into the variables comprising and see what values of statistical power they imply. Here’s a series of contour plots where we vary the variables within plausible ranges.

The middle panel represents what I think are the most plausible values. There are a couple of interesting things to point out about these power results. The first is that…

The way to see this in the graphs is to consider the angle of the contours in each plot. More specifically, for any given point (i.e., pair of sample sizes) in any of the plots, consider the direction in which we would want to step to increase power fastest. For most parameter combinations, the path of steepest ascent up the power surface goes up along the y-axis (number of labs) more than it goes sideways along the x-axis (participants per lab). This is especially true when there is a lot of Lab variance (, in the right-hand column), but is still usually true when there is little Lab variance (, in the left-hand column).

There is another way of visualizing this that makes this point more clear. It uses the idea of *indifference curves* — technically speaking, curves where the rate of change in the noncentrality parameter w.r.t. the number of labs is equal to the rate of change w.r.t. participants per lab. The way these indifference curves are plotted below, studies that are *below* the relevant indifference curve would get a greater power benefit from increasing the number of labs, and studies that are *above* the relevant indifference curve would get a greater power benefit from increasing the number of participants per lab. For studies that lie exactly on the indifference curve, power would increase just as fast by increasing the number of labs as by increasing the number of participants per lab.

As you can see, most of the time there is a greater benefit (i.e., statistical power will increase faster) to increasing the number of labs. This is especially true if there is a lot of Lab variance. But it tends to be true even when there is little Lab variance. The cases where it makes more sense to increase the number of participants per lab are when you already have a large number of labs but a small number of participants per lab. And let’s face it, your multi-lab replication project is probably not in this part of the space.

This one is important. We tend to assume that as long as we continue to recruit participants, eventually we will have high statistical power. So even if we didn’t recruit as many labs for our project as we had hoped, we should be able to compensate for this by just recruiting more participants per lab — right? Unfortunately it isn’t so. The truth is that if we hold constant the number of labs, statistical power does *not* approach 100% as the number of participants per lab approaches infinity. Instead, power approaches some *maximum attainable power value* that can possibly be quite small, depending on the effect size, number of labs, and the Lab variance.

This can actually be seen pretty easily by considering again the expression of the noncentrality parameter:

In the limit as approaches infinity, the term in the denominator involving disappears, but the term involving does not, so the whole noncentrality parameter converges to a finite and possibly small value. Here’s what the situation looks like graphically for a handful of representative values of the variables:

The curves in each panel are unlabeled because, honestly, they don’t really matter. (If you want to know the power values for specific parameter combinations, the first graph is better for that anyway.) The point is just to show that when we increase the number of labs, power does eventually approach 100%, and the values of the other variables simply affect how quickly this happens. But when we increase the number of participants per lab — but, crucially, hold constant the number of labs — power *sometimes* approaches a value close to 100%, but often does not. The maximum attainable power will tend to be low when (a) the effect size is small, (b) the number of labs is small, (c) the Lab variance is high.

The main conclusion is pretty simple. Multi-lab replication projects are great, and you should do them… but when you’re designing them, you should really try hard to recruit as many labs to collaborate in the project as you possibly can. You actually don’t need that many participants from each lab — the first figure shows that unless the Lab variance is tiny, you get quite diminished returns by recruiting more than 100 or so participants per lab — so perhaps this is a selling point that you can use to convince more labs to join your project (“we would only need you to run a few dozen participants!”).

If you want to do your own power analyses for multi-lab designs like this, or for other more complicated designs, you can use my PANGEA app.

]]>(Update: Jeff posted his answer to the effect size puzzler here.)

As it happens, I’ve been meaning for a while to write a blog post on issues with computing d-like effect sizes (or other standardized effect sizes) for within-subject designs, so now seems like a good time to finally hammer out the post.

Jeff didn’t actually say anything to restrict us to standardized mean difference type measures (as opposed to, say, variance explained type measures), and we can only guess whether the “correct” effect size he has in mind is a d-like measure or an -like measure or what. But here I’ll focus on d-like measures, which are perhaps the most popular for studies with categorical predictors, and offer plenty of complications that I think are often under-appreciated (and it seems Jeff would agree).

What I’ll show here is that there are at least 5 different and non-equivalent ways that people might compute a d-like effect size (which they would invariably simply call “Cohen’s d”) for Jeff’s dataset, and the resulting effect sizes range from about **0.25** to **1.91**. I’ll compare and contrast these procedures and ultimately choose one that I think is the least crazy, if you must compute a standardized effect size (more on that later). When I read a paper that reports “Cohen’s d” for a within-subject design, I usually have no idea which of these 5 procedures the authors actually applied unless I try to reproduce the effect size myself from some of the descriptives in the paper, which is often not possible.

Jeff’s dataset can be downloaded here. Here’s some code to load it into R, examine it, print the two conditions means, and save the mean difference as

`md`

:

This is essentially a straightforward application of the formula provided by Cohen himself:

The numerator is the mean difference. Cohen never actually gives a clear, unambiguous definition of in the denominator, but it is usually taken to be the pooled standard deviation, that is, the square root of the average variance in each condition (assuming equal sample sizes in both conditions). In R code:

```
(d <- md / with(dat, sqrt(mean(tapply(rt, cond, var)))))
# 0.2497971
```

The crucial thing to recognize about applying the classical Cohen’s d is that it deliberately ignores information about the design of the study. That is, you compute d the same way whether you are dealing with a between-subjects, within-subjects, or mixed design. Basically, in computing d, you always treat the data as if it came from a simple two-independent-groups design. I don’t want to get bogged down by a discussion of why that is a good thing at this point in the post—I’ll get into that later on. For now I just note that, with this effect size, within-subject designs tend to be powerful *not* because they lead to larger effect sizes—if anything, the reverse is probably true, in that people elect to use within-subject designs when Cohen’s d is particularly small, for example in many reaction time studies—but rather because they allow us to efficiently detect smaller effect sizes due to removing irrelevant sources of variation from the denominator of the test statistic.

For this next way of computing a d-like effect size, the section heading pretty much says it all. In Jeff’s dataset, each participant makes a total of 100 responses, 50 in each condition. The computation of proceeds by first averaging over the 50 responses in each subject-by-condition cell of the experiment, so that each subject’s data is reduced to 2 means, and then applying the classical d formula to this aggregated data. In R code:

```
sub_means <- with(dat, tapply(rt, list(id, cond), mean))
(d_a <- md / sqrt(mean(diag(var(sub_means)))))
# 0.8357347
```

Richard Morey blogged about some weird things that can happen when you compute a standardized effect size this way. The basic issue is that, unlike classical Cohen’s d, does not ignore all the design information: will tend to be larger when there are more replicates, that is, when each subject responds more frequently in each condition.

A third way to compute a d-like effect size is to reduce each subject’s data to a single difference score—the mean difference between their responses in each condition—and then use the standard deviation of these difference scores as the denominator of d. Cohen actually discusses this statistic in his power analysis textbook (Cohen, 1988, p. 48), where he carefully distinguishes it from the classical Cohen’s d by calling it . In R, we can compute this as:

```
(d_z <- md / sd(sub_means[,2] - sub_means[,1]))
# 1.353713
```

There is a straightforward relationship between and the test statistic: , where is the paired-samples t-statistic from a within-subjects design and is the number of subjects. One might regard this as a virtue of . I will argue below that I don’t think it’s a good idea to use .

In a simple two-independent groups design, one can compute the classical Cohen’s d from the t-statistic using

,

where is the independent-samples t-statistic for a between-subjects design and is the number of subjects per group. Many, many authors over the years have incorrectly assumed that this same conversion formula will yield sensible results for other designs as well, such as in the present within-subjects case. Dunlap et al. (1996) wrote a whole paper about this issue. One might suppose that applying this conversion formula to will yield , but we can see that this is not the case by solving the equation given in the previous section for , which yields . In other words, naive application of the between-subjects conversion formula yields an effect size that is off by a factor of .

To compute in R:

```
t_stat <- t.test(sub_means[,2] - sub_means[,1])$statistic
(d_t <- t_stat*sqrt(2/nrow(sub_means)))
# 1.914439
```

Finally, one can compute a d-like effect size for this within-subject design by assuming that the in the classical Cohen’s d formula refers to the standard deviation of the residuals. This is the approach taken in Rouder et al. (2012) on Bayes factors for ANOVA designs. In between-subjects designs where each subject contributes a single response, this is equivalent to classical Cohen’s d. But it differs from classical Cohen’s d in designs where subjects contribute multiple responses.

To compute in R, we need an estimate of the residual standard deviation. Two ways to obtain this are from a full ANOVA decomposition of the data or by fitting a linear mixed model to the data. Here I do it the mixed model way:

```
options(contrasts=c("contr.helmert","contr.poly"))
library("lme4")
mod <- lmer(rt ~ cond + (cond||id), data=dat)
summary(mod)
# Random effects:
# Groups Name Variance Std.Dev.
# id (Intercept) 0.0026469 0.05145
# id.1 cond 0.0001887 0.01374
# Residual 0.0407839 0.20195
# Number of obs: 2500, groups: id, 25
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 0.779958 0.016402 47.55
# cond 0.052298 0.008532 6.13
```

The residual standard deviation is estimated as 0.20195, which gives us = **0.259**. It turns out that, for this dataset, this is quite close to the classical Cohen’s d, which was 0.25. Basically, classical Cohen’s d is equivalent to using the square root of *the sum of all the variance components* in the denominator^{1,2}, rather than just the square root of the residual variance as uses. For this simulated dataset, the two additional variance components (intercepts and slopes varying randomly across subjects) are quite small compared to the residual variance, so adding them to the denominator of the effect size does not change it much. But the important thing to note is that for other datasets, it is possible that and could differ dramatically.

Contrary to Jeff, I don’t really think there’s a “correct” answer here. (Well, maybe we can say that it’s hard to see any justification for computing .) As I put it in the comments to an earlier blog post:

Basically all standardized effect sizes are just made-up quantities that we use because we think they have more sensible and desirable properties for certain purposes than the unstandardized effects. For a given unstandardized effect, there are any number of ways we could “standardize” that effect, and the only real basis we have for choosing among these different effect size definitions is in choosing the one that has the most sensible derivation and the most desirable properties relative to other candidates.

I believe that classical Cohen’s d is the option that makes the most sense among these candidates. Indeed, in my dissertation I proposed a general definition of d that I claim is the most natural generalization of Cohen’s d to the class of general ANOVA designs, and I considered it very important that it reduce to the classical Cohen’s d for datasets like Jeff’s. My reasoning is this. One of the primary motivations for using standardized effect sizes at all is so that we can try to meaningfully compare effects from different studies, including studies that might use different designs. But all of the effect size candidates other than classical Cohen’s d are affected by the experimental design; that is, the “same” effect will have a larger or smaller effect size based on whether we used a between- or within-subjects design, how many responses we required each subject to make, and so on. Precisely because of this, we cannot meaningfully compare these effect sizes across different experimental designs. Because classical Cohen’s d deliberately ignores design information, it is at least *in-principle* possible to compare effect sizes across different designs. Morris and DeShon (2002) is a nice paper that talks about these issues. Bakeman (2005) also has some great discussion of essentially this same issue, focused instead on “variance explained”-type effect sizes.

Although I don’t really want to say that there’s a “correct” answer about which effect size to use, I will say that if you choose to compute , , or anything other than classical Cohen’s d, just please do not call it Cohen’s d. If you think these other effect sizes are useful, fine, but they are ** not** the d statistic defined by Cohen! This kind of mislabeling is how we’ve ended up with 5 different ways of computing “Cohen’s d” for within-subjects designs.

Finally, there is a serious discussion to be had about whether it is a good idea to routinely summarize results in terms of Cohen’s d or other standardized effect sizes at all, even in less complicated cases such as simple between-subjects designs. Thom Baguley has a nice paper with some thoughtful criticism of standardized effect sizes, and Jan Vanhove has written a couple of nice blog posts about it. Even Tukey seemed dissatisfied with the enterprise. In my opinion, standardized effect sizes are generally a bad idea for data summary and meta-analytic purposes. It’s hard to imagine a cumulative science built on standardized effect sizes, rather than on effects expressed in terms of psychologically meaningful units. With that said, I do think standardized effect sizes can be useful for doing power analysis or for defining reasonably informative priors when you don’t have previous experimental data.

Footnotes

^{1} In general, this is really a weighted sum where the variance components due to random slopes must be multiplied by a term that depends on the contrast codes that were used. Because I used contrast codes of -1 and +1, it works out to simply be the sum of the variance components here, which is precisely why I changed the default contrasts before fitting the mixed model. But be aware that for other contrast codes, it won’t simply be the sum of the variance components. For more info, see pp. 20-21 of my dissertation.

^{2} If you actually compute classical Cohen’s d using the square root of the sum of the estimated variance components, you will find that there is a very slight numerical difference between this and the way we computed d in the earlier section (0.2498 vs. 0.2504). These two computations are technically slightly different, although they estimate the same quantity and should be asymptotically equivalent. In practice the numerical differences are negligible, and it is usually easier to compute d the previous way, that is, without having to fit a mixed model.

Researchers often feel uneasy about using power analysis to design their actual experiments because of uncertainty about the effect size in the study to be run. A common sentiment that one hears goes something like:

“I can’t do a power analysis because I have no idea what the effect size is. If I knew the effect size, I wouldn't have to run the study in the first place!”

The implication of this view is that, unless one has actually done experiments in the past that are pretty similar to the one being considered, there is otherwise no justifiable basis for making any particular assumptions about the effect size in the present study. In order to have a good idea about the effect size, the argument goes, we have to actually run the study, at which point obviously the power analysis is no longer needed. Convinced by this reasoning, many researchers throw up their hands, decide that power analysis will not be useful here or perhaps ever, and just plan instead on collecting some loosely conventional sample size that depends on their research area, but is usually something like 20-30 observations per cell of the design. In other words, they fight the power.

I’m here to convince you that fighting the power is a self-defeating research habit.

The first premise of the argument against power analysis is that we know little or nothing about the effect size before the study has been run. On the contrary. In the year 2015 we can benefit from decades of meta-analyses that have summarized the typical effect sizes found in almost any imaginable corner of the research literature. We even have meta-meta-analyses of those meta-analyses. The effect size in your future study is likely to resemble the effect sizes of the past, and luckily for us, the meta-analytic data on typical effect sizes are vast.

I want to illustrate just how good our situation really is by considering what is probably our worst case scenario in terms of study design: The case where we know *absolutely nothing whatsoever* about the study to be run except that the topic matter could broadly be classified as “social psychology” or some related field. In that case, we can use the data from Richard, Bond, and Stokes-Zoota (2003), who conducted a meta-analysis of meta-analyses in the field of social psychology to determine the range of typical effect sizes across the entire field, involving some 25,000 individual studies published over 100 years in diverse research areas. While the focus of this meta-meta-analysis was the field of social psychology, I believe there is little reason to expect the distribution of typical effect sizes to be appreciably different in other areas of psychology, such as cognitive psychology (if you are aware of large-scale meta-analytic data to the contrary, please let me know). Anyway, the figure below summarizes the distribution of effect sizes that they found.

Their meta-analysis actually examined the effects on the Pearson’s *r* (correlation) scale, and the bumpy density curve in the left panel shows their aggregated data (copied/pasted from their Figure 1). The smooth curve overlaying that data is the best-fitting beta distribution^{1}, on which the percentiles and other statistics are based, and the curve in the right panel is based on applying a standard conversion formula to the smooth curve in the left panel^{2}.

What this shows is that, *in the absence of any other information about the study you are about to run*, a pretty reasonable assumption about the effect size is that it is equal to the historical average: *r *= 0.21 or *d* = 0.45. Or you could use the median, or be conservative and go with the 30th percentile, or whatever you want. The point is, we have enough information to make a pretty well-informed decision even if we have no specific information at all about the actual study.

Of course, in most cases in the real world, you probably do know something about the study you are about to run. In almost all cases, that knowledge will allow you to make an even more refined estimate of the effect size, either by finding a meta-analysis that looks more specifically at effects that are conceptually similar to yours (you could even start with Richard et al., who helpfully break down the average effect size in social psychology by broad research area), or just by starting with the aggregate historical estimate and adjusting from there based on how you think your study differs from the average study of the past.

The argument that opened this post pointed out that we don’t know the effect size before the study has been run. That’s true, but of course, we don’t know the effect size after the study has been run either. Instead what we have is some data from which we can construct an estimate of the effect size. Realizing this allows us to ask the quantitative question: Just how good of an effect size estimate do we have at the end of a typically-sized experiment? If our estimate of the effect size after an initial study is not much better than what we could already surmise based on the historical, meta-analytic data, then it doesn’t make a lot of sense to trust the former a lot more than the latter.

Consider a typical study in which we compare two independent groups with n=30 participants per group. Below I’ve simulated some data in which the standardized mean difference between the two groups is exactly equal to the historical average of d = 0.45. The figure below shows a bootstrap sampling distribution of the effect size in this situation^{3}. If we ignore all prior information that we have about typical values of the effect size, as many researchers routinely do, then this sampling distribution summarizes everything we know about the effect size after running a pretty typical study.

Compare this distribution to the right panel of the first Figure from above, which showed our prior knowledge about the likely values of *d*. In terms of how much information they carry about *d*, the two distributions are really not that different. The sampling distribution is slightly less variable—it has a standard deviation of 0.27 rather than 0.37—but this difference in variability is quite hard to see from visual inspection.

Many researchers are hesitant to rely on power analyses based on historical information about *d*, but feel much more comfortable relying on a power analysis based on the sample *d* from an initial study like this. I submit that this doesn’t make a lot of sense because we don’t really have much more information about the likely values of *d* after running the initial study than we already had about *d* before running a single participant. In fact, for small pilot studies, the sampling distribution of *d* might actually be more variable than our prior distribution for *d* based on historical, meta-analytic data. Below is a figure that compares the variability of sample *d* (as a function of total sample size *N*) to the variability of our prior beliefs about *d*. We can see that the two distributions have the same standard deviation at approximately *N*=30. One way to view this is that, in the absence of any other information about the study to be run, we have about as much prior information about *d* as if we had run a pilot study with *N*=30 (and ignored all prior information rather than updating our beliefs about *d* in a Bayesian fashion).

Whether we use historical data or data from previous experiments we have run, there will always be some uncertainty about the effect size. So there are a range of plausible assumptions we could make about the effect size when doing a power analysis, and these different assumptions imply different sample sizes to collect in the study. In many cases, the uncertainty will be pretty high, so that the range of recommended sample sizes will be quite wide, a fact which many researchers find disconcerting.

Uncertainty is a fact of scientific life and should be no cause for dismay. We have all (hopefully) learned to be comfortable with uncertainty in other aspects of the research process. Unfortunately, many researchers seem oddly unwilling to accept even modest uncertainty in the planning phase of the research. In responding to such a view, it’s hard to put it better than @gung did in this answer on Cross Validated:

“Regarding the broader claim that power analyses (a-priori or otherwise) rely on assumptions, it is not clear what to make of that argument. Of course they do. So does everything else. Not running a power analysis, but just gathering an amount of data based on a number you picked out of a hat, and then analyzing your data, will not improve the situation.”

Uncertainty is there whether we like it or not. We should try to make the best design decisions possible in light of that uncertainty. Power analysis is our best tool for doing so. Before I close the post, let me clarify: In my opinion, there is nothing wrong with planning experiments based on rules of thumb. I acknowledge that much of the time it won’t make sense to do a formal power analysis for each and every experiment, because often we won’t have a lot of specific information about the particular study we’re about to run beyond the kind of general information we have about the typical experiments we tend to run. My point is that we should apply statistically well-informed rules of thumb that are based on historical, meta-analytic data, and are calibrated to work pretty well in a range of realistic research situations—not dubious heuristics like an *n*=30 rule. One of the most important functions of power analysis is to help us construct such good rules of thumb.

^{1} For those interested, the parameters of this beta distribution are about [math]\alpha=1.34, \beta=5.03[/math].

^{2} The correct conversion from Pearson’s *r* to Cohen’s *d* depends on the assumed proportion of participants in the two groups. The statistics that I present in the figure are based on the standard formula that assumes the group sizes are equal. I experimented with various ways of relaxing that assumption in a realistic manner, but ultimately found that the difference was negligible unless one assumes the group sizes tend to be markedly and unrealistically unequal.

^{3} The mean shown in the figure is the mean of the bootstrap distribution. This mean is slightly higher than the assumed value of 0.45 because the sampling distribution of d is slightly positively skewed, reflecting the fact that sample d is a slightly positively biased estimate of population d.

^{4} Thanks to Katie Wolsiefer for this figure caption, which is way better than my original.

Some people on twitter (and in my blog comments) remarked that my conclusion appears to fly in the face of some things Uri Simonsohn wrote on a similar topic not too long ago. Briefly, Uri writes of a Study 1 in which there are two independent groups (A1 and B1) and some non-zero effect size, and a Study 2 in which we move to a 2×2 design in which the difference between conditions A1 and B1 is the same size as before, but the difference between the other two conditions A2 and B2 is exactly 0, and we are now testing the interaction effect. Uri concludes: “To obtain the same level of power as in Study 1, Study 2 needs at least twice as many subjects, per cell, as Study 1.” Let’s call this Uri’s 2*n* rule*.

I thought Uri’s post was cool and certainly don’t think my point contradicts the point he made. The important thing to note here is that the effect size that Uri assumes to be constant in both studies is just the A1-B1 difference. But that’s not the effect we’re actually testing in Study 2: we’re testing the interaction effect, i.e., the A1 – B1 – A2 + B2 difference. And there is no assumption that the effect size for the interaction in Study 2 is equal to the effect size for the A1-B1 difference in Study 1. In fact, the 2*n* rule depends on the relevant effect size in Study 2 being *half* the relevant effect size in Study 1. That’s why you must increase the sample size when moving to Study 2; for the situation Uri’s talking about, the relevant effect size gets cut in half in Study 2. In my post I’m talking about cases where the relevant effect size is held constant. If the relevant effect size is held constant, then adding cells to the design has a negligible impact on power.

**Numerical example**

Consider the following table of cell means and standard deviations (the latter in parentheses).

Let’s say there are 20 subjects in each cell. Now if Study 1 involves only groups A1 and B1 (so that there is *N*=40 in total) then the power of the study is 34%. And if Study 2 involves all four groups (so that there is *N*=80 in total), then the power to detect the interaction effect is only 20%. But if we double the sample size (so that there is *n*=40 in each cell and *N*=160 in total), then the power to detect the interaction effect is 34%, just as it was in Study 1. This is the 2*n* rule that Uri wrote about, and I don’t dispute it.

But now let’s look at the standardized effect sizes for the two studies. We use Cohen’s *d*, defined as , where and are the two means being compared and is the pooled standard deviation, or equivalently, the root mean squared error (RMSE). Computing this in Study 1 is straightforward since there are only two groups; we have .

Computing *d* in Study 2 is a little less clear since in that case we have four groups and not two. We saw above that the relevant mean difference is the A1 – B1 – A2 + B2 difference. The key here is to realize that the interaction effect essentially still comes down to a comparison of two groups: We are comparing the A1 and B2 groups (which have coefficients of +1 in this difference score) against the A2 and B1 groups (which have coefficients of -1). So the two relevant means to use in computing *d* are the mean of the A1 and B2 means, and the mean of the A2 and B1 means. This gives us for the interaction effect. In other words, the relevant effect size in Study 2 is half of what the relevant effect size in Study 1 was.

It’s easy to show this symbolically. Let be the effect size in Study 1 and be the effect size in Study 2. Then, starting with the classical definition of *d,*

.

In this example we’ve assumed the difference is 0, so that

.

* Robert Abelson wrote about a similar phenomenon which he called the 42% rule: If the A1-B1 difference just reaches significance at *p*=.05 with size *d*, then the A2-B2 difference has to be at least 42% as large as *d* in the *opposite direction* for the interaction in Study 2 to reach p=.05 with the same sample size.

**The example**

The idea of writing about this came from a recent interaction with a colleague, which interaction I describe here in slightly simplified form. My colleague was reviewing a paper and asked my advice about something the authors wrote concerning the power of the study. The study involved a 2×2×2 between-subjects design with a total sample size of *N*=128, and the authors had remarked in their manuscript that such a study should have statistical power of about 80% to detect a canonical “medium” effect size of Cohen’s *d* = 0.5.

This did not seem right to my colleague: “This seems impossible. I recently read a paper that clearly said that, in a simple two-independent-groups design, one needs about *n*=64 in each group to have 80% power to detect a medium effect size. But in this study the sample size works out to only *n*=16 in each cell! Intuitively it seems like the required sample size for a given power level should increase as the design becomes more complicated, or at least stay the same, but definitely not *decrease*. So their power calculation must be flawed, right?”

The intuition seems reasonable enough when stated this way, but in fact it isn’t true. The problem here is the assumption that the relevant sample size for power purposes is the sample size per cell, small *n*.

**Big N vs. small n**

Mathematically, it’s a little tricky to say that power is a function of *N* rather than *n*. After all, one can write the relevant equations in terms of *N* or in terms of *n*, so in that sense power is a function of whichever one you prefer. The argument here is that power is a much more natural and simple function of *N* than it is a function of *n*, so that rules of thumb based on *N* are far more useful than rules of thumb based on *n*.

One could justify this argument rather formally by looking at the symbolic expression of the noncentrality parameter written in different ways. But really the most straightforward and probably most compelling way to see that it’s true is just to specify two sample size rules, one based on *N* and one based on *n*, and to compare the statistical power resulting from each rule for a few different designs and effect sizes, which is what I’ve done in the table below.

Here I’ve chosen an *N*=128 rule just to be consistent with the example from before, but the general conclusion is clear. Using a rule based on *N*, power is a pretty simple and well-behaved function of the effect size alone, regardless of the particular between-subjects factorial design being considered. Using a rule based on *n*, power remains a more complicated, joint function of the effect size and the factorial structure.

**Final caveat / technical footnote**

Here, for the sake of simplicity, I’ve restricted myself to examining designs where all the factors have 2 levels, sometimes called 2* ^{k}* factorials. In between-subjects factorials where some of the factors have >2 levels, the appropriate sample size rule is slightly more complicated in that it depends on the particular contrast being tested. In these cases, a rule based on the total number of observations that are actually involved in the contrast—which we might call