
Using causal graphs to understand missingness and how to deal with it

If you’re reading this, you probably know that missing data can cause a lot of problems in a data analysis, from reduced efficiency at best to seriously mistaken conclusions at worst. You may even be able to recite the technical definitions of Rubin’s three types of missing data. Those definitions are all well and good, but if you’re like me, you don’t necessarily have such an easy time applying that knowledge to concrete data situations and determining how to proceed.

Well, I just discovered a new tool for reasoning about missingness that I’m pretty excited about, and I’d like to share it with you. Basically the idea is to use causal graphs to graphically represent our assumptions about the patterns of causality among the observed variables and their missingness mechanisms. And then we can visually follow simple path-tracing algorithms along the graph to answer questions like

  • What parameters (such as regression coefficients or population means) can I recover—meaning construct estimates that are consistent with what we would get with no missing data—by just analyzing the non-missing cases (i.e., listwise deletion)? This is called ignorable missingness.
  • For missingness that is ignorable, which auxiliary variables should I condition on (e.g., statistically adjust for or include in an imputation model), and which should I actually avoid conditioning on?

This approach works so well because causal graphs are all about deducing which observed variables should be conditionally independent given which other variables—and it so happens that Rubin’s theory of missing data is naturally phrased in terms of conditional independencies.

Directed Acyclic Graphs (DAGs)

There’s a lot to say about DAGs… frankly too much to fit in one tutorial-style blog post. For a crash course in what they are and how they work, I’d recommend the resources listed here. In the interest of keeping this post as concise as possible, I’ll assume you either have some prior familiarity with DAGs or that you’ve perused some of the crash course material I just referenced, but I will at least provide brief reminders about key concepts as they are needed.

The big idea

The big idea is to augment the usual DAGs by adding nodes that represent the missingness in any variables that contain missing values, creating a so-called m-graph. For example, here’s a DAG for a situation where we have a predictor of interest X, an outcome Y, and two concomitant variables W and Z:

Figure 1: DAG without missingness

The causal model underlying this DAG assumes that

  • X directly causes Y and Z
  • Z and Y are associated through an unobserved confounder1
  • W causes Y independently of X.

Now we suppose that Y and X contain missing values, so we add corresponding missingness nodes R_Y and R_X. Specifically, we create the following m-graph:

Figure 2: An m-graph

What are we saying about the missingness in Y and X when we posit this causal model? We are saying that

  • The missingness in X is directly caused by the (partially observed) values of X itself. An example of this would be if high-income respondents to a survey were less likely to disclose their incomes specifically because they didn’t want their high income to be known. In other words, high values are missing because they are high values. In Rubin’s terminology, X is missing not at random (MNAR).
  • The missingness in Y is statistically associated with the values of Y, but this association is entirely due to the common cause W. An example of this would be if a high level of education (W) both causes income (Y) and causes respondents to (for some reason) be less likely to report their income (R_Y). In Rubin’s terminology, Y is missing at random (MAR) given W.2
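To make these two mechanisms concrete, here is a minimal simulation sketch of the m-graph in Figure 2. Everything numeric in it is an assumption made up for illustration: the coefficients, the missingness probabilities, the cutoff at X > 1, and the choice to make W a binary dummy. The arrows are my reading of the prose description above. The point is only that Y and R_Y are associated overall, yet unrelated once we look within a level of W.

```python
import random

rng = random.Random(0)
n = 100_000

rows = []
for _ in range(n):
    w = 1 if rng.random() < 0.5 else 0        # W: fully observed cause of Y (binary here, an assumption)
    x = rng.gauss(0, 1)                       # X: predictor of interest
    latent = rng.gauss(0, 1)                  # unobserved common cause behind Z <-> Y
    z = 0.5 * x + latent + rng.gauss(0, 1)    # X -> Z and latent -> Z
    y = 0.5 * x + 1.0 * w + 0.5 * latent + rng.gauss(0, 1)  # X -> Y, W -> Y, latent -> Y

    # Missingness indicators (True means the value is missing).
    r_x = rng.random() < (0.6 if x > 1 else 0.1)   # X -> R_X: MNAR, high values go missing
    r_y = rng.random() < (0.6 if w == 1 else 0.1)  # W -> R_Y: MAR given W

    rows.append({
        "w": w,
        "z": z,
        "x": None if r_x else x,
        "y": None if r_y else y,
        "y_full": y,  # kept only so we can check the simulation; unavailable in real data
    })


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Overall, the observed Y values are not representative of all Y values...
mean_y_full = mean(r["y_full"] for r in rows)
mean_y_observed = mean(r["y"] for r in rows if r["y"] is not None)

# ...but within a level of W, they are.
stratum = [r for r in rows if r["w"] == 1]
mean_y_full_w1 = mean(r["y_full"] for r in stratum)
mean_y_observed_w1 = mean(r["y"] for r in stratum if r["y"] is not None)

print("all cases:", round(mean_y_full, 3), "observed only:", round(mean_y_observed, 3))
print("W=1, all cases:", round(mean_y_full_w1, 3), "W=1, observed only:", round(mean_y_observed_w1, 3))
```

The gap between the first pair of means, and the lack of one in the second pair, is what "MAR given W" looks like in data.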

Now that we’ve laid out our assumptions about how missingness is related to the relevant observed and unobserved variables, we can apply relatively simple graphical criteria to help answer the kinds of questions I laid out near the beginning of this post. These criteria are due to the amazing work of Karthika Mohan and colleagues (see References). Here I’m only going to focus on the conditions for recovering regression coefficients, but this research also lays out criteria for recovering the full joint, conditional, or marginal distributions of all variables in the graph.

Conditions for recovering regression coefficients

We have separate necessary and sufficient conditions for recovering the coefficients from a regression predicting Y. A necessary condition is:

  1. Y (the outcome) cannot directly cause R_Y (or vice versa, although that would be strange).

We can see in the m-graph that Condition 1 is satisfied. Although Y and R_Y are statistically related (through their common cause W), Y does not directly cause R_Y. The predictor X does directly cause its missingness node R_X, but this doesn’t matter for estimating a regression of Y on X. To give an intuition about why we can have MNAR in X but not Y, consider this figure from Daniel et al. (2012):

Figure 3: We can have MNAR in the predictor A, but not the outcome Y. (From Daniel et al., 2012)

For simplicity, the figure considers a crude missing data mechanism where the observation is simply removed if it exceeds some fixed value (i.e., truncated). In the top panel (a), the missingness depends directly on the value of the predictor A. Despite this, we can still recover the slope simply by analyzing the complete cases. In the bottom panel (b), the missingness depends directly on the value of the outcome Y. As the figure shows, this has a distorting influence on the slope which precludes the consistent recovery of the regression coefficient.
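Here is a small sketch of that same truncation idea, in case you want to see it in numbers rather than in a picture. The true slope of 0.5, the cutoff of 0.5, and the sample size are all arbitrary values I picked for illustration, and `ols_slope` is just a hand-rolled helper, not anything from the cited paper.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)
true_slope = 0.5   # assumed value for the illustration
cutoff = 0.5       # assumed truncation point
n = 100_000

a = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * ai + rng.gauss(0, 1) for ai in a]
pairs = list(zip(a, y))

# Panel (a): a case is dropped when the predictor A exceeds the cutoff.
kept_on_a = [(ai, yi) for ai, yi in pairs if ai <= cutoff]
# Panel (b): a case is dropped when the outcome Y exceeds the cutoff.
kept_on_y = [(ai, yi) for ai, yi in pairs if yi <= cutoff]

slope_full = ols_slope(a, y)
slope_truncated_on_a = ols_slope(*zip(*kept_on_a))
slope_truncated_on_y = ols_slope(*zip(*kept_on_y))

print("full data:        ", round(slope_full, 3))
print("truncated on A:   ", round(slope_truncated_on_a, 3))
print("truncated on Y:   ", round(slope_truncated_on_y, 3))
```

Truncating on the predictor throws away information but leaves the slope where it was, while truncating on the outcome flattens it.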

Meeting the necessary Condition 1 alone does not guarantee that we can recover the regression coefficient of interest; it just says that it might be possible. A sufficient condition for recoverability of the regression of Y on p predictors X_1, X_2, \ldots, X_p is:

  2. Y is d-separated from the missingness nodes R_Y, R_{X1}, R_{X2}, \ldots, R_{Xp} by the predictors X_1, X_2, \ldots, X_p.

Recall that two nodes A and B are d-separated by C if conditioning on C blocks every path between A and B (keeping in mind that colliders act “in reverse”: a collider blocks a path unless it, or one of its descendants, is conditioned on). If Condition 2 is satisfied, then we can recover the regression of Y on X_1, X_2, \ldots, X_p simply by analyzing the non-missing observations at hand.

Glancing at the m-graph in Figure 2 we can see that the regression of Y on X does not meet Condition 2, because Y and R_Y remain d-connected through W, which is not in the set of predictors.

So basically we have two options for how to proceed. The first and easier option is to consider instead a regression of Y on both X and W. Essentially, we decide that rather than seeking \beta_{YX}, the simple regression coefficient of Y on X, we will settle instead for \beta_{YX.W}, the partial regression coefficient that adjusts for W. In that case, Condition 2 is satisfied because this new predictor W d-separates Y from R_Y. So we could recover the coefficients from this new multiple regression simply by analyzing the non-missing observations at hand.
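If you would rather not trace paths by eye, the Condition 2 check can be sketched in a few lines. This is a minimal illustration, not a vetted implementation: the edge list is my transcription of Figure 2 from the prose description (with a node "L" standing in for the unobserved confounder behind Z and Y), and the function names are made up. It tests d-separation with the standard trick of taking the ancestral subgraph, "marrying" parents that share a child, dropping arrow directions, deleting the conditioning set, and checking whether the two nodes are still connected.

```python
from itertools import combinations

# Assumed edges of the m-graph in Figure 2 (cause, effect).
EDGES = [
    ("X", "Y"), ("X", "Z"), ("L", "Z"), ("L", "Y"),
    ("W", "Y"), ("W", "R_Y"), ("X", "R_X"),
]


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, a, b, given):
    """True if a and b are d-separated by the set `given` in the DAG."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)

    # 1. Keep only a, b, the conditioning set, and their ancestors.
    keep = ancestors(parents, {a, b} | set(given))

    # 2. Moralize: link each child to its parents, and co-parents to each other.
    neighbours = {node: set() for node in keep}
    for child in keep:
        child_parents = parents.get(child, set())
        for parent in child_parents:
            neighbours[parent].add(child)
            neighbours[child].add(parent)
        for p, q in combinations(child_parents, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)

    # 3. Search from a to b without passing through the conditioning set.
    blocked = set(given)
    seen, stack = set(), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(neighbours[node])
    return True


def satisfies_condition_2(predictors, outcome="Y", edges=EDGES):
    """Condition 2: the predictors d-separate Y from R_Y and each R_X."""
    missingness_nodes = ["R_" + outcome] + ["R_" + p for p in predictors]
    present = {node for edge in edges for node in edge}
    return all(
        d_separated(edges, outcome, r, predictors)
        for r in missingness_nodes
        if r in present  # fully observed variables have no missingness node
    )


print("Y ~ X     :", satisfies_condition_2(["X"]))
print("Y ~ X + W :", satisfies_condition_2(["X", "W"]))
```

Adding the edge ("X", "W") to the list gives the variant of the graph where X causes W, which is handy for experimenting with other scenarios.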

The second option is for when we really do want \beta_{YX} and not \beta_{YX.W}. In the graph from Figures 1 and 2, there’s not really any good reason for preferring one over the other, since the graph implies that X and W are independent in the population and thus \beta_{YX} = \beta_{YX.W}.3 So in that case we may as well just estimate \beta_{YX.W}, since that would satisfy Condition 2. But what if the situation were slightly different, say, with X being a cause of W?

Figure 4: W is an effect of X.

In this new graph, the path X \rightarrow W \rightarrow Y is part of the total causal effect of X on Y, so we don’t want to condition on W. In this case, instead of Condition 2 we consider a new condition that allows for a set of q auxiliary variables A_1, A_2, \ldots, A_q that help d-separate Y from the relevant missingness nodes, but that are not conditioned on in the regression of Y on the predictors X_1, X_2, \ldots, X_p. The (sufficient) condition is:

  3. Y is d-separated from the missingness nodes R_Y, R_{X1}, R_{X2}, \ldots, R_{Xp}, R_{A1}, R_{A2}, \ldots, R_{Aq} by the set of predictors and auxiliary variables X_1, X_2, \ldots, X_p, A_1, A_2, \ldots, A_q. Furthermore, for any auxiliary variable A_i that contains missing values, we can recover the coefficient from regressing A_i on the predictors X_1, X_2, \ldots, X_p by Condition 2 or by a recursive application of Condition 3.

If Condition 3 is satisfied, then \beta_{YX} is recoverable by constructing an estimate from (a) the regression of Y on the predictors and auxiliary variables and (b) the regressions of each auxiliary variable on the predictors. For example, in the simple case where the structural equation for Y is linear and additive in X and W, we can use the well-known fact that

\underbrace{\beta_{YX}}_{\text{total effect}} = \underbrace{\beta_{YX.W}}_{\text{direct effect}} + \underbrace{\beta_{WX}\beta_{YW.X}}_{\text{indirect effect}}

In the present case, conditioning on the auxiliary variable W satisfies Condition 3 because (a) W d-separates Y from R_Y, and (b) W contains no missing values.
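As a rough sketch of what this procedure looks like in practice, the simulation below generates data from a linear version of the graph in Figure 4, deletes Y values with a probability that depends on W, and then compares the naive complete-case estimate of \beta_{YX} with the one assembled from the pieces in the formula above. The coefficient values (0.5, 0.8, 0.7) and the missingness probabilities are invented for the example, and the two OLS helpers are hand-rolled for self-containment.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def ols_two_predictors(x1, x2, ys):
    """Partial slopes (b1, b2) from regressing ys on x1 and x2 with an intercept."""
    n = len(ys)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(ys) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (y - my) for a, y in zip(x1, ys))
    s2y = sum((b - m2) * (y - my) for b, y in zip(x2, ys))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(0)
n = 100_000
direct, w_on_x, y_on_w = 0.5, 0.8, 0.7     # assumed structural coefficients
true_total = direct + w_on_x * y_on_w      # total effect of X on Y

x, w, y, y_observed = [], [], [], []
for _ in range(n):
    xi = rng.gauss(0, 1)
    wi = w_on_x * xi + rng.gauss(0, 1)                 # X -> W
    yi = direct * xi + y_on_w * wi + rng.gauss(0, 1)   # X -> Y, W -> Y
    missing = rng.random() < (0.9 if wi > 0 else 0.1)  # W -> R_Y
    x.append(xi); w.append(wi); y.append(yi); y_observed.append(not missing)

# Complete cases: rows where Y is observed (X and W are never missing here).
cc = [i for i in range(n) if y_observed[i]]
x_cc, w_cc, y_cc = [x[i] for i in cc], [w[i] for i in cc], [y[i] for i in cc]

# Naive: simple regression of Y on X among the complete cases.
naive_total = ols_slope(x_cc, y_cc)

# Condition 3: (a) Y on X and W among complete cases, (b) W on X using all cases.
b_yx_w, b_yw_x = ols_two_predictors(x_cc, w_cc, y_cc)
b_wx = ols_slope(x, w)
recovered_total = b_yx_w + b_wx * b_yw_x

print("true total effect:   ", round(true_total, 3))
print("naive complete-case: ", round(naive_total, 3))
print("via Condition 3:     ", round(recovered_total, 3))
```

Note that step (b) uses every case, not just the complete ones, because W and X are fully observed in this setup.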

Necessity, sufficiency, and functional forms

So now let’s return to the m-graph in Figure 2, where X does not cause W. We see that if we do not condition on W by adjusting for it in the regression model, then we still meet the necessary condition for recoverability (Condition 1), but we fail to meet the sufficient conditions for recoverability (Conditions 2 and 3). So we fall in a sort of in-between case… is the estimate of \beta_{YX} recoverable or not? (Note that we saw above that it is recoverable if we apply the procedure dictated by Condition 3, but here we are asking if we can “directly” recover the estimate just from the simple regression of Y on X among the non-missing cases.)

It’s impossible to have complete necessary-and-sufficient conditions in this situation because \beta_{YX} may or may not be directly recoverable, depending on the functional forms of the structural equations. Remember that a DAG is totally nonparametric in that the structural equations it specifies can have any functional form and the variables can have any distributions (within reason). So for the DAG in Figures 1 and 2, the regression of Y on X and W could be a linear, additive function of X and W, or an interactive/multiplicative function, or any crazy nonlinear function. It turns out that, in this particular case, \beta_{YX} is directly recoverable if the effects of X and W on Y are linear and additive, but is not directly recoverable if the regression involves an X \times W interaction. This figure should help to give some intuition about why that’s the case:

Figure 5: The additive and interactive models have the same DAG, but \beta_{YX} is only directly recoverable in the additive model. The W=0 group (black) has no missing data, while the W=1 group (red) has 50% missing data. Open circles represent missing data points.

For simplicity, Figure 5 considers an example where the auxiliary variable W is a binary dummy variable. In the additive example, simply analyzing the non-missing cases yields the same estimate (on average) as if all the data were fully observed—whether or not we condition on W. In the interactive example, the group with more missing data has a greater slope, so analyzing the non-missing cases skews the total estimate toward the group with the smaller slope. But by conditioning on W, we can either (a) recover the conditional effect of X on Y given W (through Condition 2) simply by analyzing the non-missing cases, or (b) recover the marginal effect of X on Y (through Condition 3) by taking a weighted mean of the regression lines at W=0 and W=1, with the weights proportional to the number of cases in each group prior to missingness (which we know, despite the missingness, since the missing values still have partial records on the other observed variables).
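The sketch below mimics the setup of Figure 5 under assumptions of my own choosing: W is a fair coin flip, X is standard normal and independent of W, the slope of Y on X is 1 when W=0, and the `interaction` argument adds to that slope when W=1. Half of the Y values in the W=1 group are deleted at random. None of these numbers come from the figure itself.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(interaction, n=100_000, seed=0):
    """Return (naive, weighted) estimates of the marginal slope of Y on X."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(0, 1)
        y = x + 0.5 * w + interaction * w * x + rng.gauss(0, 1)
        # W=0 group: Y always observed. W=1 group: Y observed half the time.
        observed = (w == 0) or (rng.random() < 0.5)
        cases.append((w, x, y, observed))

    complete = [c for c in cases if c[3]]

    # Naive: pool all complete cases and ignore W.
    naive = ols_slope([c[1] for c in complete], [c[2] for c in complete])

    # Weighted: fit each W group separately on its complete cases, then
    # weight by the group's share of ALL cases (W is always observed).
    weighted = 0.0
    for group in (0, 1):
        in_group = [c for c in complete if c[0] == group]
        slope = ols_slope([c[1] for c in in_group], [c[2] for c in in_group])
        share = sum(1 for c in cases if c[0] == group) / n
        weighted += share * slope
    return naive, weighted


naive_additive, weighted_additive = simulate(interaction=0.0)
naive_interactive, weighted_interactive = simulate(interaction=1.0)

# With these settings the full-data marginal slope is 1 + 0.5 * interaction.
print("additive:    naive", round(naive_additive, 3), "weighted", round(weighted_additive, 3))
print("interactive: naive", round(naive_interactive, 3), "weighted", round(weighted_interactive, 3))
```

In the additive run both estimates land near the full-data slope, while in the interactive run only the weighted one does. Averaging the slopes this way matches the marginal slope here because X has the same distribution in both W groups, as the graph implies.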

Concluding remarks

While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic. It’s often assumed that having data missing not at random (MNAR) basically trashes your analysis if a non-trivial fraction of data are missing, as is frequently the case in observational data. But the necessary and sufficient conditions laid out here suggest that, in fact, the simple strategy of listwise deletion (simply “ignoring” the missingness and analyzing the non-missing observations that are at hand) yields robust estimates under a pretty wide range of missingness mechanisms. And even in a lot of cases where you can’t just ignore the missing data problem, you can often still construct a consistent estimate of your parameter of interest (as in Condition 3) without needing to use fancy procedures like multiple imputation or distributional modeling of the predictors. Even further, you may still be able to recover the parameter estimates of interest even if you can’t satisfy the sufficient conditions given here (as in the final example above). Finally, and more anecdotally, simulations I played around with while preparing this post suggest that, even when the regression coefficient of interest is not technically recoverable, the magnitude of bias is often small under many realistic conditions (see, for example, the magnitude of bias in the simulations of Thoemmes & Mohan, 2015).

Footnotes and References

1 Recall that in DAGs, double-headed arrows like A \leftrightarrow B are just shorthand notation for A \leftarrow L \rightarrow B, indicating an unobserved common cause L.

2 Rubin’s third type of missingness, missing completely at random (MCAR), would be represented graphically by the relevant missingness node R being completely disconnected from all other nodes in the graph. An example of this would be if we flipped a coin for each observation and deleted the corresponding value when the coin landed heads. Although commonly assumed for convenience, MCAR is not actually very common in practice unless it is deliberately built into the study design.

3 Actually, while this is true in the special case of classical regression, it’s not true in general. See this later blog post of mine that discusses the fact that this property doesn’t hold in logistic regression.

Daniel, R. M., Kenward, M. G., Cousens, S. N., & De Stavola, B. L. (2012). Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21(3), 243-256.

Mohan, K., Pearl, J., & Tian, J. (2013). Graphical models for inference with missing data. In Advances in Neural Information Processing Systems (pp. 1277-1285).

Mohan, K., & Pearl, J. (2014). Graphical models for recovering probabilistic and causal queries from missing data. In Advances in Neural Information Processing Systems (pp. 1520-1528).

Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in missing data problems. Multivariate Behavioral Research, 49(5), 443-459.

Thoemmes, F., & Mohan, K. (2015). Graphical representation of missing data problems. Structural Equation Modeling: A Multidisciplinary Journal, 22(4), 631-642.