
Logistic regression is not fucked

Summary: Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to. 

A few months ago I wrote a blog post on using causal graphs to understand missingness and how to deal with it, which concluded on a rather positive note:

“While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic…”

Well, today I write about another statistical issue with a basically optimistic message. So I decided this could be like the second installment in a “some things are not fucked” series, in which I look at a few issues/methods that some people have claimed are fucked, but which are in fact not fucked.1

In this installment we consider logistic regression — or, more generally, Binomial regression (including logit and probit regression).2

Wait, who said logistic regression was fucked?

The two most widely-read papers sounding the alarm bells seem to be Allison (1999) and Mood (2010). The alleged problems are stated most starkly by Mood (2010, pp. 67-68):

  1. “It is problematic to interpret [logistic regression coefficients] as substantive effects, because they also reflect unobserved heterogeneity.
  2. It is problematic to compare [logistic regression coefficients] across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.
  3. It is problematic to compare [logistic regression coefficients] across samples, across groups within samples, or over time—even when we use models with the same independent variables—because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.”

These are pretty serious allegations.3  These concerns have convinced some people to abandon logistic regression in favor of the so-called linear probability model, which just means using classical regression directly on the binary outcome, although usually with heteroskedasticity-robust standard errors. To the extent that’s a bad idea—which, to be fair, is a matter of debate, but is probably ill-advised as a default method at the very least—it’s important that we set the record straight.

The allegations refer to “unobserved heterogeneity.” What exactly is this unobserved heterogeneity and where does it come from? There are basically two underlying lines of argument here that we must address. Both arguments lead to a similar conclusion—and previous sources have sometimes been a bit unclear by drawing from both arguments more or less interchangeably in order to reach this conclusion—but they rely on fundamentally distinct premises, so we must clearly distinguish these arguments and address them separately. The counterarguments I describe below are very much in the same spirit as those of Kuha and Mills (2017).

First argument: Heteroskedasticity in the latent outcome

A standard way of motivating the probit model for binary outcomes (e.g., from Wikipedia) is the following. We have an unobserved/latent outcome variable Y^* that is normally distributed, conditional on the predictor X. Specifically, the model for the ith observation is

Y_i^* = \beta_0^* + \beta_1^*X_i + \varepsilon_i\text{, with }\varepsilon_i \sim \text{Normal}(0, \sigma^2).

The latent variable Y^* is subjected to a thresholding process, so that the discrete outcome we actually observe is

Y_i = \begin{cases} 1, & \text{if}\ Y_i^* \ge T \\ 0, & \text{if}\ Y_i^* < T\end{cases}.

This leads the probability of Y=1 given X to take the form of a Normal CDF, with mean and standard deviation a function of \beta_1^*, \sigma^2, and the deterministic threshold T. So the probit model is basically motivated as a way of estimating \beta_1^* from this latent regression of Y^* on X, although on a different scale. This latent variable interpretation is illustrated in the plot below, from Thissen & Orlando (2001). These authors are technically discussing the normal ogive model from item response theory, which looks pretty much like probit regression for our purposes.

Note the differences in notation: these authors use \theta in place of X, \gamma in place of T, u in place of Y, \mu in place of Y^*, and they write probabilities with T() instead of the usual P() or Pr().
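To spell out the Normal CDF form mentioned above, here is a short derivation from the latent model and the threshold rule. It assumes only what the model already states (normal errors with constant \sigma^2 and a fixed threshold T); \Phi denotes the standard normal CDF, a symbol not used elsewhere in this post.

\text{Pr}(Y_i=1|X_i=x) = \text{Pr}(\beta_0^* + \beta_1^*x + \varepsilon_i \ge T) = \text{Pr}\left(\frac{\varepsilon_i}{\sigma} \ge \frac{T - \beta_0^* - \beta_1^*x}{\sigma}\right) = \Phi\left(\frac{\beta_0^* - T}{\sigma} + \frac{\beta_1^*}{\sigma}x\right).

So the probit intercept and slope correspond to (\beta_0^* - T)/\sigma and \beta_1^*/\sigma. The binary data only pin down these ratios, not \beta_1^*, \sigma, and T separately, which is why the latent slope is recovered "on a different scale."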

We can interpret logistic regression in pretty much exactly the same way. The only difference is that now the unobserved continuous Y^* follows not a normal distribution, but a similarly bell-shaped logistic distribution given X. A theoretical argument for why Y^* might follow a logistic distribution rather than a normal distribution is not so clear, but since the resulting logistic curve looks essentially the same as the normal CDF for practical purposes (after some rescaling), it won’t tend to matter much in practice which model you use. The point is that both models have a fairly straightforward interpretation involving a continuous latent variable Y^* and an unobserved, deterministic threshold T.

So the first argument for logistic regression being fucked due to “unobserved heterogeneity” comes from asking: What if the residual variance \sigma^2 is not constant, as assumed in the model above, but instead is different at different values of X? Well, it turns out that, as far as our observable binary outcome Y is concerned, this heteroskedasticity can be totally indistinguishable from the typically assumed situation where \sigma^2 is constant and Y^* is increasing (or decreasing) with X. This is illustrated in Figure 2 below.

Figure 2.

Suppose we are comparing the proportions of positive responses (i.e., Y=1) between two groups of observations, Group 1 and Group 2. To give this at least a little bit of context, maybe Group 2 are human subjects of some social intervention that attempts to increase participation in local elections, Group 1 is a control group, the observed binary Y is whether the person voted in a recent election, and the latent continuous Y^* is some underlying “propensity to vote.” Now we observe a voting rate of 10% in the control group (Group 1) and 25% in the experimental group (Group 2). In terms of our latent variable model, we’d typically assume this puts us in Scenario A from Figure 2: The intervention increased people’s “propensity to vote” (Y^*) on average, which pushed more people over the threshold T in Group 2 than in Group 1, which led to a greater proportion of voters in Group 2.

The problem is that these observed voting proportions can be explained equally well by assuming that the intervention had 0 effect on the mean propensity to vote, but instead just led to greater variance in the propensities to vote. As illustrated in Scenario B of Figure 2, this could just as well have led the voting proportion to increase from 10% in the control group to 25% in the experimental group, and it’s a drastically different (and probably less appealing) conceptual interpretation of the results.

Another possibility that would fit the data equally well (but isn’t discussed as often) is that the intervention had no effect at all on the distribution of Y^*, but instead just lowered the threshold for Group 2, so that even people with a lower “propensity to vote” were able to drag themselves to the polls. This is illustrated in Scenario C of Figure 2.
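The three scenarios can be checked numerically. The sketch below is only an illustration under assumed values: the control group's latent Y^* is taken to be standard normal, and the names `shift`, `scale`, and `new_T` are hypothetical labels for the change each scenario needs in order to move the voting rate from 10% to 25%.

```python
from statistics import NormalDist

std = NormalDist(0.0, 1.0)

# Control group (Group 1): Y* ~ Normal(0, 1), threshold chosen so that 10% vote.
T = std.inv_cdf(1 - 0.10)
z25 = std.inv_cdf(1 - 0.25)  # standard-normal cut point that leaves 25% above it


def vote_rate(mean, sd, threshold):
    """Pr(Y* >= threshold) when Y* ~ Normal(mean, sd)."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


shift = T - z25   # Scenario A: raise the mean of Y*, same variance and threshold
scale = T / z25   # Scenario B: inflate the SD of Y*, same mean and threshold
new_T = z25       # Scenario C: lower the threshold, same distribution of Y*

rates = {
    "control": vote_rate(0.0, 1.0, T),
    "A (mean shift)": vote_rate(shift, 1.0, T),
    "B (more variance)": vote_rate(0.0, scale, T),
    "C (lower threshold)": vote_rate(0.0, 1.0, new_T),
}

for label, rate in rates.items():
    print(f"{label:20s} voting rate = {rate:.3f}")
```

All three experimental-group scenarios produce the same observed voting rate, so a single binary outcome cannot tell them apart.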

So you can probably see how this supports the three allegations cited earlier. When we observe a non-zero estimate for a logistic regression coefficient, we can’t be sure this actually reflects a shift in the mean of the underlying continuous latent variable (e.g., increased propensity to vote), because it also reflects latent heteroskedasticity, and we can’t tell these two explanations apart. And because the degree of heteroskedasticity could easily differ between models, between samples, or over time, even comparing logistic regression coefficients to one another is problematic…if shifts in the underlying mean are what we care about.

Are shifts in an underlying mean what we care about?

There’s the crux of the matter. This entire first line of argument presupposes that all of the following are true for the case at hand:

  1. It makes any conceptual sense to think of the observed binary Y as arising from a latent continuous Y^* and a deterministic threshold.
  2. We actually care how much of the observed effect on Y is due to mean shifts in Y^* vs. changes in the variance of Y^*.
  3. We’ve observed only a single binary indicator of Y^*, so that Scenarios A and B from Figure 2 are empirically indistinguishable.

In my experience, usually at least one of these is false. For example, if the observed binary Y indicates survival of patients in a medical trial, what exactly would an underlying Y^* represent? It could make sense for the patients who survived—maybe it represents their general health or something—but surely all patients with Y=0 are equally dead! Returning to the voting example, we can probably grant that #1 is true: it probably does make conceptual sense to think about an underlying, continuous “propensity to vote.” But #2 is probably false: I couldn’t care less if the social intervention increased voting by increasing propensity to vote, spreading out the distribution of voting propensities, or just altering the threshold that turns the propensity into voting behavior… I just want people to vote!

Finally, when #1 and #2 are true, so that the investigator is primarily interested not in the observed Y but rather in some underlying latent Y^*, in my experience the investigator will usually have taken care to collect data on multiple binary indicators of Y^*—in other words, #3 will be false. For example, if I were interested in studying an abstract Y^* like “political engagement,” I would certainly view voting as a binary indicator of that, but I would also try to use data on things like whether that person donated money to political campaigns, whether they attended any political conventions, and so on. And when there are multiple binary indicators of Y^*, it then becomes possible to empirically distinguish Scenario A from Scenario B in Figure 2, using, for example, statistical methods from item response theory.

These counterarguments are not to say that this first line of argument is invalid or irrelevant. The premises do lead to the conclusion, and there are certainly situations where those premises are true. If you find yourself in one of those situations, where #1-#3 are all true, then you do need to heed the warnings of Allison (1999) and Mood (2010). The point of these counterarguments is to say that, far more often than not, at least one of the premises listed above will be false. And in those cases, logistic regression is not fucked.

Okay, great. But we’re not out of the woods yet. As I mentioned earlier, there’s a second line of argument that leads us to essentially the same conclusions, but that makes no reference whatsoever to a continuous latent Y^*.

Second argument: Omitted non-confounders in logistic regression

To frame the second argument, first think back to the classical regression model with two predictors, a focal predictor X and some covariate C:

Y_i = \beta^A_0 + \beta^A_1X_i + \beta^A_2C_i + \varepsilon^A_i.

Now suppose we haven’t observed the covariate C, so that the regression model we actually estimate is

Y_i = \beta^B_0 + \beta^B_1X_i + \varepsilon^B_i.

In the special case where C is uncorrelated with X, we know that \beta^B_1 =\beta^A_1, so that our estimate of the slope for X will, on average, be the same either way. The technical name for this property is collapsibility: classical regression coefficients are said to be collapsible over uncorrelated covariates.

It turns out that logistic regression coefficients do not have this collapsibility property. If a covariate that’s correlated with the binary outcome is omitted from the logistic regression equation, then the slopes for the remaining observed predictors will be affected, even if the omitted covariate is uncorrelated with the observed predictors. Specifically, in the case of omitting an uncorrelated covariate, the observed slopes will be driven toward 0 to some extent.

This is all illustrated below in Figure 3, where the covariate C is shown as a binary variable (color = red vs. blue) for the sake of simplicity.

Figure 3. In the classical regression case, the simple/unconditional regression line (black) has the same slope as the group-specific regression lines (red and blue). In the logistic regression case, they are not equal: the simple/unconditional regression line is much more shallow than the group-specific regression lines. In both cases there is no confounding: the predictor X and the grouping factor (color = red vs. blue) have zero correlation, that is, X has the same mean among the red points and the blue points.
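The contrast described in the Figure 3 caption can be reproduced with a few lines of arithmetic. This is a minimal sketch with made-up coefficients (`b0`, `b1`, `b2`) and a binary covariate C that is independent of X with Pr(C=1) = 0.5; none of these numbers come from the figure itself. For simplicity it compares X=0 with X=1 rather than fitting a curve, since the marginal relationship is not exactly logistic in X.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical conditional model: logit Pr(Y=1 | X, C) = b0 + b1*X + b2*C
b0, b1, b2 = -2.5, 2.0, 3.0
p_c = 0.5  # Pr(C = 1), the same at every value of X (no confounding)


def marginal_prob(x):
    """Pr(Y=1 | X=x), averaging the conditional probabilities over C."""
    return (1 - p_c) * expit(b0 + b1 * x) + p_c * expit(b0 + b1 * x + b2)


def marginal_mean(x):
    """Linear-model analogue: E[Y | X=x] with the same coefficients."""
    return (1 - p_c) * (b0 + b1 * x) + p_c * (b0 + b1 * x + b2)


conditional_slope = b1
logistic_marginal_slope = logit(marginal_prob(1)) - logit(marginal_prob(0))
linear_marginal_slope = marginal_mean(1) - marginal_mean(0)

print("conditional slope (both models):", conditional_slope)
print("marginal slope, linear model:   ", linear_marginal_slope)
print("marginal slope, logistic model: ", round(logistic_marginal_slope, 3))
```

With these values the linear marginal slope equals the conditional slope of 2, while the logistic marginal log-odds difference is about 1.25, even though C is unrelated to X.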

So now we can lay out the second line of argument against logistic regression. Actually, the most impactful way to communicate the argument is not to list out the premises, but instead to use a sort of statistical intuition pump. Consider the data in the right-hand panel of Figure 3. The slope (logistic regression coefficient) of X on Y is, let’s say, \beta=2 for both the red group and the blue group. But suppose the color grouping factor is not observed, so that we can only fit the simple/unconditional logistic regression that ignores the color groups. Because of the non-collapsibility of logistic regression coefficients, the slope from this regression (shown in black in Figure 3) is shallower, say, \beta=1. But if the slope is \beta=2 among both the red and the blue points, and if every point is either red or blue, then who exactly does this \beta=1 slope apply to? What is the substantive interpretation of this slope?

For virtually every logistic regression model that we estimate in the real world, there will be some uncorrelated covariates that are statistically associated with the binary outcome, but that we couldn’t observe to include in the model. In other words, there’s always unobserved heterogeneity in our data on covariates we couldn’t measure. But then—the argument goes—how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?

These are rhetorical questions. The implication is that no meaningful interpretation is possible—or, as Mood (2010, p. 67) puts it, “it is problematic to interpret [logistic regression coefficients] as substantive effects.” I beg to differ. As I argue next, we can interpret logistic regression coefficients perfectly well even in the face of non-collapsibility.

Logistic regression coefficients are about conditional probabilities

Specifically, we can write logistic regression models directly analogous to models A and B from above as:

f\big(\text{Pr}(Y_i=1|X_i=x,C_i=c)\big) = \beta^A_0 + \beta^A_1x + \beta^A_2c,

f\big(\text{Pr}(Y_i=1|X_i=x)\big) = \beta^B_0 + \beta^B_1x,

where f(\cdot) is the logit link function. As the left-hand-sides of these regression equations make clear, \beta^A_1 tells us about differences in the probability of Y as X increases conditional on the covariate C being fixed at some value c, while \beta^B_1 tells us about differences in the probability of Y as X increases marginal over C. There is no reason to expect these two things to coincide in general unless \text{Pr}(Y|X,C)=\text{Pr}(Y|X), which we know from probability theory is only true when Y and C are conditionally independent given X—in terms of our model, when \beta^A_2=0.

So now let’s return to the red vs. blue example of Figure 3. We supposed, for illustration’s sake, a slope of \beta^B_1=1 overall, ignoring the red vs. blue grouping. Then the first rhetorical question from before asked, “who exactly does this \beta^B_1=1 slope apply to?” The answer is that it applies to a population in which we know the X values but we don’t know the C values, that is, we don’t know the color of any of the data points. There’s an intuition that if \beta^A_1=2 among both the red and blue points, then for any new point whose color we don’t know, we ought to guess that the slope that applies to them is also 2. But that presupposes that we were able to estimate slopes among both the red and blue groups, which would imply that we did observe the colors of at least some of the points. On the contrary, let me repeat: the \beta^B_1=1 slope applies to a population in which we know the X values but we don’t know any of the C values. Put more formally, the \beta^B_1=1 slope refers to changes in \text{Pr}(Y|X) = \text{E}_C\big[\text{Pr}(Y|X,C)\big]; there is an intuition that these probabilities ought to equal \text{Pr}\big(Y|X,C=\text{E}[C]\big), but the two are not the same because the latter still requires conditioning on C.
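The gap between averaging over C and plugging in the average of C is easy to see numerically. The sketch below uses invented coefficients and a binary C with Pr(C=1) = 0.5, purely for illustration; the function names are hypothetical.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical conditional model: logit Pr(Y=1 | X, C) = b0 + b1*X + b2*C
b0, b1, b2 = -2.5, 2.0, 3.0
p_c = 0.5  # Pr(C = 1), so E[C] = 0.5


def averaged_over_c(x):
    """E_C[Pr(Y=1 | X=x, C)]: what the marginal model describes."""
    return (1 - p_c) * expit(b0 + b1 * x) + p_c * expit(b0 + b1 * x + b2)


def at_mean_of_c(x):
    """Pr(Y=1 | X=x, C=E[C]): still a conditional probability."""
    return expit(b0 + b1 * x + b2 * p_c)


for x in (0, 1):
    print(f"x={x}: averaged over C = {averaged_over_c(x):.3f}, "
          f"at C = E[C] = {at_mean_of_c(x):.3f}")
```

Because the logistic function is nonlinear, the two columns differ at both values of X, which is the non-collapsibility in miniature.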

The second rhetorical question from above asked, “how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?” The answer is that we interpret them conditional on all and only the covariates that were included in the model. Again, conceptually speaking, the coefficients refer to a population in which we know the values of the covariates represented in the model and nothing more. There’s no problem with comparing these coefficients between samples or over time as long as these coefficients refer to the same population, that is, populations where the same sets of covariates are observed.

As for comparing coefficients between models with different covariates? Here we must agree with Mood and Allison that, in most cases, these comparisons are probably not informative. But this is not because of “unobserved heterogeneity.” It’s because these coefficients refer to different populations of units. In terms of models A and B from above, \beta^A_1 and \beta^B_1 represent completely different conceptual quantities and it’s a mistake to view estimates of \beta^B_1 as somehow being deficient estimates of \beta^A_1. As a more general rule, parameters from different models usually mean different things—compare them at your peril. In the logistic regression case, there may be situations where it makes sense to compare estimates of \beta^A_1 with estimates of \beta^B_1, but not because one thinks they ought to be estimating the same quantity.

Footnotes and References

1 Or which, at least, are not fucked for the given reasons, although they could still be fucked for unrelated reasons.
2 This stuff is also true for some survival analysis models, notably Cox regression.
3 At least, I think they are… a definition of “substantive effects” is never given (are they like causal effects?), but presumably they’re something we want in an interpretation.

Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2), 186-208.

Kuha, J., & Mills, C. (2017). On group comparisons with logistic regression models. Sociological Methods & Research, 0049124117747306.

Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.

Pang, M., Kaufman, J. S., & Platt, R. W. (2013). Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Statistical Methods in Medical Research, 25(5), 1925-1937.

Rohwer, G. (2012). Estimating effects with logit models. NEPS Working Paper 10, German National Educational Panel Study, University of Bamberg.

Thissen, D. & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & Wainer, H. (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Using causal graphs to understand missingness and how to deal with it

If you’re reading this, you probably know that missing data can cause a lot of problems in a data analysis, from reduced efficiency at best to seriously mistaken conclusions at worst. You may even be able to recite the technical definitions of Rubin’s three types of missing data. Those definitions are all well and good, but if you’re like me, you don’t necessarily have such an easy time applying that knowledge to concrete data situations and determining how to proceed.

Well, I just discovered a new tool for reasoning about missingness that I’m pretty excited about, and I’d like to share it with you. Basically the idea is to use causal graphs to graphically represent our assumptions about the patterns of causality among the observed variables and their missingness mechanisms. And then we can visually follow simple path-tracing algorithms along the graph to answer questions like

  • What parameters (such as regression coefficients or population means) can I recover—meaning construct estimates that are consistent with what we would get with no missing data—by just analyzing the non-missing cases (i.e., listwise deletion)? This is called ignorable missingness.
  • For missingness that is ignorable, which auxiliary variables should I condition on (e.g., statistically adjust for or include in an imputation model), and which should I actually avoid conditioning on?

This approach works so well because causal graphs are all about deducing which observed variables should be conditionally independent given which other variables—and it so happens that Rubin’s theory of missing data is naturally phrased in terms of conditional independencies.

Directed Acyclic Graphs (DAGs)

There’s a lot to say about DAGs… frankly too much to fit in one tutorial-style blog post. For a crash course in what they are and how they work, I’d recommend the resources listed here. In the interest of keeping this post as concise as possible, I’ll assume you either have some prior familiarity with DAGs or that you’ve perused some of the crash course material I just referenced, but I will at least provide brief reminders about key concepts as they are needed.

The big idea

The big idea is to augment the usual DAGs by adding nodes that represent the missingness in any variables that contain missing values, creating a so-called m-graph. For example, here’s a DAG for a situation where we have a predictor of interest X, an outcome Y, and two concomitant variables W and Z:

Figure 1: DAG without missingness

The causal model underlying this DAG assumes that

  • X directly causes Y and Z
  • Z and Y are associated through an unobserved confounder1
  • W causes Y independently of X.

Now we suppose that Y and X contain missing values, so we add corresponding missingness nodes R_Y and R_X. Specifically, we create the following m-graph:

Figure 2: An m-graph

What are we saying about the missingness in Y and X when we posit this causal model? We are saying that

  • The missingness in X is directly caused by the (partially observed) values of X itself. An example of this would be if high-income respondents to a survey were less likely to disclose their incomes specifically because they didn’t want their high income to be known. In other words, high values are missing because they are high values. In Rubin’s terminology, X is missing not at random (MNAR).
  • The missingness in Y is statistically associated with the values of Y, but this association is entirely due to the common cause W. An example of this would be if a high level of education (W) both causes income (Y) and causes respondents to (for some reason) be less likely to report their income (R_Y). In Rubin’s terminology, Y is missing at random (MAR) given W.2

Now that we’ve laid out our assumptions about how missingness is related to the relevant observed and unobserved variables, we can apply relatively simple graphical criteria to help answer the kinds of questions I laid out near the beginning of this post. These criteria are due to the amazing work of Karthika Mohan and colleagues (see References). Here I’m only going to focus on the conditions for recovering regression coefficients, but this research also lays out criteria for recovering the full joint, conditional, or marginal distributions of all variables in the graph.

Conditions for recovering regression coefficients

We have separate necessary and sufficient conditions for recovering the coefficients from a regression predicting Y. A necessary condition is:

  1. Y (the outcome) cannot directly cause R_Y (or vice versa, although that would be strange).

We can see in the m-graph that Condition 1 is satisfied. Although Y and R_Y are statistically related (through their common cause W), Y does not directly cause R_Y. The predictor X does directly cause its missingness node R_X, but this doesn’t matter for estimating a regression of Y on X. To give an intuition about why we can have MNAR in X but not Y, consider this figure from Daniel et al. (2012):

Figure 3: We can have MNAR in the predictor A, but not the outcome Y. (From Daniel et al., 2012)

For simplicity, the figure considers a crude missing data mechanism where the observation is simply removed if it exceeds some fixed value (i.e., truncated). In the top panel (a), the missingness depends directly on the value of the predictor A. Despite this, we can still recover the slope simply by analyzing the complete cases. In the bottom panel (b), the missingness depends directly on the value of the outcome Y. As the figure shows, this has a distorting influence on the slope which precludes the consistent recovery of the regression coefficient.
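A small simulation can mimic the two panels. This is a rough sketch, not a reproduction of the figure from Daniel et al. (2012): the true slope of 1, the noise level, the cutoff of 0.5, and the sample size are all arbitrary choices made for illustration, and `x` plays the role of the predictor A.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


random.seed(2)
n = 20000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]  # true slope is 1

cutoff = 0.5  # observations above this value are dropped (truncation)

# Panel (a): missingness depends on the predictor.
kept_a = [(x, y) for x, y in zip(xs, ys) if x <= cutoff]
slope_trunc_x = slope([p[0] for p in kept_a], [p[1] for p in kept_a])

# Panel (b): missingness depends on the outcome.
kept_b = [(x, y) for x, y in zip(xs, ys) if y <= cutoff]
slope_trunc_y = slope([p[0] for p in kept_b], [p[1] for p in kept_b])

print("full data slope:        ", round(slope(xs, ys), 3))
print("truncated on predictor: ", round(slope_trunc_x, 3))
print("truncated on outcome:   ", round(slope_trunc_y, 3))
```

Truncating on the predictor should leave the complete-case slope close to 1, while truncating on the outcome should pull it noticeably toward 0.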

Meeting the necessary Condition 1 alone does not guarantee that we can recover the regression coefficient of interest; it just says that it might be possible. A sufficient condition for recoverability of the regression of Y on p predictors X_1, X_2, \ldots, X_p is:

  2. Y is d-separated from the missingness nodes R_Y, R_{X1}, R_{X2}, \ldots, R_{Xp} by the predictors X_1, X_2, \ldots, X_p.

Recall that two nodes A and B are d-separated by C if conditioning on C blocks all open paths between A and B (keeping in mind that colliders act “in reverse”: colliders block open paths unless they are conditioned on). If Condition 2 is satisfied, then we can recover the regression of Y on X_1, X_2, \ldots, X_p simply by analyzing the non-missing observations at hand.

Glancing at the m-graph in Figure 2 we can see that the regression of Y on X does not meet Condition 2, because Y and R_Y remain d-connected through W, which is not in the set of predictors.

So basically we have two options for how to proceed. The first and easier option is to consider instead a regression of Y on both X and W. Essentially, we decide that rather than seeking \beta_{YX}, the simple regression coefficient of Y on X, we will settle instead for \beta_{YX.W}, the partial regression coefficient that adjusts for W. In that case, Condition 2 is satisfied because this new predictor W d-separates Y from R_Y. So we could recover the coefficients from this new multiple regression simply by analyzing the non-missing observations at hand.
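For readers who would rather check the d-separation claims mechanically, here is a small sketch. The edge list is my reading of the m-graph as described in the text (the figure itself is not reproduced here), with a hypothetical latent node `L` standing in for the unobserved confounder of Z and Y. The helper `d_separated` is written from scratch using the standard ancestral moral graph construction; it is not taken from any library.

```python
from itertools import combinations

# Assumed edges of the m-graph in Figure 2 (parent, child).
EDGES = [
    ("X", "Y"), ("X", "Z"), ("W", "Y"),
    ("L", "Z"), ("L", "Y"),          # L: unobserved confounder of Z and Y
    ("X", "R_X"), ("W", "R_Y"),      # missingness mechanisms
]


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, a, b, given):
    """True if `given` d-separates every node in `a` from every node in `b`."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    keep = ancestors(parents, set(a) | set(b) | set(given))
    # Moralize the ancestral subgraph: link each child to its parents,
    # link parents that share a child, and drop edge directions.
    adj = {node: set() for node in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Remove the conditioning set and test whether `a` can still reach `b`.
    blocked = set(given)
    frontier = [node for node in a if node not in blocked]
    reached = set(frontier)
    while frontier:
        node = frontier.pop()
        for nxt in adj[node]:
            if nxt not in blocked and nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    return not (reached & set(b))


simple_ok = d_separated(EDGES, {"Y"}, {"R_Y", "R_X"}, {"X"})
adjusted_ok = d_separated(EDGES, {"Y"}, {"R_Y", "R_X"}, {"X", "W"})
print("Condition 2 for Y ~ X:    ", simple_ok)
print("Condition 2 for Y ~ X + W:", adjusted_ok)
```

Under the assumed edge list, the first check fails because of the open path from Y through W to R_Y, and the second passes once W is in the conditioning set.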

The second option is for when we really do want \beta_{YX} and not \beta_{YX.W}. In the graph from Figures 1 and 2, there’s not really any good reason for preferring one over the other, since the graph implies that X and W are independent in the population and thus \beta_{YX} = \beta_{YX.W}. So in that case we may as well just estimate \beta_{YX.W}, since that would satisfy Condition 2. But what if the situation were slightly different, say, with X being a cause of W?

Figure 4: W is an effect of X.

In this new graph, the path X \rightarrow W \rightarrow Y is part of the total causal effect of X on Y, so we don’t want to condition on W. In this case, instead of Condition 2 we consider a new condition that allows for a set of q auxiliary variables A_1, A_2, \ldots, A_q that help d-separate Y from the relevant missingness nodes, but that are not conditioned on in the regression of Y on the predictors X_1, X_2, \ldots, X_p. The (sufficient) condition is:

  3. Y is d-separated from the missingness nodes R_Y, R_{X1}, R_{X2}, \ldots, R_{Xp}, R_{A1}, R_{A2}, \ldots, R_{Aq} by the set of predictors and auxiliary variables X_1, X_2, \ldots, X_p, A_1, A_2, \ldots, A_q. Furthermore, for any auxiliary variable A_i that contains missing values, we can recover the coefficient from regressing A_i on the predictors X_1, X_2, \ldots, X_p by Condition 2 or by a recursive application of Condition 3.

If Condition 3 is satisfied, then \beta_{YX} is recoverable by constructing an estimate from (a) the regression of Y on the predictors and auxiliary variables and (b) the regressions of each auxiliary variable on the predictors. For example, in the simple case where the structural equation for Y is linear and additive in X and W, we can use the well-known fact that

\underbrace{\beta_{YX}}_{\text{total effect}} = \underbrace{\beta_{YX.W}}_{\text{direct effect}} + \underbrace{\beta_{WX}\beta_{YW.X}}_{\text{indirect effect}}

In the present case, conditioning on the auxiliary variable W satisfies Condition 3 because (a) W d-separates Y from R_Y, and (b) W contains no missing values.
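The decomposition of the total effect into direct and indirect parts is an algebraic identity of least-squares coefficients, so it holds exactly in any sample. The sketch below checks it on simulated data; the data-generating values (0.5, 1.0, 2.0) and the variable names are arbitrary assumptions for illustration only.

```python
import random


def cov(u, v):
    """Sample covariance (dividing by n; the scaling cancels in the slopes)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n


random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
w = [0.5 * xi + random.gauss(0, 1) for xi in x]             # X causes W
y = [1.0 * xi + 2.0 * wi + random.gauss(0, 1) for xi, wi in zip(x, w)]

# Simple regression slopes.
b_yx = cov(x, y) / cov(x, x)   # total effect of X on Y
b_wx = cov(x, w) / cov(x, x)   # effect of X on W

# Partial slopes from the 2x2 normal equations of Y on X and W.
det = cov(x, x) * cov(w, w) - cov(x, w) ** 2
b_yx_w = (cov(x, y) * cov(w, w) - cov(w, y) * cov(x, w)) / det  # direct effect
b_yw_x = (cov(w, y) * cov(x, x) - cov(x, y) * cov(x, w)) / det

print("total effect:      ", round(b_yx, 4))
print("direct + indirect: ", round(b_yx_w + b_wx * b_yw_x, 4))
```

In a missing-data setting the point is that each piece on the right-hand side can be estimated from a regression that satisfies the recoverability conditions, and then combined.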

Necessity, sufficiency, and functional forms

So now let’s return to the m-graph in Figure 2, where X does not cause W. We see that if we do not condition on W by adjusting for it in the regression model, then we still meet the necessary condition for recoverability (Condition 1), but we fail to meet the sufficient conditions for recoverability (Conditions 2 and 3). So we fall in a sort of in-between case… is the estimate of \beta_{YX} recoverable or not? (Note that we saw above that it is recoverable if we apply the procedure dictated by Condition 3, but here we are asking if we can “directly” recover the estimate just from the simple regression of Y on X among the non-missing cases.)

It’s impossible to have complete necessary-and-sufficient conditions in this situation because \beta_{YX} may or may not be directly recoverable, depending on the functional forms of the structural equations. Remember that a DAG is totally nonparametric in that the structural equations it specifies can have any functional form and the variables can have any distributions (within reason). So for the DAG in Figures 1 and 2, the regression of Y on X and W could be a linear, additive function of X and W, or an interactive/multiplicative function, or any crazy nonlinear function. It turns out that, in this particular case, \beta_{YX} is directly recoverable if the effects of X and W on Y are linear and additive, but is not directly recoverable if the regression involves an X \times W interaction. This figure should help to give some intuition about why that’s the case:

Figure 5: The additive and interactive models have the same DAG, but \beta_{YX} is only directly recoverable in the additive model. The W=0 group (black) has no missing data, while the W=1 group (red) has 50% missing data. Open circles represent missing data points.

For simplicity, Figure 5 considers an example where the auxiliary variable W is a binary dummy variable. In the additive example, simply analyzing the non-missing cases yields the same estimate (on average) as if all the data were fully observed—whether or not we condition on W. In the interactive example, the group with more missing data has a greater slope, so analyzing the non-missing cases skews the total estimate toward the group with the smaller slope. But by conditioning on W, we can either (a) recover the conditional effect of X on Y given W (through Condition 2) simply by analyzing the non-missing cases, or (b) recover the marginal effect of X on Y (through Condition 3) by taking a weighted mean of the regression lines at W=0 and W=1, with the weights proportional to the number of cases in each group prior to missingness (which we know, despite the missingness, since the missing values still have partial records on the other observed variables).
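The interactive case can be sketched with a quick simulation. Everything here is an assumption chosen to resemble Figure 5 loosely: slopes of 1 (W=0) and 3 (W=1), equal group sizes, X independent of W, and 50% of the outcomes missing in the W=1 group. The names `cc_slope` and `recovered` are hypothetical labels.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


random.seed(1)
n = 20000
rows = []  # (w, x, y, observed)
for _ in range(n):
    w = 1 if random.random() < 0.5 else 0
    x = random.gauss(0, 1)
    y = (1.0 + 2.0 * w) * x + random.gauss(0, 1)   # X-by-W interaction
    observed = (w == 0) or (random.random() < 0.5)  # 50% missing when W=1
    rows.append((w, x, y, observed))

full_slope = slope([r[1] for r in rows], [r[2] for r in rows])

# Naive approach: simple regression of Y on X among complete cases.
cc = [r for r in rows if r[3]]
cc_slope = slope([r[1] for r in cc], [r[2] for r in cc])

# Condition 3 approach: group-specific slopes from complete cases, weighted
# by group sizes *before* missingness (W is fully observed).
recovered = 0.0
for g in (0, 1):
    grp = [r for r in cc if r[0] == g]
    weight = sum(1 for r in rows if r[0] == g) / n
    recovered += weight * slope([r[1] for r in grp], [r[2] for r in grp])

print("slope with no missing data:", round(full_slope, 3))
print("naive complete-case slope: ", round(cc_slope, 3))
print("weighted group slopes:     ", round(recovered, 3))
```

The naive complete-case slope should sit below the full-data slope, pulled toward the W=0 group, while the weighted combination should land close to it. Averaging slopes works here because X has the same distribution in both groups; in general one would average the fitted lines themselves.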

Concluding remarks

While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic. It’s often assumed that having data missing not at random (MNAR) basically trashes your analysis if a non-trivial fraction of data are missing, which is frequently true in observational data. But the necessary and sufficient conditions laid out here suggest that, in fact, the simple strategy of listwise deletion (simply “ignoring” the missingness and analyzing the non-missing observations that are at hand) yields robust estimates under a pretty wide range of missingness mechanisms. And even in a lot of cases where you can’t just ignore the missing data problem, you can often still construct a consistent estimate of your parameter of interest (as in Condition 3) without needing to use fancy procedures like multiple imputation or distributional modeling of the predictors. Even further, you may still be able to recover the parameter estimates of interest even if you can’t satisfy the sufficient conditions given here (as in the final example above). Finally, and more anecdotally, simulations I played around with while preparing this post suggest that, even when the regression coefficient of interest is not technically recoverable, the magnitude of bias is often small under many realistic conditions (see, for example, the magnitude of bias in the simulations of Thoemmes & Mohan, 2015).

References and footnotes

Daniel, R. M., Kenward, M. G., Cousens, S. N., & De Stavola, B. L. (2012). Using causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical Research, 21(3), 243-256.

Mohan, K., Pearl, J., & Tian, J. (2013). Graphical models for inference with missing data. In Advances in neural information processing systems (pp. 1277-1285).

Mohan, K., & Pearl, J. (2014). Graphical models for recovering probabilistic and causal queries from missing data. In Advances in Neural Information Processing Systems (pp. 1520-1528).

Thoemmes, F., & Rose, N. (2014). A cautious note on auxiliary variables that can increase bias in missing data problems. Multivariate Behavioral Research, 49(5), 443-459.

Thoemmes, F., & Mohan, K. (2015). Graphical representation of missing data problems. Structural Equation Modeling: A Multidisciplinary Journal, 22(4), 631-642.

1 Recall that in DAGs, double-headed arrows like A \leftrightarrow B are just shorthand notation for A \leftarrow L \rightarrow B, indicating an unobserved common cause L.

2 Rubin’s third type of missingness, missing completely at random (MCAR), would be represented graphically by the relevant missingness node R being completely disconnected from all other nodes in the graph. An example of this would be if we flipped a coin for each observation and deleted the corresponding value when the coin landed heads. Although commonly assumed for convenience, MCAR is not actually very common in practice unless it is deliberately built into the study design.