Category Archives: classification

Logistic regression is not fucked

Summary: Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to. 

A few months ago I wrote a blog post on using causal graphs to understand missingness and how to deal with it, which concluded on a rather positive note:

“While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic…”

Well, today I write about another statistical issue with a basically optimistic message. So I decided this could be like the second installment in a “some things are not fucked” series, in which I look at a few issues/methods that some people have claimed are fucked, but which are in fact not fucked.1

In this installment we consider logistic regression — or, more generally, Binomial regression (including logit and probit regression).2

Wait, who said logistic regression was fucked?

The two most widely-read papers sounding the alarm bells seem to be Allison (1999) and Mood (2010). The alleged problems are stated most starkly by Mood (2010, pp. 67-68):

  1. “It is problematic to interpret [logistic regression coefficients] as substantive effects, because they also reflect unobserved heterogeneity.
  2. It is problematic to compare [logistic regression coefficients] across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.
  3. It is problematic to compare [logistic regression coefficients] across samples, across groups within samples, or over time—even when we use models with the same independent variables—because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.”

These are pretty serious allegations.3  These concerns have convinced some people to abandon logistic regression in favor of the so-called linear probability model, which just means using classical regression directly on the binary outcome, although usually with heteroskedasticity-robust standard errors. To the extent that’s a bad idea—which, to be fair, is a matter of debate, but is probably ill-advised as a default method at the very least—it’s important that we set the record straight.

The allegations refer to “unobserved heterogeneity.” What exactly is this unobserved heterogeneity and where does it come from? There are basically two underlying lines of argument here that we must address. Both arguments lead to a similar conclusion—and previous sources have sometimes been a bit unclear by drawing from both arguments more or less interchangeably in order to reach this conclusion—but they rely on fundamentally distinct premises, so we must clearly distinguish these arguments and address them separately. The counterarguments I describe below are very much in the same spirit as those of Kuha and Mills (2017).

First argument: Heteroskedasticity in the latent outcome

A standard way of motivating the probit model for binary outcomes (e.g., from Wikipedia) is the following. We have an unobserved/latent outcome variable Y^* that is normally distributed, conditional on the predictor X. Specifically, the model for the ith observation is

Y_i^* = \beta_0^* + \beta_1^*X_i + \varepsilon_i\text{, with }\varepsilon_i \sim \text{Normal}(0, \sigma^2).

The latent variable Y^* is subjected to a thresholding process, so that the discrete outcome we actually observe is

Y_i = \begin{cases} 1, & \text{if}\ Y_i^* \ge T \\ 0, & \text{if}\ Y_i^* < T\end{cases}.

This leads the probability of Y=1 given X=x to take the form of a Normal CDF, specifically \text{Pr}(Y=1|X=x) = \Phi\big((\beta_0^* + \beta_1^*x - T)/\sigma\big), where T is the deterministic threshold. So the probit model is basically motivated as a way of estimating \beta_1^* from this latent regression of Y^* on X, although on a different scale. This latent variable interpretation is illustrated in the plot below, from Thissen & Orlando (2001). These authors are technically discussing the normal ogive model from item response theory, which looks pretty much like probit regression for our purposes.

Note the differences in notation: these authors use \theta in place of X, \gamma in place of T, u in place of Y, \mu in place of Y^*, and they write probabilities with T() instead of the usual P() or Pr().

We can interpret logistic regression in pretty much exactly the same way. The only difference is that now the unobserved continuous Y^* follows not a normal distribution, but a similarly bell-shaped logistic distribution given X. A theoretical argument for why Y^* might follow a logistic distribution rather than a normal distribution is not so clear, but since the resulting logistic curve looks essentially the same as the normal CDF for practical purposes (after some rescaling), it won’t tend to matter much in practice which model you use. The point is that both models have a fairly straightforward interpretation involving a continuous latent variable Y^* and an unobserved, deterministic threshold T.
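If you want to see the threshold story in action, it only takes a quick simulation. The sketch below (my own illustration, with made-up parameter values) thresholds draws of a latent normal Y^* at T and checks that the resulting proportion of Y=1 matches the closed-form probit probability \Phi\big((\beta_0^* + \beta_1^*x - T)/\sigma\big):

```python
import random
from statistics import NormalDist

random.seed(1)
b0, b1, sigma, T = 0.0, 1.0, 1.0, 0.5   # made-up latent-model parameters
x = 1.2                                  # one fixed value of the predictor

# simulate the latent outcome Y* and threshold it at T
draws = [(b0 + b1 * x + random.gauss(0, sigma)) >= T for _ in range(200_000)]
p_sim = sum(draws) / len(draws)

# closed-form probit probability implied by the same latent model
p_probit = NormalDist().cdf((b0 + b1 * x - T) / sigma)
```

Swapping the normal draws for logistic ones would give the logistic-regression version of the same story.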

So the first argument for logistic regression being fucked due to “unobserved heterogeneity” comes from asking: What if the residual variance \sigma^2 is not constant, as assumed in the model above, but instead is different at different values of X? Well, it turns out that, as far as our observable binary outcome Y is concerned, this heteroskedasticity can be totally indistinguishable from the typically assumed situation where \sigma^2 is constant and Y^* is increasing (or decreasing) with X. This is illustrated in Figure 2 below.

Figure 2.

Suppose we are comparing the proportions of positive responses (i.e., Y=1) between two groups of observations, Group 1 and Group 2. To give this at least a little bit of context, maybe Group 2 are human subjects of some social intervention that attempts to increase participation in local elections, Group 1 is a control group, the observed binary Y is whether the person voted in a recent election, and the latent continuous Y^* is some underlying “propensity to vote.” Now we observe a voting rate of 10% in the control group (Group 1) and 25% in the experimental group (Group 2). In terms of our latent variable model, we’d typically assume this puts us in Scenario A from Figure 2: The intervention increased people’s “propensity to vote” (Y^*) on average, which pushed more people over the threshold T in Group 2 than in Group 1, which led to a greater proportion of voters in Group 2.

The problem is that these observed voting proportions can be explained equally well by assuming that the intervention had 0 effect on the mean propensity to vote, but instead just led to greater variance in the propensities to vote. As illustrated in Scenario B of Figure 2, this could just as well have led the voting proportion to increase from 10% in the control group to 25% in the experimental group, and it’s a drastically different (and probably less appealing) conceptual interpretation of the results.

Another possibility that would fit the data equally well (but isn’t discussed as often) is that the intervention had no effect at all on the distribution of Y^*, but instead just lowered the threshold for Group 2, so that even people with a lower “propensity to vote” were able to drag themselves to the polls. This is illustrated in Scenario C of Figure 2.
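To make the three scenarios concrete, here is a small numeric check using the 10%/25% voting rates from the example. It is a sketch that fixes Group 1's latent distribution at standard normal and back-solves the parameter values; nothing here comes from real data:

```python
from statistics import NormalDist

std = NormalDist()                   # Group 1 latent distribution: N(0, 1)
T = std.inv_cdf(0.90)                # threshold putting 10% of Group 1 above it

mu_A = T - std.inv_cdf(0.75)         # Scenario A: mean shift, sd stays 1
sd_B = T / std.inv_cdf(0.75)         # Scenario B: mean stays 0, sd inflated
T_C = std.inv_cdf(0.75)              # Scenario C: same distribution, lower threshold

p1 = 1 - std.cdf(T)                            # control group rate: 0.10
p_A = 1 - NormalDist(mu_A, 1.0).cdf(T)         # Scenario A rate: 0.25
p_B = 1 - NormalDist(0.0, sd_B).cdf(T)         # Scenario B rate: 0.25
p_C = 1 - std.cdf(T_C)                         # Scenario C rate: 0.25
```

All three scenarios reproduce the observed 10% and 25% exactly, so the observed proportions alone cannot distinguish them.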

So you can probably see how this supports the three allegations cited earlier. When we observe a non-zero estimate for a logistic regression coefficient, we can’t be sure this actually reflects a shift in the mean of the underlying continuous latent variable (e.g., increased propensity to vote), because it also reflects latent heteroskedasticity, and we can’t tell these two explanations apart. And because the degree of heteroskedasticity could easily differ between models, between samples, or over time, even comparing logistic regression coefficients to one another is problematic…if shifts in the underlying mean are what we care about.

Are shifts in an underlying mean what we care about?

There’s the crux of the matter. This entire first line of argument presupposes that all of the following are true for the case at hand:

  1. It makes any conceptual sense to think of the observed binary Y as arising from a latent continuous Y^* and a deterministic threshold.
  2. We actually care how much of the observed effect on Y is due to mean shifts in Y^* vs. changes in the variance of Y^*.
  3. We’ve observed only a single binary indicator of Y^*, so that Scenarios A and B from Figure 2 are empirically indistinguishable.

In my experience, usually at least one of these is false. For example, if the observed binary Y indicates survival of patients in a medical trial, what exactly would an underlying Y^* represent? It could make sense for the patients who survived—maybe it represents their general health or something—but surely all patients with Y=0 are equally dead! Returning to the voting example, we can probably grant that #1 is true: it probably does make conceptual sense to think about an underlying, continuous “propensity to vote.” But #2 is probably false: I couldn’t care less if the social intervention increased voting by increasing propensity to vote, spreading out the distribution of voting propensities, or just altering the threshold that turns the propensity into voting behavior… I just want people to vote!

Finally, when #1 and #2 are true, so that the investigator is primarily interested not in the observed Y but rather in some underlying latent Y^*, in my experience the investigator will usually have taken care to collect data on multiple binary indicators of Y^*—in other words, #3 will be false. For example, if I were interested in studying an abstract Y^* like “political engagement,” I would certainly view voting as a binary indicator of that, but I would also try to use data on things like whether that person donated money to political campaigns, whether they attended any political conventions, and so on. And when there are multiple binary indicators of Y^*, it then becomes possible to empirically distinguish Scenario A from Scenario B in Figure 2, using, for example, statistical methods from item response theory.

These counterarguments are not to say that this first line of argument is invalid or irrelevant. The premises do lead to the conclusion, and there are certainly situations where those premises are true. If you find yourself in one of those situations, where #1-#3 are all true, then you do need to heed the warnings of Allison (1999) and Mood (2010). The point of these counterarguments is to say that, far more often than not, at least one of the premises listed above will be false. And in those cases, logistic regression is not fucked.

Okay, great. But we’re not out of the woods yet. As I mentioned earlier, there’s a second line of argument that leads us to essentially the same conclusions, but that makes no reference whatsoever to a continuous latent Y^*.

Second argument: Omitted non-confounders in logistic regression

To frame the second argument, first think back to the classical regression model with two predictors, a focal predictor X and some covariate C:

Y_i = \beta^A_0 + \beta^A_1X_i + \beta^A_2C_i + \varepsilon^A_i.

Now suppose we haven’t observed the covariate C, so that the regression model we actually estimate is

Y_i = \beta^B_0 + \beta^B_1X_i + \varepsilon^B_i.

In the special case where C is uncorrelated with X, we know that \beta^B_1 =\beta^A_1, so that our estimate of the slope for X will, on average, be the same either way. The technical name for this property is collapsibility: classical regression coefficients are said to be collapsible over uncorrelated covariates.

It turns out that logistic regression coefficients do not have this collapsibility property. If a covariate that’s correlated with the binary outcome is omitted from the logistic regression equation, then the slopes for the remaining observed predictors will be affected, even if the omitted covariate is uncorrelated with the observed predictors. Specifically, in the case of omitting an uncorrelated covariate, the observed slopes will be driven toward 0 to some extent.

This is all illustrated below in Figure 3, where the covariate C is shown as a binary variable (color = red vs. blue) for the sake of simplicity.

Figure 3. In the classical regression case, the simple/unconditional regression line (black) has the same slope as the group-specific regression lines (red and blue). In the logistic regression case, they are not equal: the simple/unconditional regression line is much more shallow than the group-specific regression lines. In both cases there is no confounding: the predictor X and the grouping factor (color = red vs. blue) have zero correlation, that is, X has the same mean among the red points and the blue points.
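The contrast in Figure 3 is easy to reproduce numerically. The sketch below (my own simulation with arbitrary coefficient values, using a hand-rolled Newton-Raphson fit to keep it dependency-light) checks that the OLS slope for X is unchanged when the uncorrelated covariate C is dropped, while the logistic slope shrinks toward 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
c = rng.integers(0, 2, size=n).astype(float)   # binary covariate, independent of x

X_full = np.column_stack([np.ones(n), x, c])
X_red = np.column_stack([np.ones(n), x])

# Classical regression: omitting the uncorrelated C leaves the X slope intact
y_lin = 1.0 * x + 2.0 * c + rng.normal(size=n)
b_full = np.linalg.lstsq(X_full, y_lin, rcond=None)[0]
b_red = np.linalg.lstsq(X_red, y_lin, rcond=None)[0]

def fit_logit(X, y, iters=25):
    """Logistic regression via Newton-Raphson maximum likelihood."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        b = b + np.linalg.solve(hess, grad)
    return b

# Logistic regression: omitting the uncorrelated C attenuates the X slope
eta = -1.0 + 2.0 * x + 2.0 * c                  # true conditional slope for X is 2
y_bin = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)
bb_full = fit_logit(X_full, y_bin)
bb_red = fit_logit(X_red, y_bin)
```

With these values both OLS fits recover a slope near 1 for X, while the logistic fit recovers the conditional slope of 2 only when C is included; the marginal fit lands noticeably below 2, mirroring the shallower black curve in the figure.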

So now we can lay out the second line of argument against logistic regression. Actually, the most impactful way to communicate the argument is not to list out the premises, but instead to use a sort of statistical intuition pump. Consider the data in the right-hand panel of Figure 3. The slope (logistic regression coefficient) of X on Y is, let’s say, \beta=2 for both the red group and the blue group. But suppose the color grouping factor is not observed, so that we can only fit the simple/unconditional logistic regression that ignores the color groups. Because of the non-collapsibility of logistic regression coefficients, the slope from this regression (shown in black in Figure 3) is shallower, say, \beta=1. But if the slope is \beta=2 among both the red and the blue points, and if every point is either red or blue, then who exactly does this \beta=1 slope apply to? What is the substantive interpretation of this slope?

For virtually every logistic regression model that we estimate in the real world, there will be some uncorrelated covariates that are statistically associated with the binary outcome, but that we couldn’t observe to include in the model. In other words, there’s always unobserved heterogeneity in our data on covariates we couldn’t measure. But then—the argument goes—how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?

These are rhetorical questions. The implication is that no meaningful interpretation is possible—or, as Mood (2010, p. 67) puts it, “it is problematic to interpret [logistic regression coefficients] as substantive effects.” I beg to differ. As I argue next, we can interpret logistic regression coefficients perfectly well even in the face of non-collapsibility.

Logistic regression coefficients are about conditional probabilities

Specifically, we can write logistic regression models directly analogous to models A and B from above as:

f\big(\text{Pr}(Y_i=1|X_i=x,C_i=c)\big) = \beta^A_0 + \beta^A_1x + \beta^A_2c,

f\big(\text{Pr}(Y_i=1|X_i=x)\big) = \beta^B_0 + \beta^B_1x,

where f(\cdot) is the logit link function. As the left-hand sides of these regression equations make clear, \beta^A_1 tells us about differences in the probability of Y as X increases conditional on the covariate C being fixed at some value c, while \beta^B_1 tells us about differences in the probability of Y as X increases marginal over C. There is no reason to expect these two things to coincide in general unless \text{Pr}(Y|X,C)=\text{Pr}(Y|X), which we know from probability theory is only true when Y and C are conditionally independent given X—in terms of our model, when \beta^A_2=0.

So now let’s return to the red vs. blue example of Figure 3. We supposed, for illustration’s sake, a slope of \beta^B_1=1 overall, ignoring the red vs. blue grouping. Then the first rhetorical question from before asked, “who exactly does this \beta^B_1=1 slope apply to?” The answer is that it applies to a population in which we know the X values but we don’t know the C values, that is, we don’t know the color of any of the data points. There’s an intuition that if \beta^A_1=2 among both the red and blue points, then for any new point whose color we don’t know, we ought to guess that the slope that applies to them is also 2. But that presupposes that we were able to estimate slopes among both the red and blue groups, which would imply that we did observe the colors of at least some of the points. On the contrary, let me repeat: the \beta^B_1=1 slope applies to a population in which we know the X values but we don’t know any of the C values. Put more formally, the \beta^B_1=1 slope refers to changes in \text{Pr}(Y|X) = \text{E}_C\big[\text{Pr}(Y|X,C)\big]; there is an intuition that these probabilities ought to equal \text{Pr}\big(Y|X,C=\text{E}[C]\big), but they are not the same: because the logit link is nonlinear, averaging probabilities over C is not the same as plugging in the average value of C, which still amounts to conditioning on C.
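This distinction is easy to verify with arithmetic. The sketch below uses a hypothetical conditional model \text{logit Pr}(Y=1|x,c) = 2x + 2c with C \sim \text{Bernoulli}(0.5) independent of X (the coefficient values are made up for illustration):

```python
import math

def sigmoid(z):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical conditional model: logit Pr(Y=1 | x, c) = 2*x + 2*c,
# with C ~ Bernoulli(0.5) independent of X; evaluate at x = 1
x = 1.0
p_marginal = 0.5 * sigmoid(2*x) + 0.5 * sigmoid(2*x + 2)  # E_C[Pr(Y|X,C)]
p_plugin = sigmoid(2*x + 2 * 0.5)                          # Pr(Y|X, C = E[C])
```

The two probabilities differ (about 0.931 vs. 0.953 here) precisely because the inverse logit is nonlinear, so averaging over C does not commute with applying the link.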

The second rhetorical question from above asked, “how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?” The answer is that we interpret them conditional on all and only the covariates that were included in the model. Again, conceptually speaking, the coefficients refer to a population in which we know the values of the covariates represented in the model and nothing more. There’s no problem with comparing these coefficients between samples or over time as long as these coefficients refer to the same population, that is, populations where the same sets of covariates are observed.

As for comparing coefficients between models with different covariates? Here we must agree with Mood and Allison that, in most cases, these comparisons are probably not informative. But this is not because of “unobserved heterogeneity.” It’s because these coefficients refer to different populations of units. In terms of models A and B from above, \beta^A_1 and \beta^B_1 represent completely different conceptual quantities and it’s a mistake to view estimates of \beta^B_1 as somehow being deficient estimates of \beta^A_1. As a more general rule, parameters from different models usually mean different things—compare them at your peril. In the logistic regression case, there may be situations where it makes sense to compare estimates of \beta^A_1 with estimates of \beta^B_1, but not because one thinks they ought to be estimating the same quantity.

Footnotes and References

1 Or which, at least, are not fucked for the given reasons, although they could still be fucked for unrelated reasons.
2 This stuff is also true for some survival analysis models, notably Cox regression.
3 At least, I think they are… a definition of “substantive effects” is never given (are they like causal effects?), but presumably they’re something we want in an interpretation.

Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2), 186-208.

Kuha, J., & Mills, C. (2017). On group comparisons with logistic regression models. Sociological Methods & Research, 0049124117747306.

Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.

Pang, M., Kaufman, J. S., & Platt, R. W. (2013). Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Statistical Methods in Medical Research, 25(5), 1925-1937.

Rohwer, G. (2012). Estimating effects with logit models. NEPS Working Paper No. 10, German National Educational Panel Study, University of Bamberg.

Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates.