Summary: Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to.
A few months ago I wrote a blog post on using causal graphs to understand missingness and how to deal with it, which concluded on a rather positive note:
“While I generally agree that pretty much everything is fucked in non-hard sciences, I think the lessons from this analysis of missing data are actually quite optimistic…”
Well, today I write about another statistical issue with a basically optimistic message. So I decided this could be like the second installment in a “some things are not fucked” series, in which I look at a few issues/methods that some people have claimed are fucked, but which are in fact not fucked.1
Wait, who said logistic regression was fucked?
The two most widely-read papers sounding the alarm bells seem to be Allison (1999) and Mood (2010). The alleged problems are stated most starkly by Mood (2010, pp. 67-68):
- “It is problematic to interpret [logistic regression coefficients] as substantive effects, because they also reflect unobserved heterogeneity.
- It is problematic to compare [logistic regression coefficients] across models with different independent variables, because the unobserved heterogeneity is likely to vary across models.
- It is problematic to compare [logistic regression coefficients] across samples, across groups within samples, or over time—even when we use models with the same independent variables—because the unobserved heterogeneity can vary across the compared samples, groups, or points in time.”
These are pretty serious allegations.3 These concerns have convinced some people to abandon logistic regression in favor of the so-called linear probability model, which just means using classical regression directly on the binary outcome, although usually with heteroskedasticity-robust standard errors. To the extent that’s a bad idea—which, to be fair, is a matter of debate, but is probably ill-advised as a default method at the very least—it’s important that we set the record straight.
The allegations refer to “unobserved heterogeneity.” What exactly is this unobserved heterogeneity and where does it come from? There are basically two underlying lines of argument here that we must address. Both arguments lead to a similar conclusion—and previous sources have sometimes been a bit unclear by drawing from both arguments more or less interchangeably in order to reach this conclusion—but they rely on fundamentally distinct premises, so we must clearly distinguish these arguments and address them separately. The counterarguments I describe below are very much in the same spirit as those of Kuha and Mills (2017).
First argument: Heteroskedasticity in the latent outcome
A standard way of motivating the probit model for binary outcomes (e.g., from Wikipedia) is the following. We have an unobserved/latent outcome variable $Y^*$ that is normally distributed, conditional on the predictor $X$. Specifically, the model for the $i$th observation is

$$Y_i^* = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim \text{Normal}(0, \sigma^2).$$

The latent variable $Y^*$ is subjected to a thresholding process, so that the discrete outcome we actually observe is

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > \tau \\ 0 & \text{if } Y_i^* \le \tau. \end{cases}$$

This leads the probability of $Y = 1$ given $X$ to take the form of a Normal CDF, with mean and standard deviation a function of $\beta_0$, $\beta_1$, $\sigma$, and the deterministic threshold $\tau$. So the probit model is basically motivated as a way of estimating $\beta_1$ from this latent regression of $Y^*$ on $X$, although on a different scale. This latent variable interpretation is illustrated in the plot below, from Thissen & Orlando (2001). These authors are technically discussing the normal ogive model from item response theory, which looks pretty much like probit regression for our purposes.
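To make the latent-variable story concrete, here is a minimal sketch that simulates a normally distributed latent outcome, thresholds it, and compares the resulting proportion of 1s with the Normal CDF formula. The notation (intercept `beta0`, slope `beta1`, residual SD `sigma`, threshold `tau`) and all parameter values are assumptions chosen for illustration; none of them come from any real data.

```python
import numpy as np
from statistics import NormalDist


def probit_probability(x, beta0, beta1, sigma, tau):
    """P(Y = 1 | X = x) implied by thresholding a normal latent outcome at tau."""
    return NormalDist().cdf((beta0 + beta1 * x - tau) / sigma)


def simulate_proportion(x, beta0, beta1, sigma, tau, n=200_000, seed=0):
    """Simulate the latent outcome at a fixed x and return the share above tau."""
    rng = np.random.default_rng(seed)
    y_star = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    return float(np.mean(y_star > tau))


# Hypothetical parameter values, for illustration only.
beta0, beta1, sigma, tau = -1.0, 0.8, 1.5, 0.5

for x in (-2.0, 0.0, 2.0):
    simulated = simulate_proportion(x, beta0, beta1, sigma, tau)
    formula = probit_probability(x, beta0, beta1, sigma, tau)
    print(f"x = {x:+.1f}: simulated {simulated:.3f}, Normal CDF {formula:.3f}")
```

Note that only the ratio of (intercept + slope times x minus threshold) to the residual SD enters the formula, which is why the latent coefficients are only recoverable "on a different scale."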
We can interpret logistic regression in pretty much exactly the same way. The only difference is that now the unobserved continuous $Y^*$ follows not a normal distribution, but a similarly bell-shaped logistic distribution given $X$. A theoretical argument for why $Y^*$ might follow a logistic distribution rather than a normal distribution is not so clear, but since the resulting logistic curve looks essentially the same as the normal CDF for practical purposes (after some rescaling), it won’t tend to matter much in practice which model you use. The point is that both models have a fairly straightforward interpretation involving a continuous latent variable $Y^*$ and an unobserved, deterministic threshold $\tau$.
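As a quick check on the claim that the two curves nearly coincide after rescaling, the sketch below compares the standard normal CDF with a logistic CDF whose argument is multiplied by 1.702. That constant is a commonly used rescaling factor and is an assumption here, not something taken from the post.

```python
import numpy as np
from statistics import NormalDist


def logistic_cdf(x):
    """Standard logistic CDF, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


# Evaluate both curves on a fine grid.
grid = np.linspace(-6.0, 6.0, 2401)
normal_cdf = np.array([NormalDist().cdf(v) for v in grid])

scale = 1.702  # assumed rescaling constant for matching logistic to normal
max_gap = float(np.max(np.abs(logistic_cdf(scale * grid) - normal_cdf)))

print(f"largest gap between the two curves: {max_gap:.4f}")
```

The largest vertical gap is under one percentage point, which is why the choice between probit and logit rarely matters for the fitted probabilities.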
So the first argument for logistic regression being fucked due to “unobserved heterogeneity” comes from asking: What if the residual variance $\sigma^2$ of the latent outcome $Y^*$ is not constant, as assumed in the model above, but instead is different at different values of $X$? Well, it turns out that, as far as our observable binary outcome $Y$ is concerned, this heteroskedasticity can be totally indistinguishable from the typically assumed situation where $\sigma^2$ is constant and the mean of $Y^*$ is increasing (or decreasing) with $X$. This is illustrated in Figure 2 below.
Suppose we are comparing the proportions of positive responses (i.e., $Y = 1$) between two groups of observations, Group 1 and Group 2. To give this at least a little bit of context, maybe Group 2 are human subjects of some social intervention that attempts to increase participation in local elections, Group 1 is a control group, the observed binary $Y$ is whether the person voted in a recent election, and the latent continuous $Y^*$ is some underlying “propensity to vote.” Now we observe a voting rate of 10% in the control group (Group 1) and 25% in the experimental group (Group 2). In terms of our latent variable model, we’d typically assume this puts us in Scenario A from Figure 2: The intervention increased people’s “propensity to vote” ($Y^*$) on average, which pushed more people over the threshold in Group 2 than in Group 1, which led to a greater proportion of voters in Group 2.
The problem is that these observed voting proportions can be explained equally well by assuming that the intervention had 0 effect on the mean propensity to vote, but instead just led to greater variance in the propensities to vote. As illustrated in Scenario B of Figure 2, this could just as well have led the voting proportion to increase from 10% in the control group to 25% in the experimental group, and it’s a drastically different (and probably less appealing) conceptual interpretation of the results.
Another possibility that would fit the data equally well (but isn’t discussed as often) is that the intervention had no effect at all on the distribution of the latent propensity $Y^*$, but instead just lowered the threshold $\tau$ for Group 2, so that even people with a lower “propensity to vote” were able to drag themselves to the polls. This is illustrated in Scenario C of Figure 2.
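The three scenarios can be worked through numerically. The sketch below assumes the control group's latent propensity is standard normal (an arbitrary choice of scale, not something the data can tell us) and then finds the mean shift, the SD inflation, and the threshold change that each move the voting rate from 10% to 25% on their own.

```python
from statistics import NormalDist

standard = NormalDist()  # assumed latent distribution in the control group


def proportion_above(mean, sd, threshold):
    """Share of a Normal(mean, sd) latent variable that exceeds the threshold."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


# Threshold that yields a 10% voting rate in the control group.
threshold = standard.inv_cdf(1 - 0.10)
control = proportion_above(0.0, 1.0, threshold)

# z-value that leaves 25% of a standard normal above it.
z25 = standard.inv_cdf(1 - 0.25)

mean_shift = threshold - z25  # Scenario A: shift the mean, same SD and threshold
sd_ratio = threshold / z25    # Scenario B: inflate the SD, same mean and threshold
new_threshold = z25           # Scenario C: lower the threshold, same distribution

scenarios = {
    "A (mean shift)": proportion_above(mean_shift, 1.0, threshold),
    "B (larger variance)": proportion_above(0.0, sd_ratio, threshold),
    "C (lower threshold)": proportion_above(0.0, 1.0, new_threshold),
}

print(f"control: {control:.3f}")
for name, p in scenarios.items():
    print(f"scenario {name}: {p:.3f}")
```

All three scenarios reproduce the same observed proportions exactly, so a single binary indicator cannot distinguish them.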
So you can probably see how this supports the three allegations cited earlier. When we observe a non-zero estimate for a logistic regression coefficient, we can’t be sure this actually reflects a shift in the mean of the underlying continuous latent variable (e.g., increased propensity to vote), because it also reflects latent heteroskedasticity, and we can’t tell these two explanations apart. And because the degree of heteroskedasticity could easily differ between models, between samples, or over time, even comparing logistic regression coefficients to one another is problematic…if shifts in the underlying mean are what we care about.
Are shifts in an underlying mean what we care about?
There’s the crux of the matter. This entire first line of argument presupposes that all of the following are true for the case at hand:
- It makes any conceptual sense to think of the observed binary $Y$ as arising from a latent continuous $Y^*$ and a deterministic threshold.
- We actually care how much of the observed effect on $Y$ is due to mean shifts in $Y^*$ vs. changes in the variance of $Y^*$.
- We’ve observed only a single binary indicator of $Y^*$, so that Scenarios A and B from Figure 2 are empirically indistinguishable.
In my experience, usually at least one of these is false. For example, if the observed binary $Y$ indicates survival of patients in a medical trial, what exactly would an underlying $Y^*$ represent? It could make sense for the patients who survived (maybe it represents their general health or something), but surely all patients with $Y = 0$ are equally dead! Returning to the voting example, we can probably grant that #1 is true: it probably does make conceptual sense to think about an underlying, continuous “propensity to vote.” But #2 is probably false: I couldn’t care less if the social intervention increased voting by increasing propensity to vote, spreading out the distribution of voting propensities, or just altering the threshold that turns the propensity into voting behavior… I just want people to vote!
Finally, when #1 and #2 are true, so that the investigator is primarily interested not in the observed $Y$ but rather in some underlying latent $Y^*$, in my experience the investigator will usually have taken care to collect data on multiple binary indicators of $Y^*$. In other words, #3 will be false. For example, if I were interested in studying an abstract $Y^*$ like “political engagement,” I would certainly view voting as a binary indicator of that, but I would also try to use data on things like whether that person donated money to political campaigns, whether they attended any political conventions, and so on. And when there are multiple binary indicators of $Y^*$, it then becomes possible to empirically distinguish Scenario A from Scenario B in Figure 2, using, for example, statistical methods from item response theory.
These counterarguments are not to say that this first line of argument is invalid or irrelevant. The premises do lead to the conclusion, and there are certainly situations where those premises are true. If you find yourself in one of those situations, where #1-#3 are all true, then you do need to heed the warnings of Allison (1999) and Mood (2010). The point of these counterarguments is to say that, far more often than not, at least one of the premises listed above will be false. And in those cases, logistic regression is not fucked.
Okay, great. But we’re not out of the woods yet. As I mentioned earlier, there’s a second line of argument that leads us to essentially the same conclusions, but that makes no reference whatsoever to a continuous latent $Y^*$.
Second argument: Omitted non-confounders in logistic regression
To frame the second argument, first think back to the classical regression model with two predictors, a focal predictor $X$ and some covariate $Z$:

$$\text{(A)} \qquad Y_i = \beta_0^A + \beta_1^A X_i + \beta_2^A Z_i + \varepsilon_i^A$$

Now suppose we haven’t observed the covariate $Z$, so that the regression model we actually estimate is

$$\text{(B)} \qquad Y_i = \beta_0^B + \beta_1^B X_i + \varepsilon_i^B$$

In the special case where $Z$ is uncorrelated with $X$, we know that $\beta_1^A = \beta_1^B$, so that our estimate of the slope for $X$ will, on average, be the same either way. The technical name for this property is collapsibility: classical regression coefficients are said to be collapsible over uncorrelated covariates.
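Collapsibility in classical regression is easy to see in a simulation. The sketch below generates a covariate that is independent of the focal predictor, then fits ordinary least squares with and without it. The coefficient values, sample size, and the `ols` helper are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)  # focal predictor
z = rng.normal(size=n)  # covariate, generated independently of x
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # hypothetical true model


def ols(columns, outcome):
    """Least-squares coefficients for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(outcome)), *columns])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


slope_with_z = ols([x, z], y)[1]   # slope for x with the covariate included
slope_without_z = ols([x], y)[1]   # slope for x with the covariate omitted

print(f"slope for x, covariate included: {slope_with_z:.3f}")
print(f"slope for x, covariate omitted:  {slope_without_z:.3f}")
```

Both fits recover a slope close to 2 for the focal predictor; omitting the covariate only makes the estimate noisier, not systematically different.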
It turns out that logistic regression coefficients do not have this collapsibility property. If a covariate that’s correlated with the binary outcome is omitted from the logistic regression equation, then the slopes for the remaining observed predictors will be affected, even if the omitted covariate is uncorrelated with the observed predictors. Specifically, in the case of omitting an uncorrelated covariate, the observed slopes will be driven toward 0 to some extent.
This is all illustrated below in Figure 3, where the covariate is shown as a binary variable (color = red vs. blue) for the sake of simplicity.
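The attenuation can be computed exactly in a small case, with no simulation needed. The sketch below assumes a binary focal predictor and a binary covariate that is independent of it with probability 0.5 of each value; the coefficient values are hypothetical. The slope is the same within each covariate group, yet the slope obtained after averaging over the covariate is closer to 0.

```python
import math


def expit(v):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients of the model that includes the covariate.
beta0, beta1, beta2 = -1.0, 2.0, 3.0


def conditional_prob(x, z):
    """P(Y = 1 | X = x, Z = z) under the logistic model with the covariate."""
    return expit(beta0 + beta1 * x + beta2 * z)


def marginal_prob(x, p_z=0.5):
    """P(Y = 1 | X = x), averaging over a binary Z that is independent of X."""
    return (1 - p_z) * conditional_prob(x, 0) + p_z * conditional_prob(x, 1)


# Log odds ratio for x within each covariate group (equals beta1 in both).
conditional_slopes = [
    logit(conditional_prob(1, z)) - logit(conditional_prob(0, z)) for z in (0, 1)
]

# Log odds ratio for x when the covariate is ignored.
marginal_slope = logit(marginal_prob(1)) - logit(marginal_prob(0))

print("within-group slopes:", [round(s, 3) for s in conditional_slopes])
print("slope ignoring the covariate:", round(marginal_slope, 3))
```

The covariate is independent of the focal predictor by construction, so the gap between the two slopes is not confounding; it comes purely from averaging probabilities on a nonlinear scale.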
So now we can lay out the second line of argument against logistic regression. Actually, the most impactful way to communicate the argument is not to list out the premises, but instead to use a sort of statistical intuition pump. Consider the data in the right-hand panel of Figure 3. The slope (logistic regression coefficient) of $Y$ on $X$ is the same for both the red group and the blue group; call it $\beta_1^A$. But suppose the color grouping factor is not observed, so that we can only fit the simple/unconditional logistic regression that ignores the color groups. Because of the non-collapsibility of logistic regression coefficients, the slope from this regression (shown in black in Figure 3) is shallower; call it $\beta_1^B$, with $|\beta_1^B| < |\beta_1^A|$. But if the slope is $\beta_1^A$ among both the red and the blue points, and if every point is either red or blue, then who exactly does the slope $\beta_1^B$ apply to? What is the substantive interpretation of this slope?
For virtually every logistic regression model that we estimate in the real world, there will be some uncorrelated covariates that are statistically associated with the binary outcome, but that we couldn’t observe to include in the model. In other words, there’s always unobserved heterogeneity in our data on covariates we couldn’t measure. But then—the argument goes—how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?
These are rhetorical questions. The implication is that no meaningful interpretation is possible—or, as Mood (2010, p. 67) puts it, “it is problematic to interpret [logistic regression coefficients] as substantive effects.” I beg to differ. As I argue next, we can interpret logistic regression coefficients perfectly well even in the face of non-collapsibility.
Logistic regression coefficients are about conditional probabilities
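Specifically, we can write logistic regression models directly analogous to models A and B from above as:

$$\text{(A)} \qquad g\big(\Pr(Y = 1 \mid X = x, Z = z)\big) = \beta_0^A + \beta_1^A x + \beta_2^A z$$

$$\text{(B)} \qquad g\big(\Pr(Y = 1 \mid X = x)\big) = \beta_0^B + \beta_1^B x$$

where $g(p) = \log\big(p / (1 - p)\big)$ is the logit link function. As the left-hand-sides of these regression equations make clear, $\beta_1^A$ tells us about differences in the probability of $Y$ as $X$ increases conditional on the covariate $Z$ being fixed at some value $z$, while $\beta_1^B$ tells us about differences in the probability of $Y$ as $X$ increases marginal over $Z$. There is no reason to expect these two things to coincide in general unless $\Pr(Y \mid X, Z) = \Pr(Y \mid X)$, which we know from probability theory is only true when $Y$ and $Z$ are conditionally independent given $X$; in terms of our model, when $\beta_2^A = 0$.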
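The link between the conditional and marginal probabilities can be spelled out with the law of total probability. The derivation below is a sketch in assumed notation: $g$ is the logit link, $Z$ is a discrete covariate independent of $X$, and the superscript $A$ marks coefficients of the model that includes $Z$.

```latex
\begin{align*}
\Pr(Y = 1 \mid X = x)
  &= \sum_{z} \Pr(Y = 1 \mid X = x, Z = z)\,\Pr(Z = z \mid X = x) \\
  &= \sum_{z} g^{-1}\!\big(\beta_0^A + \beta_1^A x + \beta_2^A z\big)\,\Pr(Z = z)
     && \text{($Z$ independent of $X$)} \\[4pt]
g\big(\Pr(Y = 1 \mid X = x)\big)
  &= g\!\Big(\sum_{z} g^{-1}\!\big(\beta_0^A + \beta_1^A x + \beta_2^A z\big)\,\Pr(Z = z)\Big) \\
  &\neq \sum_{z} \big(\beta_0^A + \beta_1^A x + \beta_2^A z\big)\,\Pr(Z = z)
     && \text{in general, because $g$ is nonlinear.}
\end{align*}
```

If $\beta_2^A = 0$, every term in the sum is the same and the two sides agree, which recovers the conditional-independence case. With an identity link, as in classical regression, the average passes straight through and the slope for $x$ is unchanged, which is exactly the collapsibility property.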
So now let’s return to the red vs. blue example of Figure 3. We supposed, for illustration’s sake, a slope of $\beta_1^B$ overall, ignoring the red vs. blue grouping, which is shallower than the slope $\beta_1^A$ found within each color group. Then the first rhetorical question from before asked, “who exactly does this slope apply to?” The answer is that it applies to a population in which we know the $X$ values but we don’t know the $Z$ values, that is, we don’t know the color of any of the data points. There’s an intuition that if the slope is $\beta_1^A$ among both the red and blue points, then for any new point whose color we don’t know, we ought to guess that the slope that applies to them is also $\beta_1^A$. But that presupposes that we were able to estimate slopes among both the red and blue groups, which would imply that we did observe the colors of at least some of the points. On the contrary, let me repeat: the slope $\beta_1^B$ applies to a population in which we know the $X$ values but we don’t know any of the $Z$ values. Put more formally, the slope $\beta_1^B$ refers to changes in $\Pr(Y = 1 \mid X = x)$; there is an intuition that these probabilities ought to equal $\Pr(Y = 1 \mid X = x, Z = z)$, but these are not the same because the latter still require conditioning on $Z$.
The second rhetorical question from above asked, “how can we interpret the slopes from any logistic regression model that we estimate, since we know that the estimates would change as soon as we included additional relevant covariates, even when there’s no confounding?” The answer is that we interpret them conditional on all and only the covariates that were included in the model. Again, conceptually speaking, the coefficients refer to a population in which we know the values of the covariates represented in the model and nothing more. There’s no problem with comparing these coefficients between samples or over time as long as these coefficients refer to the same population, that is, populations where the same sets of covariates are observed.
As for comparing coefficients between models with different covariates? Here we must agree with Mood and Allison that, in most cases, these comparisons are probably not informative. But this is not because of “unobserved heterogeneity.” It’s because these coefficients refer to different populations of units. In terms of models A and B from above, $\beta_1^A$ and $\beta_1^B$ represent completely different conceptual quantities and it’s a mistake to view estimates of $\beta_1^B$ as somehow being deficient estimates of $\beta_1^A$. As a more general rule, parameters from different models usually mean different things, so compare them at your peril. In the logistic regression case, there may be situations where it makes sense to compare estimates of $\beta_1^A$ with estimates of $\beta_1^B$, but not because one thinks they ought to be estimating the same quantity.
Footnotes and References
1 Or which, at least, are not fucked for the given reasons, although they could still be fucked for unrelated reasons.
2 This stuff is also true for some survival analysis models, notably Cox regression.
3 At least, I think they are… a definition of “substantive effects” is never given (are they like causal effects?), but presumably they’re something we want in an interpretation.
Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2), 186-208.
Kuha, J., & Mills, C. (2017). On group comparisons with logistic regression models. Sociological Methods & Research, 0049124117747306.
Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.
Pang, M., Kaufman, J. S., & Platt, R. W. (2013). Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Statistical Methods in Medical Research, 25(5), 1925-1937.
Rohwer, G. (2012). Estimating effects with logit models. NEPS Working Paper 10, German National Educational Panel Study, University of Bamberg.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test Scoring (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.