You’ve likely heard that the flooding in Houston following Hurricane Harvey is reportedly a 500-year or even 1000-year flood event. You’ve perhaps also heard that this is the third 500-year flood that Houston has experienced in a three-year span, which calls into serious question the usefulness or accuracy of the “500-year flood” designation. This made me wonder: * what’s our actual best estimate of the yearly risk for a Harvey-level flood, according to the data?* That is the question I will attempt to answer here.

As a spoiler, I estimate that Harvey caused ** roughly 50 – 120 year flood events** at most sites in Harris county (the Texas county where Houston is located) where we have data—

*but*the estimates depend strongly on the specific site, with estimates at some sites being unremarkably small, and estimates at one site (Vince Bayou at Pasadena, TX) indeed hovering near

**the 1000 year neighborhood**. In the rest of this post I will describe how I arrived at these estimates. I’ve posted an R Markdown HTML document which gives all the code for reproducing this analysis, plus some additional tidbits.

Disclaimer: I am not a hydrologist. In fact, I know almost nothing about hydrology. What I am is an out-of-work data scientist. I don’t claim that the analysis I present below is particularly savvy or sophisticated — it’s possible I’ve overlooked some issues that would be more obvious to the hydrologically initiated — but I do think my analysis at least addresses all or most of the major methodological issues that need to be handled in such an analysis.

### The *N*-year flood concept

As the sources I linked above explain very clearly, the *N*-year flood is a probabilistic notion based on the estimated chance that flood waters will reach a certain height in any given year. In a nutshell, an *N*-year flood is defined as the flood water height that a given area has a 1/*N* probability of reaching in a particular year. For example, a 500-year flood has a probability 1/500 or 0.2% chance of occurring in a particular year, *by definition*. Now, it does not automatically follow that an *N*-year flood is expected to occur about every *N* years on average, but that conclusion *does* follow if we add the somewhat reasonable assumption that the flood events are independent from one year to the next^{1}. That assumption is probably not exactly true, due to factors like El Niño / La Niña cycles (or Atlantic Niño in Harvey’s case), but probably a good enough approximation that the intuitive interpretation “500-year floods are expected every 500 years” is reasonably true.

Now, it would be fantastically unlikely for Houston to experience three 500-year floods in three years if these were “truly” 500-year floods. A much more likely explanation is that the underlying risk estimate — 0.2% chance of occurring in a year — is based on a bad statistical model. We can use the historical stream gage^{2} height data from the U.S. Geological Survey (USGS; see also Harris county, TX’s very nice interface) to come up with a better estimate.

Here’s a very high-level outline of what the analysis looks like:

- Query the USGS web server to retrieve the last 20 years of gage height data from all stream gage sites in Harris County, TX.
- Detrend the data using a one-year rolling median.
- Estimate the cumulative distributive function (CDF) for each site’s data using kernel methods. Then locate Harvey in the CDFs to obtain the daily risk estimates
- Convert the daily risk estimates to
*N*-year flood estimates by making some assumptions about the form of temporal dependence. - Bootstrap to obtain estimates of uncertainty.

Retrieving the data is not too exciting, so I won’t say much about it in this summary post other than to point out the USGS Daily Values Web Service, which is a really nice system the USGS has set up for retrieving nicely formatted data via HTTP requests. (For those interested in this sort of thing, all the details and syntax are documented in the R Markdown document.) Instead we’ll jump ahead to the part where we have a nice, clean dataset.

### Detrending the data

The data are daily mean gage heights for 21 stream gages spread through Harris County, TX. There are more than 21 gage sites in the county, but these are the ones that had mean gage height data available^{3}. Here’s what the data look like at each site:

It mostly looks pretty well-behaved, with the obvious exception being some huge, discontinuous jumps in gage height that appear at some of the sites. The top 6 sites show the biggest jumps ; as you can see, these are all (and the only ones) from Cypress Creek. I have no idea what went down at Cypress Creek to cause these discontinuities, or why the jumps don’t at least all happen at or around the same time.

In principle what we care about in measuring these flood levels are the deviations from the *typical* stream heights, whatever those typical heights might be at that time, even if they have changed over the years (whether that’s through a slow drift or a big change all at once). In other words, we are interested in the detrended data. Plus, more practically speaking, we want to be able to still use the data prior to those discontinuous jumps in stream height that we saw above. One way or another, that requires accounting for the jumps by detrending them out.

A rolling median is a good candidate method for a nonparametric estimator of the trend line here because (a) it won’t be affected much by big, sudden shocks to the gage height (e.g., from floods), thereby leaving that variability to go into the CDF, which we want; and (b) in theory, a rolling median can model the discontinuous jumps much better than, say, a rolling mean because the median is not a “smooth” statistic like the mean is.

There is the question of the optimal window width for the rolling median (i.e., how many observations to take the median over in each direction). One idea would be to regard this as a hyperparameter to be tuned via cross-validation, * but* note that in this case minimizing the expected prediction error is not really our goal. Rather, our goal is to remove the biggest, coarsest trends in the data, while leaving all of the variability around that trend line in the data so that it will go into the estimated CDF. The trend line that minimizes the expected prediction error would actually follow the data a bit too closely for our purposes, getting pulled upward by large flood events and thereby eating up some of the variability that we want to keep. In other words, we want the trend line to underfit the data to some extent.

With a little experimentation I found that a one-year window worked pretty well. Here’s an example of what that looks like at a single representative site:

Looks great! The trend line remains relatively flat even in the presence of large local floods, but still perfectly captures the weird discontinuous jump in the time series as well as any very slow drifts over time.

Let’s fit one-year moving medians to all of the sites, detrend the data around those median lines, and then examine the resulting detrended data:

It looks pretty good! There are a handful of odd downward spikes that are probably the result of the median line not capturing the discontinuous jumps as gracefully as possible, but these seem small enough not to worry about.

### Estimating the CDFs

This step is pretty straightforward, at least thanks to the sROC package, which has nice functionality for kernel estimation of CDFs. Here’s an example of what the probability density function (PDF) and CDF look like for one representative site, and where the Harvey floods fall in this distribution.

Okay, so now we can look up the Harvey flood in the CDFs for each site, which will tell us the proportion of the distribution that is less than the Harvey flood. Let one minus this proportion be , the probability of a flood at least as great as Harvey on a particular day. Note that it’s for a particular day, not a particular year, because our data are *daily* gage heights. Now we just need to convert this to , the probability for a particular year, and then take to obtain the *N*-year flood estimate. So how do we do the conversion?

### Modeling the temporal dependence

If the daily flood events were independent then we could use

,

which is the probability of at least one “success” from a Binomial distribution with 365 trials. But the daily flood events aren’t really independent: if there’s a flood event on one day, that substantially raises the probability that there will still be a flood event on the next day, because it takes a few days for the water levels to recede to below flood level. For example, here’s what the data look like from the big floods in Houston last year, summer 2016, again from our favorite representative gage site:

I drew a hypothetical flood height level to represent a particular *N*-year flood, just for illustrative purposes (I have no idea what flood level this line would actually correspond to). We can see that in this particular case, we had a string of five *N*-year flood days in a row. More generally, this should make it clear that we expect the daily flood events to be dependent in that they tend to come in clusters. So to convert to for Harvey, we need a model of this dependence.

A simple model is that the daily flood events show Markov or one-day-back dependence: the probability of a flood event today depends on the presence of a flood event yesterday, but is conditionally independent of flooding 2 days ago given the flood status yesterday. For this model we add a parameter : the conditional probability of a flood event, given that there was a flood event the previous day. Then the conversion formula from above becomes^{4}

,

which reduces to the previous formula when , meaning the events are independent.

To get a feel for how this new parameter affects the model estimates, we can plot the *N*-year estimates from all 21 sites as a function of the assumed value of (with set to the previously estimated values):

As you can see, the assumed value of doesn’t have a huge impact until it gets up to around, say, 0.6 or so, but then it starts to matter quite a lot. To think about what plausible values of might be, it helps to notice that, according to our dependence model, flood durations follow a Geometric distribution with parameter , implying that the expected flood duration is days. So, for example, when , then we expect it to take an average of 4 days before water levels recede to below the relevant *N*-year flood level (but not necessarily back to normal levels).

We’ve already developed a good estimate of using the kernel CDFs. To estimate , basically we can (a) choose a particular flood height, ; (b) categorize the gage heights at any given time point as being either above or below , giving us or , respectively; and (c) create a contingency table of vs. (the value on the previous day), which gives us a table like:

from which we can compute an estimate of as . Now this estimate, of course, depends on the flood height that we chose, and it’s certainly possible that there could really be different values of for different flood heights. So here we plot the estimated values of for all 21 sites, for heights ranging from the 90th percentile to whatever percentile corresponds to the Harvey floods at each site (in other words, for that site):

Immediately we see that there is considerable variation across gages in their level of dependence, and that the degree of dependence depends on the flood height (in an interesting, non-monotonic way). Because the estimates depend fairly strongly on the site and on the height of flooding, it wouldn’t be a good idea to use an aggregate estimate that averages across all sites and/or all flood heights. Instead, for our *N*-year flood estimates we will use each site’s individual estimate at the value corresponding to the Harvey flood height at that site.

We now have point estimates of and for all sites, which means we could technically now compute our *N*-year flood estimates for each site. But to use only the point estimates would be to ignore the substantial uncertainty associated with these estimates! Furthermore, because we obtain the *N*-year estimates from a convex transformation of and , using only the point estimates would lead us to systematically underestimate the *N*-year flood levels (due to Jensen’s inequality). So we will use bootstrapping to preserve the variability of the estimates.

### Bootstrapping to account for uncertainty

Because our data are time series, a simple nonparametric bootstrap is not a good idea because the temporal structure of the data (such as the autocorrelation) would be destroyed in each bootstrap sample, which would ruin our estimates (although the estimates would probably still be okay). So here we will apply a more specialized bootstrapping method designed to work with time series data, called the stationary bootstrap. Then we will basically treat each bootstrap-resampled distribution as a poor man’s posterior distribution and go from there. Here’s what we get when we run the stationary bootstrap on a single, representative site:

…Huh. The bootstrap results are not totally crazy, but there are clearly a couple of issues. First and most obviously, the bootstrap estimates of have come out sort of discretized for some reason—there are only 6 unique values—and with a substantial fraction of the estimates equal to exactly zero (which makes them unusable). Do we really think the posterior distribution for truly has this kind of shape? Well, no, probably not. So, if you’ll indulge me in some admittedly ad-hoc adjustment, I’d like to apply a correction to “smooth out” the marginal distribution of the estimates. You can think of this as applying a bit of regularization “by hand.”

Now, I’m willing to trust that the mean and standard deviation of this distribution are more or less right; it’s just the shape that I don’t trust. So I’ll apply a two-stage^{5} transformation that reshapes the empirical distribution into the Beta distribution with the same mean and variance. Note that we “reshape” the existing distribution of values, rather than simply taking new draws from the corresponding Beta distribution, so that we preserve the form of the dependence between the and estimates. So here’s a graphical comparison of the old and new marginal distributions:

Now if we turn our attention to the bootstrap estimates of , we see a less obvious but still somewhat strange issue, which is that the conditional distribution at has a weird bimodal shape—in particular, there is an odd cluster of estimates that are effectively 0. Closer inspection of the estimates as a function of the flood height sheds some light on what’s going on here:

Recall from the Figure 8 caption that the estimates are predicted values from a weighted GAM fit (on the logit-logit scale) to the simple estimates resulting from the procedure illustrated Figure 7, because the simple estimates are rough and unstable, particularly at extreme values of the flood height . Basically what is happening here is that, for a nontrivial minority of the bootstrap-resampled datasets, the simple estimates become empirically undefined (i.e., they involve division by zero) above a certain value of , because by chance the resampled dataset ended up with no data points near or beyond that value of . So when that happens, the GAM simply does a linear extrapolation from the highest non-missing to the value corresponding to Harvey, and uses the extrapolated value as the bootstrap estimate. That works fine in a lot of cases, but in other cases it can yield wacky/divergent estimates, depending on where along the sequence the simple estimates become missing.

It seems like a good idea to discard the divergent bootstrap estimates. One way to do this is to fit a mixture model to the bootstrap estimates, use the model to classify each estimate as either probably coming from the “true” bootstrap distribution or probably coming from a (much wider) contaminant distribution, and then to discard the estimates classified as probably being contaminants. That process is illustrated in the right panel of Figure 11, which shows that it works pretty well, at least in the sense that the mixture model seems pretty plausible.

After applying this correction, along with the previously described adjustment to the estimates, we end up with an adjusted joint distribution of bootstrap estimates that looks like this:

So if we compute the adjusted bootstrap distributions for all 21 sites, then after much ado, that gives us the following final estimates for each site:

### Conclusion and limitations

As you can see, most of the *N*-year flood estimates have a *lot* of uncertainty around them. Estimating the risk of extremely rare events is an inherently hard statistical problem, so it’s not surprising that it’s hard to pin down precise estimates.

Of course, the method I’ve applied here is just one way to approach the problem. Basically it’s the first thing I thought of. I initially intended this to be a rough-and-ready analysis to satisfy my curiosity, but after following the rabbit hole of Markov dependence and bootstrapping complications a bit further than I anticipated, I’m not so sure if the “rough-and-ready” description is still apt.

There is a temptation to try to aggregate in some way the individual *N*-year flood estimates at each site, in an attempt to estimate the *N*-year flood level of the Harvey floods as a single, atomic event. After all, media reports of the risk/severity of the Harvey floods just present a single figure like “500 year flood” or “1000 year flood,” not broken down by site. This kind of aggregation is problematic in at least a couple ways. First, on a more conceptual level, it’s not clear which sites should be aggregated over. A simple approach might be to simply aggregate over all the sites in Harris county, which is sort of what I did (I mean, I avoided aggregation, but I did simply analyze all the sites in Harris county). According to my girlfriend—who actually does know something about hydrology—what would probably make more sense would be to aggregate over all the sites within a particular watershed, ignoring geologically arbitrary county lines. Second, on a more methodological level, the statistical problem of estimating the probability of a rare event across the entire constellation of 21 sites is quite formidable. For example, a simple extension of my method would involve first estimating a 21-dimensional joint density of the flood heights at all gage sites… the thought of which is enough to make one cringe. (I actually spent quite a while trying to get this to work, but ultimately just didn’t trust the answers I was getting, so I decided not to post any of that here.)

This method I used here does not take into account the fairly obvious possibility that the yearly risk of Harvey-level flood events may not be constant, but instead may be increasing over time. Including such time-dependence in the model would make this already difficult statistical problem even more challenging. This increasing-risk hypothesis is certainly plausible, but is quite complicated to think about because it depends on a lot of factors, such as the rate and nature of current/future urban development in the Houston area, preparatory measures that may be taken up by Houston, the possibly increasing frequency of large hurricanes due to climate change, etc. Regarding this last point, it seems that the evidence for an increasing frequency of Category 4/5 hurricanes is rather mixed.

Finally, there is some reason to think that almost any method applied immediately in the wake of Hurricane Harvey will lead to some degree of overestimation of the probability of similar events in the future (or, equivalently, *N*-year flood estimates that are too small). This point is actually quite subtle, but bear with me. Hypothetically, across many possible timelines, if we always waited for a particular event (such as a flood of a certain size) to occur and then estimated the probability of that event immediately after it occurred, our probability estimates would end up inflated to some extent (on average across the timelines) due to a form of selection bias that seems related to the recently renewed discussion of hot hand effects. For a simplified model of how this selection bias would work, consider a situation where we are observing a sequence of tosses of an unfair coin that comes up heads (H) with probability , and we construct an estimate by waiting until we observe the first H and then computing the proportion of H outcomes in the sequence. Then

,

where is the probability mass function for a Geometric distribution. On average, this exceeds when is less than about 0.35 or so, and by quite a lot when H is rare. For example, if , then , so that the estimates overstate the probability of heads by almost a factor of 5 on average. This is not exactly what’s going on in our Harvey analysis, but it’s quite similar. It’s hard to know what to do (if anything) about this kind of selection bias, but it’s interesting to think about.

### Footnotes

^{1} More specifically, that floods follow a Bernoulli process over years or a Poisson process in continuous time

^{2} The USGS consistently uses the spelling “gage” instead of the more familiar (to me) “gauge.” Your guess as to why is as good as mine. Anyway, I’ll follow the USGS’s spelling convention.

^{3} Note that I use the daily *means* here, even though arguably it might make more sense to use the daily *maximums*. But unfortunately the daily maximums are only available for 3 of the gages in Harris county.

^{4} See the R Markdown document for the derivation of this.

^{5} The first step of the transformation—transforming the observed distribution to a Uniform distribution—is achieved in this case by taking , where is the number of values, and *breaking ties randomly* so that the 6 unique values become a smooth distribution.