Data Diction

Did Denver’s 2022 ‘Zero Fare for Cleaner Air’ campaign actually work?

Ryan Peterson — Fri, 21 Jul 2023 00:00:00 GMT

A data-dictated look at whether Denver’s free August public transit policy had its intended effect on air quality.

Image credit: National Renewable Energy Laboratory, Colorado State University

Backstory

Most summers, Coloradoans flock to the majestic Rocky Mountains with their beautiful hikes and various mountain activities. This is the case, at least, unless poor air quality forces them indoors. For me, this occurred on a smoky July day in 2020, when surreal “snowing” ash sprinkling down from nearby wildfires forced us to evacuate the pickleball courts.

Between wildfires and pollution, Denver’s summer air often leaves room for improvement. Sadly, the Rockies don’t seem so enticing when they are obscured behind a polluted haze.

In August 2022, I noticed my RTD bus was more crowded than usual, and I was not asked to scan my bus pass. This is how I learned of Denver’s 2022 “Zero fare for cleaner air” initiative. Throughout the month, my packed bus led me to believe the policy did work to increase ridership.

My impression was validated by RTD: according to its final report, ridership rose 22% during the free-fare month and was up 36% from the prior August. This increase led some to conclude that the campaign was a huge success, and it also led to the expansion of the program in 2023.

But wait… the campaign is called “Zero fare for better air”. So for this to really be a success, the policy change should show up as measurably better air quality, not just higher ridership. On this point, the report concluded that “impacts to air quality are difficult to quantify”, attributing the difficulty to the lack of a baseline. So we’re left wondering – did it work? Did we actually have cleaner air in August of 2022?

Recently, my team investigated how the Covid-19 pandemic affected congestion and air quality in cities across the US (we found that it did). In this post I use similar outcomes and methods to determine the impact of this policy in Denver.

Air Quality Data

There are plenty of important pollutants to worry about in our air, but automobile traffic contributes especially to nitrogen dioxide (NO2) and ozone (O3). We’ll consider each of these using data from the EPA’s Air Quality System.

Note

Code and data for this and other blog posts are available here.

NO2

Data visualizations

Here are plots of the historical daily data for NO2 in Denver.

Modeling

We can use this historical data to build a forecast of what August’s NO2 levels would be using forecasting methodology available in the fastTS R package that can handle this kind of seasonal data. The series is logged (+10) prior to modeling. We include weekday and month indicator variables and a natural cubic basis spline for time. We computed 30-day-ahead predictions and tested whether these predictions were significantly different than the observed daily values during the zero-fare period. As some months were easier to forecast than others (August was easier to forecast than winter months), we also use heteroskedasticity-corrected standard errors. To evaluate our model, a 10% test set was held out.
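As a rough illustration of the ingredients described above (a logged outcome, weekday and month indicators, a natural cubic spline for time, and a held-out test set), here is a minimal sketch using base R and simulated data. It is not the fastTS model used in this post: the data frame `sim`, its columns, the reading of "logged (+10)" as log(x + 10), the 5 spline degrees of freedom, the choice to hold out the final 10% of days, and the plain `lm()` fit are all assumptions made for illustration.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(1)
# Simulated stand-in for a daily pollutant series (hypothetical values)
dates <- seq(as.Date("2019-01-01"), as.Date("2021-12-31"), by = "day")
sim <- data.frame(
  date = dates,
  no2  = rgamma(length(dates), shape = 4, rate = 0.3)
)

# Outcome transform: assumed here to mean log(x + 10)
sim$log_no2 <- log(sim$no2 + 10)

# Calendar indicators and a numeric time index for the spline
sim$weekday <- factor(format(sim$date, "%u"))  # 1 = Monday, ..., 7 = Sunday
sim$month   <- factor(format(sim$date, "%m"))
sim$t       <- as.numeric(sim$date)

# Hold out the final 10% of days as a test set (an assumption)
n_train <- floor(0.9 * nrow(sim))
train <- sim[seq_len(n_train), ]
test  <- sim[-seq_len(n_train), ]

# Ordinary least squares stand-in for the forecasting model
fit <- lm(log_no2 ~ weekday + month + ns(t, df = 5), data = train)

# Out-of-sample error on the log scale
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$log_no2 - pred)^2))
rmse
```

The out-of-sample error computed this way is on the same log scale as the figures reported for the real model, though the simulated numbers themselves carry no meaning.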

Our model can predict daily NO2 on the log scale to within about 0.175 units, with an out-of-sample R-squared of 0.533 (about 53% of the variation in this outcome can be explained by historical patterns in our model).

During the zero-fare month, the observed daily NO2 values were on average 0.932 times their forecasted values, or 6.8% lower (95% CI: 0.921, 0.944). This is strong evidence that daily NO2 during August 2022 was lower than would have been expected historically.
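Because the outcome was modeled on the log scale, the effect is reported as a ratio of observed to forecasted values. The arithmetic linking that ratio to a percent change can be sketched as follows; only the three reported numbers come from this post, and the variable names are illustrative.

```r
# Reported ratio of observed to forecasted values and its 95% CI
ratio <- c(estimate = 0.932, lower = 0.921, upper = 0.944)

# Percent change implied by a ratio: (ratio - 1) * 100
pct_change <- (ratio - 1) * 100
round(pct_change, 1)
# estimate: -6.8, lower: -7.9, upper: -5.6

# The same quantities expressed as differences on the log scale
log_diff <- log(ratio)
```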

Ozone

Data visualizations

Here are plots of the historical daily data for ozone in Denver.

Modeling

Modeling of ozone data proceeded similarly, although no outcome transformation was used.

Our model can predict daily ozone to within about 0.005 units, with an out-of-sample R-squared of 0.67 (about 67% of the variation in this outcome can be explained by historical patterns in our model).

The observed daily ozone values differed from their forecasted values by -0.001 parts per million on average during the zero-fare month (95% CI: -0.002, -0.001). This doesn’t show evidence of a change in daily ozone during the month of August 2022 in comparison to what would have been expected historically.

Takeaways

Daily average NO2 in Denver during the zero-fare month was about 7% lower than forecast (p < 0.001)!
No observable change was seen in ozone relative to forecasts.
There is room for more to be done to improve Denver’s air quality.

Limitations

Ozone and NO2 are affected by many things on a daily basis, which were not controlled for in this analysis. A more effective analysis would control for these things, which might improve the precision of the model estimates or better account for the possibility of confounding. Neither outcome is perfectly measured by the AQS stations scattered about the city of Denver; there’s always the possibility that more accurate or more granular data could better show an effect of the zero-fare policy.

Sensitivity of method

If the method we used for determining the effect of intervention were flawed, we might expect to see high rejection rates for any other subset of 31 days. We can check this by reproducing the same method for 31-day chunks of time surrounding August 2022. Below is a table of the estimated effect under the same methodology for every cut point listed, including FDR-adjusted (and nominal) p-values as well as effect size estimates.

cut	estimate	ci_lb	ci_ub	p.value	p.adj
2021-01-01	1.008	0.96	1.06	0.86	0.95
2021-02-01	0.989	0.95	1.03	0.80	0.95
2021-03-01	1.082	1.04	1.13	0.054	0.21
2021-04-01	0.997	0.97	1.02	0.90	0.95
2021-05-01	1.035	1.01	1.06	0.094	0.29
2021-06-01	1.030	1.01	1.05	0.17	0.37
2021-07-01	0.965	0.95	0.98	0.033	0.16
2021-08-01	0.977	0.96	0.99	0.18	0.37
2021-09-01	0.999	0.97	1.03	0.99	0.99
2021-10-01	1.004	0.97	1.04	0.91	0.95
2021-11-01	0.944	0.91	0.98	0.12	0.32
2021-12-01	0.977	0.94	1.02	0.58	0.89
2022-01-01	1.132	1.09	1.18	0.002	0.019
2022-02-01	1.024	0.98	1.07	0.59	0.89
2022-03-01	0.958	0.93	0.99	0.15	0.36
2022-04-01	0.920	0.89	0.95	0.009	0.068
2022-05-01	0.986	0.96	1.01	0.54	0.89
2022-06-01	1.019	0.99	1.04	0.44	0.81
2022-07-01	0.970	0.95	0.99	0.098	0.29
2022-08-01	0.932	0.92	0.94	< 0.001	< 0.001
2022-09-01	0.989	0.97	1.01	0.66	0.93
2022-10-01	0.990	0.96	1.02	0.70	0.93
2022-11-01	1.008	0.97	1.05	0.85	0.95
2022-12-01	1.103	1.05	1.15	0.032	0.16
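The FDR adjustment in the table can be sketched with base R's `p.adjust()`. This assumes the Benjamini-Hochberg procedure, since the post does not state which FDR method was used. The inputs are the rounded nominal p-values as displayed, so results may differ slightly from the `p.adj` column, and the August 2022 entry (reported only as "< 0.001") is replaced by a placeholder of 0.0005.

```r
# Nominal p-values copied from the table above, in row order.
# Entry 20 (2022-08-01) is a placeholder assumption, not the actual value.
p_nominal <- c(
  0.86, 0.80, 0.054, 0.90, 0.094, 0.17, 0.033, 0.18, 0.99, 0.91, 0.12, 0.58,
  0.002, 0.59, 0.15, 0.009, 0.54, 0.44, 0.098, 0.0005, 0.66, 0.70, 0.85, 0.032
)

# Benjamini-Hochberg adjustment controls the false discovery rate
p_adj <- p.adjust(p_nominal, method = "BH")

# Which 31-day windows remain significant at the 5% level after adjustment?
which(p_adj < 0.05)
# → 13 20 (the windows starting 2022-01-01 and 2022-08-01)
```

If the method rejected too readily, many more than two of the 24 windows would survive the adjustment.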

On the horizon

Come August 2023, Denver will roll out the program again and I will revisit this analysis to see whether zero fares produce observably cleaner air throughout the month. Please check out their website to sign up and participate.

Note

Again, code and data for this and other posts are available here. This post was updated on 2/15/2024 to point to the fastTS R package, which is an updated version of srlTS, and again on 6/11/2024 based on a bug fix in fastTS 1.0.0, which strengthened the observed effect on NO2.

Detecting interactions in R

Ryan Peterson — Tue, 20 Jun 2023 00:00:00 GMT

But what about interactions; are any of those significant?

I have heard some variant of this question from clinicians and researchers from many fields of science. While usually asked in earnest, this question is a dangerous one; the sheer number of interactions can greatly inflate the number of false discoveries in the interactions, resulting in difficult-to-interpret models with many unnecessary interactions. Still, there are times when these expeditions are necessary and fruitful. Thankfully, useful tools are now available to help with the process. This article discusses two regularization-based approaches: Group-Lasso INTERaction-NET (glinternet) and the Sparsity-Ranked Lasso (SRL). The glinternet method implements a hierarchy-preserving selection and estimation procedure, while the SRL is a hierarchy-preferring regularization method which operates under ranked sparsity principles (in short, ranked sparsity methods ensure interactions are treated more skeptically than main effects a priori).

Useful package #1: ranked sparsity methods via sparseR

The sparseR package has been designed to make dealing with interactions and polynomials much more analyst-friendly. Building on the recipes package, sparseR has many built-in tools to facilitate the prepping of a model matrix with interactions and polynomials; these features are presented on the package website located at https://petersonr.github.io/sparseR/. The package is available on CRAN and can be installed and loaded with the code below:

install.packages("sparseR")
library(sparseR)

The simplest way to implement the SRL in sparseR is via a single call to the sparseR() function, here demonstrated with Fisher’s iris data set. 10-fold cross-validation is used by default, so we set seed = 1 here for reproducibility.

data(iris)
srl <- sparseR(Sepal.Width ~ ., data = iris, k = 1, seed = 1)
srl


Model summary @ min CV:
-----------------------------------------------------

Using a basic kernel estimate for local fdr; consider installing the ashr package for more accurate estimation.  See ?local_mfdr

  lasso-penalized linear regression with n=150, p=18
  (At lambda=0.0015):
    Nonzero coefficients: 10
    Cross-validation error (deviance): 0.07
    R-squared: 0.62
    Signal-to-noise ratio: 1.64
    Scale estimate (sigma): 0.267

  SR information:
             Vartype Total Selected Saturation Penalty
         Main effect     6        4      0.667    2.45
 Order 1 interaction    12        6      0.500    3.46


Model summary @ CV1se:
-----------------------------------------------------
  lasso-penalized linear regression with n=150, p=18
  (At lambda=0.0070):
    Nonzero coefficients: 7
    Cross-validation error (deviance): 0.08
    R-squared: 0.57
    Signal-to-noise ratio: 1.33
    Scale estimate (sigma): 0.285

  SR information:
             Vartype Total Selected Saturation Penalty
         Main effect     6        3      0.500    2.45
 Order 1 interaction    12        4      0.333    3.46

The summary function produces additional details:

summary(srl, at = "cv1se")

lasso-penalized linear regression with n=150, p=18
At lambda=0.0070:
-------------------------------------------------
  Nonzero coefficients         :   7
  Expected nonzero coefficients:   1.38
  Average mfdr (7 features)    :   0.198

                                Estimate       z     mfdr Selected
Species_setosa                  0.810513 17.9513  < 1e-04        *
Sepal.Length                    0.191210  9.3371  < 1e-04        *
Petal.Length:Petal.Width        0.119640  5.0379  < 1e-04        *
Petal.Width:Species_versicolor  0.275341  3.1640 0.055680        *
Sepal.Length:Petal.Length      -0.052711 -3.2466 0.078121        *
Sepal.Length:Species_setosa     0.062782  2.5978 0.251076        *
Species_versicolor             -0.001653 -0.8052 1.000000        *

We see that two models are displayed by default, corresponding to two “smart” choices for the penalization parameter λ. The first model printed refers to the model where λ is set to minimize the cross-validated error, while the second one refers to a model where λ is set to a value such that the model is as sparse as possible while still being within 1 standard error of the minimum cross-validated error. Visualizations are also available via sparseR that can help visualize both the solution path and the resulting model (interactions can be very challenging to interpret without a good figure!).

plot(srl)

effect_plot(srl, "Petal.Width", by = "Species", at = "cvmin")

effect_plot(srl, "Petal.Width", by = "Species", at = "cv1se")
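To make the two choices of the penalization parameter more concrete, here is a minimal sketch of the minimum-CV rule and the one-standard-error rule (the `cv1se` choice) in base R. The lambda grid, errors, and standard errors are made-up numbers, and this is only an illustration of the general idea, not sparseR's internal code.

```r
# Hypothetical cross-validation results over a decreasing lambda grid
lambda <- c(0.100, 0.050, 0.020, 0.007, 0.003, 0.0015, 0.0005)
cve    <- c(0.150, 0.110, 0.090, 0.078, 0.072, 0.070, 0.071)  # mean CV error
cvse   <- c(0.012, 0.011, 0.010, 0.010, 0.010, 0.010, 0.010)  # its standard error

# Choice 1: lambda minimizing the cross-validated error
i_min <- which.min(cve)
lambda_min <- lambda[i_min]

# Choice 2: largest lambda (sparsest model) whose error is within
# one standard error of the minimum
threshold <- cve[i_min] + cvse[i_min]
lambda_1se <- max(lambda[cve <= threshold])

c(cvmin = lambda_min, cv1se = lambda_1se)
```

A larger lambda means a heavier penalty, which is why the second rule tends to select fewer terms.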

Note that while ranked sparsity principles were motivated by the estimation of the lasso (Peterson & Cavanaugh 2022), they can also be implemented with MCP, SCAD, or elastic net and for binary, normal, and survival data. Finally, sparseR includes some functionality to perform forward-stepwise selection using a sparsity-ranked modification of BIC, as well as post-selection inferential techniques using sample splitting and bootstrapping.
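In the printed model summaries above, the Penalty column matches the square root of the number of candidate terms of each type (the Total column). That is an observation from the numbers shown rather than a full account of how sparseR sets its penalties, but it gives a feel for how interactions end up treated more skeptically than main effects:

```r
# Number of candidate terms of each type, from the "Total" column
n_terms <- c(main_effect = 6, order1_interaction = 12)

# Square-root weights reproduce the printed "Penalty" column
penalty <- sqrt(n_terms)
round(penalty, 2)
# main_effect: 2.45, order1_interaction: 3.46
```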

Useful package #2: hierarchy-preserving regularization via glinternet

Some argue that when it comes to interactions, hierarchy is very important (i.e., an interaction shouldn’t be included in a model without its constituent main effects). While ranked sparsity methods do prefer hierarchical models, they can often still produce non-hierarchical ones. The glinternet package and the function of the same name use regularization for model selection under a hierarchy constraint, such that all candidate models are hierarchical. Glinternet can handle both continuous and categorical predictors, but requires pre-specification of a numeric model matrix. It can be performed as follows:

# install.packages("glinternet")
library(glinternet)
library(dplyr)

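# glinternet expects numeric predictors, with categorical variables
# coded as integers starting at 0
X <- iris %>% 
  select(-Sepal.Width) %>% 
  mutate(Species = as.numeric(Species) - 1)

# numLevels: 1 for each continuous column, number of categories otherwise
set.seed(321)
cv_fit <- glinternet.cv(X, Y = iris$Sepal.Width, numLevels = c(1,1,1,3))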

The cv_fit object contains necessary information from the cross-validation procedure and the fits themselves stored in a series of lists. A more in-depth tutorial on extracting coefficients (and facilitating model interpretation) using the glinternet package can be found at https://strakaps.github.io/post/glinternet/. Importantly, both the glinternet and sparseR methods have associated predict methods which can yield predictions on new (or the training) data, shown below. For comparison, we also fit a “main effects only” model with sparseR by setting k = 0.

me <- sparseR(Sepal.Width ~ ., data = iris, k = 0, seed = 333)
p_me <- predict(me)
p_srl <- predict(srl)
p_gln <- as.vector(predict(cv_fit, X))

With a little help from the yardstick package’s metrics() function, we can compare the accuracy of each model’s predictions using root-mean-squared error (RMSE), R-squared (RSQ), and mean absolute error (MAE); see table below. Evidently, glinternet and SRL are similar in terms of their predictive performance. However, both outperform the main effects model considerably, suggesting interactions among other variables do have signal worth capturing when predicting Sepal.Width.

gln_res <- tibble(p_gln, y = iris$Sepal.Width) %>% 
  yardstick::metrics(y, p_gln) %>% 
  rename("glinternet"= .estimate) 
srl_res <- tibble(p_srl, y = iris$Sepal.Width) %>% 
  yardstick::metrics(y, p_srl) %>% 
  rename("SRL"= .estimate) 
me_res <- tibble(p_me, y = iris$Sepal.Width) %>% 
  yardstick::metrics(y, p_me) %>% 
  rename("Main effects only"= .estimate) 

results_table <- gln_res %>% 
  bind_cols(srl_res[,3]) %>% 
  bind_cols(me_res[,3]) %>% 
  rename("Metric" = .metric) %>% 
  mutate(Metric = toupper(Metric)) %>% 
  select(-.estimator)

Metric	glinternet	SRL	Main effects only
RMSE	0.24	0.24	0.26
RSQ	0.69	0.69	0.63
MAE	0.19	0.19	0.20
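The three metrics in the table can also be computed by hand, which may help clarify what each one measures. The sketch below uses made-up observed values and predictions (not the iris fits above), and takes R-squared to be the squared correlation between observed and predicted values, which is how yardstick's rsq is defined.

```r
# Hypothetical observed values and predictions
y     <- c(3.5, 3.0, 3.2, 3.1, 3.6, 2.8, 2.9, 3.4)
y_hat <- c(3.4, 3.1, 3.1, 3.2, 3.5, 2.9, 3.0, 3.2)

rmse <- sqrt(mean((y - y_hat)^2))  # root-mean-squared error
mae  <- mean(abs(y - y_hat))       # mean absolute error
rsq  <- cor(y, y_hat)^2            # squared correlation

round(c(RMSE = rmse, RSQ = rsq, MAE = mae), 2)
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions.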

Other packages worth mentioning: ncvreg, hierNet, visreg, sjPlot

The SRL and other sparsity-ranked regularization methods implemented in sparseR would not be possible without the ncvreg package, which performs the heavy lifting in terms of model fitting, optimization, and cross-validation. The hierNet package is another hierarchy-enforcing procedure that may yield better models than glinternet; however, the latter is more computationally efficient, especially for situations with a medium-to-large number of covariates. Finally, when interactions or polynomials are included in models, figures are truly worth a thousand words, and packages such as visreg and sjPlot have great functionality for plotting the effects of interactions.

References

Bien J and Tibshirani R (2020). hierNet: A Lasso for Hierarchical Interactions. R package version 1.9. https://CRAN.R-project.org/package=hierNet
Breheny P and Burchett W (2017). Visualization of Regression Models Using visreg. The R Journal, 9: 56-71.
Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.
Kuhn M and Vaughan D (2021). yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. https://CRAN.R-project.org/package=yardstick
Lim M and Hastie T (2020). glinternet: Learning Interactions via Hierarchical Group-Lasso Regularization. R package version 1.0.11. https://CRAN.R-project.org/package=glinternet
Lüdecke D (2021). sjPlot: Data Visualization for Statistics in Social Science. R package version 2.8.8. https://CRAN.R-project.org/package=sjPlot
Peterson R (2021). sparseR: Variable selection under ranked sparsity principles for interactions and polynomials. https://github.com/petersonR/sparseR/.
Peterson R and Cavanaugh J (2022). Ranked sparsity: a cogent regularization framework for selecting and estimating feature interactions and polynomials. AStA Adv Stat Anal, 106: 427-454. https://doi.org/10.1007/s10182-021-00431-7

Note

This post was originally published in the Biometric Bulletin (2021) Volume 38 Issue 3.

Welcome to Data Diction

Ryan Peterson — Wed, 01 Jun 2022 00:00:00 GMT

Data Diction

Feed your data addiction.

Data: things known or assumed as facts, making the basis of reasoning or calculation
Diction: 1) the choice and use of words and phrases in speech or writing. 2) the choice of words especially with regard to correctness, clearness, or effectiveness.

If you believe data can and should be used in all facets of life, this blog is for you. Its goal is to describe interesting studies, questions, and stories in terms of the data involved.

In addition to the play on “Data Addiction”, Data Diction is also a play on the very commonly used term of “Data dictionary”, a term with which statistical practitioners should be familiar.

Objectives

My initial goal is to post ~1 piece every other month. Most posts will involve data analysis, and all code used to pull and analyze data will be made available upon request, if not included in the post already.
Damn it, Jim, I am a statistician, not a scientist. If you are an expert on a topic that you believe I’ve butchered, please leave a constructive comment with any corrections or caveats that should be made.
This is a blog… not a peer-reviewed scientific journal. All content, though hopefully based on data in well-described and well-dictated ways, should be regarded as having come from a blog.

Logo

I created the logo using ozone data from Denver that I compiled for the post about Denver’s 2022 “Zero Fare for Cleaner Air”.
