Image credit: National Renewable Energy Laboratory, Colorado State University

Most summers, Coloradans flock to the majestic Rocky Mountains for beautiful hikes and other mountain activities. This is the case, at least, unless poor air quality forces them indoors. For me, that happened on a smoky July day in 2020, when a surreal "snow" of ash sprinkling down from nearby wildfires forced us to evacuate the pickleball courts.

Between wildfires and pollution, Denver’s summer air often leaves room for improvement. Sadly, the Rockies don’t seem so enticing when they are obscured behind a polluted haze.

In August 2022, I noticed my RTD bus was more crowded than usual, and I was not asked to scan my bus pass. This is how I learned of Denver's 2022 "Zero fare for better air" initiative. Throughout the month, my consistently packed bus led me to believe the policy was indeed increasing ridership.

My impression was validated by RTD: according to the final RTD report, ridership did increase 22% during the free-fare month, and was up 36% from the prior August. This increase led some to declare the campaign a huge success, and led to the program's expansion in 2023.

But wait… the campaign is called "Zero fare for better air". So for this to really be a success, the policy change should be measurable in better air quality, not just higher ridership. On this point, the report concluded that "impacts to air quality are difficult to quantify", attributing the difficulty to the lack of a baseline. So we're left wondering: did it work? Did we actually have cleaner air in August of 2022?

Recently, my team investigated how the Covid-19 pandemic affected congestion and air quality in cities across the US (we found that it did). In this post I use similar outcomes and methods to determine the impact of this policy in Denver.

There are plenty of important pollutants to worry about in our air, but automobile traffic is an especially large contributor to nitrogen dioxide (NO2) and ozone (O3). We'll consider each of these using data from the EPA's Air Quality System.

Note

Code and data for this and other blog posts are available here.

Here are plots of the historical daily data for NO2 in Denver.

We can use this historical data to build a forecast of what August's NO2 levels would have been, using the forecasting methodology in the `fastTS` R package, which can handle this kind of seasonal data. The series is log-transformed (after adding 10) prior to modeling. We include weekday and month indicator variables and a natural cubic spline basis for time. We computed 30-day-ahead predictions and tested whether these predictions differed significantly from the observed daily values during the zero-fare period. Since some months were easier to forecast than others (August was easier than the winter months), we also used heteroskedasticity-corrected standard errors. To evaluate the model, a 10% test set was held out.
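To make the modeling setup concrete, here is a minimal sketch with the same ingredients (weekday and month indicators, a natural cubic spline for time, and the log(+10) transform), using simulated data and base R's `lm()` rather than the actual `fastTS` fit; all names and numbers here are hypothetical.

```
# Simulated sketch of the covariate setup described above
library(splines)

set.seed(1)
n     <- 730                                  # two years of daily data
dates <- seq(as.Date("2021-01-01"), by = "day", length.out = n)
no2   <- 20 + 5 * sin(2 * pi * as.numeric(dates) / 365) + rnorm(n, sd = 2)

df <- data.frame(
  y     = log(no2 + 10),                      # log(+10) transform
  wday  = factor(weekdays(dates)),            # weekday indicators
  month = factor(months(dates)),              # month indicators
  t     = as.numeric(dates - min(dates))
)

# Weekday/month indicators plus a natural cubic spline for the time trend
fit <- lm(y ~ wday + month + ns(t, df = 4), data = df)

head(predict(fit))
```

The actual analysis used `fastTS`, which also models autoregressive structure in the series; this sketch only mirrors the covariate setup described above.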

Our model can predict daily NO2 on the log scale to within about 0.175 units, with an out-of-sample R² of 0.533 (about 53% of the variation in this outcome can be explained by historical patterns in our model).

The observed daily NO2 values were on average lower by a factor of 0.932, or 6.8%, during the zero-fare month compared to their forecasted values (95% CI: 0.921 to 0.944). This is strong evidence that daily NO2 was lower during August 2022 than would have been expected historically.
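As an illustration of how a multiplicative effect like this arises from a log-scale analysis, here is a simulated sketch (hypothetical numbers throughout; the real analysis compared `fastTS` forecasts to observed values with heteroskedasticity-corrected standard errors, while this uses a plain one-sample t-test on the log scale):

```
# A mean difference on the log scale back-transforms to a
# multiplicative factor on the original scale.
set.seed(42)
n        <- 200
forecast <- exp(rnorm(n, mean = 3, sd = 0.3))           # hypothetical forecasted NO2
observed <- forecast * 0.93 * exp(rnorm(n, sd = 0.05))  # true factor of 0.93

log_diff <- log(observed) - log(forecast)
tt       <- t.test(log_diff)

# Back-transform the mean log-difference (and its CI) to a factor
exp(tt$estimate)   # close to the true factor of 0.93, i.e. about 7% lower
exp(tt$conf.int)
```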

Here are plots of the historical daily data for ozone in Denver.

Modeling of ozone data proceeded similarly, although no outcome transformation was used.

Our model can predict daily ozone to within about 0.005 units, with an out-of-sample R² of 0.67 (about 67% of the variation in this outcome can be explained by historical patterns in our model).

The observed daily ozone values were on average 0.001 parts per million lower during the zero-fare month compared to their forecasted values (95% CI: 0.001 to 0.002 ppm lower). While this difference is statistically distinguishable from zero, it is too small to be practically meaningful evidence of a change in daily ozone during August 2022 relative to what would have been expected historically.

- Daily average NO2 in Denver during the zero-fare month was about 7% lower than forecast (p < 0.001)!
- No practically meaningful change was seen in ozone relative to forecasts.
- There is room for more to be done to improve Denver's air quality.

Ozone and NO2 are affected by many daily factors that were not controlled for in this analysis. A more thorough analysis would control for them, which might improve the precision of the model estimates or better account for possible confounding. Neither outcome is perfectly measured by the AQS stations scattered around the city of Denver; it is always possible that more accurate or more granular data could better reveal an effect of the zero-fare policy.

If the method we used for determining the effect of the intervention were flawed, we might expect to see high rejection rates for other 31-day subsets as well. We can check this by applying the same method to 31-day chunks of time surrounding August 2022. Below is a table of the estimated effect under the same methodology for every cut point listed, including effect size estimates with FDR-adjusted (and nominal) p-values.
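The FDR adjustment here is base R's `p.adjust()` with the Benjamini-Hochberg method; for example, on a small illustrative vector of nominal p-values:

```
# FDR (Benjamini-Hochberg) adjustment of nominal p-values
p_raw <- c(0.001, 0.033, 0.86)
p_adj <- p.adjust(p_raw, method = "fdr")
p_adj
# 0.0030 0.0495 0.8600
```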

cut date | estimate | CI lower | CI upper | p-value | FDR-adj. p
---|---|---|---|---|---
2021-01-01 | 1.008 | 0.96 | 1.06 | 0.86 | 0.95
2021-02-01 | 0.989 | 0.95 | 1.03 | 0.80 | 0.95
2021-03-01 | 1.082 | 1.04 | 1.13 | 0.054 | 0.21
2021-04-01 | 0.997 | 0.97 | 1.02 | 0.90 | 0.95
2021-05-01 | 1.035 | 1.01 | 1.06 | 0.094 | 0.29
2021-06-01 | 1.030 | 1.01 | 1.05 | 0.17 | 0.37
2021-07-01 | 0.965 | 0.95 | 0.98 | 0.033 | 0.16
2021-08-01 | 0.977 | 0.96 | 0.99 | 0.18 | 0.37
2021-09-01 | 0.999 | 0.97 | 1.03 | 0.99 | 0.99
2021-10-01 | 1.004 | 0.97 | 1.04 | 0.91 | 0.95
2021-11-01 | 0.944 | 0.91 | 0.98 | 0.12 | 0.32
2021-12-01 | 0.977 | 0.94 | 1.02 | 0.58 | 0.89
2022-01-01 | 1.132 | 1.09 | 1.18 | 0.002 | 0.019
2022-02-01 | 1.024 | 0.98 | 1.07 | 0.59 | 0.89
2022-03-01 | 0.958 | 0.93 | 0.99 | 0.15 | 0.36
2022-04-01 | 0.920 | 0.89 | 0.95 | 0.009 | 0.068
2022-05-01 | 0.986 | 0.96 | 1.01 | 0.54 | 0.89
2022-06-01 | 1.019 | 0.99 | 1.04 | 0.44 | 0.81
2022-07-01 | 0.970 | 0.95 | 0.99 | 0.098 | 0.29
2022-08-01 | 0.932 | 0.92 | 0.94 | < 0.001 | < 0.001
2022-09-01 | 0.989 | 0.97 | 1.01 | 0.66 | 0.93
2022-10-01 | 0.990 | 0.96 | 1.02 | 0.70 | 0.93
2022-11-01 | 1.008 | 0.97 | 1.05 | 0.85 | 0.95
2022-12-01 | 1.103 | 1.05 | 1.15 | 0.032 | 0.16

Come August 2023, Denver will roll out the program again and I will revisit this analysis to see whether zero fares produce *observably* cleaner air throughout the month. Please check out their website to sign up and participate.

Note

Again, code and data for this and other posts are available here. This post was updated on 2/15/2024 to point to the `fastTS` R package (an updated version of `srlTS`), and again on 6/11/2024 based on a bug fix in `fastTS` 1.0.0, which strengthened the observed effect for NO2.

I have heard some variant of this question from clinicians and researchers in many fields of science. While usually asked in earnest, **this question is a dangerous one**; the sheer number of candidate interactions can greatly inflate the number of false discoveries, resulting in difficult-to-interpret models with many unnecessary interactions. Still, there are times when these expeditions are necessary and fruitful. Thankfully, useful tools are now available to help with the process. This article discusses two regularization-based approaches: Group-Lasso INTERaction-NET (glinternet) and the Sparsity-Ranked Lasso (SRL). The glinternet method implements a hierarchy-preserving selection and estimation procedure, while the SRL is a hierarchy-preferring regularization method which operates under ranked sparsity principles (in short, ranked sparsity methods ensure interactions are treated more skeptically than main effects *a priori*).

The **sparseR** package has been designed to make dealing with interactions and polynomials much more analyst-friendly. Building on the **recipes** package, **sparseR** has many built-in tools to facilitate prepping a model matrix with interactions and polynomials; these features are presented on the package website at https://petersonr.github.io/sparseR/. The package is available on CRAN and can be installed and loaded with the code below:

```
install.packages("sparseR")
library(sparseR)
```

The simplest way to implement the SRL in **sparseR** is via a single call to the `sparseR()` function, here demonstrated with Fisher's `iris` data set. 10-fold cross-validation is used by default, so we set `seed = 1` for reproducibility.

```
data(iris)
srl <- sparseR(Sepal.Width ~ ., data = iris, k = 1, seed = 1)
srl
```

```
Model summary @ min CV:
-----------------------------------------------------
Using a basic kernel estimate for local fdr; consider installing the
ashr package for more accurate estimation. See ?local_mfdr

lasso-penalized linear regression with n=150, p=18
(At lambda=0.0015):
  Nonzero coefficients: 10
  Cross-validation error (deviance): 0.07
  R-squared: 0.62
  Signal-to-noise ratio: 1.64
  Scale estimate (sigma): 0.267

SR information:
              Vartype Total Selected Saturation Penalty
          Main effect     6        4      0.667    2.45
  Order 1 interaction    12        6      0.500    3.46

Model summary @ CV1se:
-----------------------------------------------------
lasso-penalized linear regression with n=150, p=18
(At lambda=0.0070):
  Nonzero coefficients: 7
  Cross-validation error (deviance): 0.08
  R-squared: 0.57
  Signal-to-noise ratio: 1.33
  Scale estimate (sigma): 0.285

SR information:
              Vartype Total Selected Saturation Penalty
          Main effect     6        3      0.500    2.45
  Order 1 interaction    12        4      0.333    3.46
```

The `summary` function produces additional details:

`summary(srl, at = "cv1se")`

```
lasso-penalized linear regression with n=150, p=18
At lambda=0.0070:
-------------------------------------------------
  Nonzero coefficients         : 7
  Expected nonzero coefficients: 1.38
  Average mfdr (7 features)    : 0.198

                                Estimate        z     mfdr Selected
Species_setosa                  0.810513  17.9513  < 1e-04        *
Sepal.Length                    0.191210   9.3371  < 1e-04        *
Petal.Length:Petal.Width        0.119640   5.0379  < 1e-04        *
Petal.Width:Species_versicolor  0.275341   3.1640 0.055680        *
Sepal.Length:Petal.Length      -0.052711  -3.2466 0.078121        *
Sepal.Length:Species_setosa     0.062782   2.5978 0.251076        *
Species_versicolor             -0.001653  -0.8052 1.000000        *
```

We see that two models are displayed by default, corresponding to two "smart" choices for the penalization parameter λ. The first model printed refers to the fit where λ is set to minimize the cross-validated error, while the second refers to the fit where λ is set to a value such that the model is as sparse as possible while still being within 1 standard error of the minimum cross-validated error. Visualization functions are also available in **sparseR** to display both the solution path and the resulting model (interactions can be very challenging to interpret without a good figure!)

`plot(srl)`

`effect_plot(srl, "Petal.Width", by = "Species", at = "cvmin")`

`effect_plot(srl, "Petal.Width", by = "Species", at = "cv1se")`

Note that while ranked sparsity principles were originally motivated in the context of the lasso (Peterson & Cavanaugh 2022), they can also be implemented with MCP, SCAD, or the elastic net, and for binary, normal, and survival outcomes. Finally, **sparseR** includes functionality to perform forward stepwise selection using a sparsity-ranked modification of BIC, as well as post-selection inference techniques based on sample splitting and bootstrapping.

Some argue that when it comes to interactions, hierarchy is very important (i.e., an interaction shouldn't be included in a model without its constituent main effects). While ranked sparsity methods do *prefer* hierarchical models, they can often still produce non-hierarchical ones. The **glinternet** package (and the function of the same name) uses regularization for model selection under a hierarchy constraint, such that all candidate models are hierarchical. **glinternet** can handle both continuous and categorical predictors, but requires pre-specifying a numeric model matrix. It can be run as follows:

```
# install.packages("glinternet")
library(glinternet)
library(dplyr)

X <- iris %>%
  select(-Sepal.Width) %>%
  mutate(Species = as.numeric(Species) - 1)

set.seed(321)
cv_fit <- glinternet.cv(X, Y = iris$Sepal.Width, numLevels = c(1, 1, 1, 3))
```

The `cv_fit` object contains the necessary information from the cross-validation procedure, with the fits themselves stored in a series of lists. A more in-depth tutorial on extracting coefficients (and facilitating model interpretation) with the **glinternet** package can be found at https://strakaps.github.io/post/glinternet/. Importantly, both **glinternet** and **sparseR** have associated `predict` methods which can yield predictions on new (or the training) data, shown below. For comparison, we also fit a "main effects only" model with **sparseR** by setting `k = 0`.

```
me <- sparseR(Sepal.Width ~ ., data = iris, k = 0, seed = 333)
p_me <- predict(me)
p_srl <- predict(srl)
p_gln <- as.vector(predict(cv_fit, X))
```

With a little help from the **yardstick** package's `metrics()` function, we can compare the accuracy of each model's predictions using root-mean-squared error (RMSE), R-squared (RSQ), and mean absolute error (MAE); see the table below. Evidently, **glinternet** and the SRL are similar in terms of predictive performance. However, both considerably outperform the main-effects model, suggesting interactions among the other variables do carry signal worth capturing when predicting `Sepal.Width`.

```
gln_res <- tibble(p_gln, y = iris$Sepal.Width) %>%
  yardstick::metrics(y, p_gln) %>%
  rename("glinternet" = .estimate)

srl_res <- tibble(p_srl, y = iris$Sepal.Width) %>%
  yardstick::metrics(y, p_srl) %>%
  rename("SRL" = .estimate)

me_res <- tibble(p_me, y = iris$Sepal.Width) %>%
  yardstick::metrics(y, p_me) %>%
  rename("Main effects only" = .estimate)

results_table <- gln_res %>%
  bind_cols(srl_res[, 3]) %>%
  bind_cols(me_res[, 3]) %>%
  rename("Metric" = .metric) %>%
  mutate(Metric = toupper(Metric)) %>%
  select(-.estimator)
```

Metric | glinternet | SRL | Main effects only
---|---|---|---
RMSE | 0.24 | 0.24 | 0.26
RSQ | 0.69 | 0.69 | 0.63
MAE | 0.19 | 0.19 | 0.20

The SRL and other sparsity-ranked regularization methods implemented in **sparseR** would not be possible without the **ncvreg** package, which performs the heavy lifting in terms of model fitting, optimization, and cross-validation. The **hierNet** package provides another hierarchy-enforcing procedure that may yield better models than **glinternet**; however, the latter is more computationally efficient, especially with a medium-to-large number of covariates. Finally, when interactions or polynomials are included in models, figures are truly worth a thousand words, and packages such as **visreg** and **sjPlot** offer great functionality for plotting interaction effects.
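As a quick sketch of the kind of interaction plot these packages produce, here is a minimal **visreg** example built on a plain `lm()` fit (a hypothetical model chosen for illustration, not one of the fits above):

```
library(visreg)

# An ordinary linear model with an explicit interaction term
fit <- lm(Sepal.Width ~ Sepal.Length + Petal.Width * Species, data = iris)

# Plot the Petal.Width effect separately by Species, overlaid on one panel
visreg(fit, "Petal.Width", by = "Species", overlay = TRUE)
```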

- Bien J and Tibshirani R (2020). hierNet: A Lasso for Hierarchical Interactions. R package version 1.9. https://CRAN.R-project.org/package=hierNet
- Breheny P and Burchett W (2017). Visualization of Regression Models Using visreg. The R Journal, 9: 56-71.
- Breheny P and Huang J (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.
- Kuhn M and Vaughan D (2021). yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. https://CRAN.R-project.org/package=yardstick
- Lim M and Hastie T (2020). glinternet: Learning Interactions via Hierarchical Group-Lasso Regularization. R package version 1.0.11. https://CRAN.R-project.org/package=glinternet
- Lüdecke D (2021). sjPlot: Data Visualization for Statistics in Social Science. R package version 2.8.8. https://CRAN.R-project.org/package=sjPlot
- Peterson R (2021). sparseR: Variable selection under ranked sparsity principles for interactions and polynomials. https://github.com/petersonR/sparseR/.
- Peterson, R, Cavanaugh, J. Ranked sparsity: a cogent regularization framework for selecting and estimating feature interactions and polynomials. AStA Adv Stat Anal 106, 427–454 (2022). https://doi.org/10.1007/s10182-021-00431-7

Note

This post was originally published in the Biometric Bulletin (2021) Volume 38 Issue 3.

*Feed your data addiction.*

**Data**: things known or assumed as facts, making the basis of reasoning or calculation.

**Diction**: 1) the choice and use of words and phrases in speech or writing. 2) the choice of words especially with regard to correctness, clearness, or effectiveness.

If you believe data can and should be used in all facets of life, this blog is for you. Its goal is to describe interesting studies, questions, and stories in terms of the data involved.

In addition to the play on "Data Addiction", Data Diction is also a play on the commonly used term "data dictionary", with which statistical practitioners should be familiar.

- My initial goal is to post ~1 piece every other month. Most posts will involve data analysis, and all code used to pull and analyze data will be made available upon request, if not included in the post already.
- *Damn it, Jim, I am a statistician, not a scientist.* If you are an expert on a topic that you believe I've butchered, please leave a constructive comment with any corrections or caveats that should be made.
- This is a blog… not a peer-reviewed scientific journal. All content, though hopefully based on data in well-described and well-dictated ways, should be regarded as having come from a blog.

I created the logo using ozone data from Denver that I compiled for the post about Denver's 2022 "Zero fare for better air" policy.

```
library(tidyverse)

df_ozone <- read_csv(here::here("posts/did-denver-zero-fare-policy-work/ozone_data-91-23.csv"))
df_ozone$daily_avg <- imputeTS::na_kalman(df_ozone$daily_avg)

ggplot(df_ozone, aes(x = date_local, y = daily_avg)) +
  geom_line(col = "grey", alpha = .85) +
  stat_smooth(fill = "lightblue4", level = .999999) +
  ylab("Daily average Ozone in Denver") +
  xlab("")
```