Before you embark on your “career” as a statistician, you must purge yourself of a childish misconception: that our job is to seek truth. Truth is stubborn, unpredictable, and worst of all, often unpublishable. Scientists crave confidence, the journals crave significance, and we, if we are clever, can provide both without the nuisance of real rigor.
In this post series, I will instruct you in the Statistical Dark Arts. Today, I’ll describe how to ensure a publishable result every time with model selection and unadjusted post-selection inferences (UPSIs).
Why UPSIs?
You will frequently come across scientists with ambitious ideas, saying things like:
I’ve conceived a brilliant new way to treat chronic pain effectively. I don’t want to waste time, effort, and money on a result that ends up being insignificant… How can I guarantee a significant result??
No statistician wants to contribute to a null finding. Interpreting those is too hard! I’ll let you in on a statistical secret: there IS a way to guarantee statistical significance – and it’s actually extremely common in science.
It’s called unadjusted post-selection inference (UPSI).
With good model selection tools and UPSIs in your toolbelt, you can turn any negative study into a positive one – guaranteed.
Example: a study for treatment of chronic pain
Say the researcher who approached you is planning a cross-over study in which patients receive each of two treatments for chronic pain. Patients will record their overall pain each day for 1 week while undergoing treatment A, and for 1 week while undergoing treatment B.
Investigators are hoping to determine the difference between treatments in the average pain score. Our outcome \(Y_i\) is thus a continuous measure of pain reduction: for each subject, this value is the difference in that subject’s weekly average pain score between the two candidate treatments.
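In symbols, under one convention (the sign direction isn’t pinned down here, so take it as an assumption): \(Y_i = \bar{P}^{A}_i - \bar{P}^{B}_i\), where \(\bar{P}^{A}_i\) and \(\bar{P}^{B}_i\) are subject \(i\)’s average daily pain scores over the week on treatment A and the week on treatment B, respectively.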
In addition to determining whether the treatment works on average in their population, investigators wish to determine specific subgroups that might see the most benefit. They are primarily interested in subgroups by age and sex, as well as patient self-reported data such as alcohol consumption and physical activity.
Note
Make sure they have an exhaustive list of candidate effect modifiers; I’d even suggest a few new ones like left/right-handedness and coffee consumption.
Here’s a way to list all the possible subgroups of this set of variables:
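(The original listing isn’t shown here, so here is a minimal sketch of one way to build it. The factor levels are read off the data printed further below; the youngest age bracket, “18-35”, is a guess, since the reference level never appears in the output.)

# Every combination of the candidate effect modifiers (levels are my assumptions)
subgroups <- expand.grid(
  age               = factor(c("18-35", "36-50", "51-65", "66-80", "80+")),
  sex               = factor(c("Male", "Female"), levels = c("Male", "Female")),  # Male as reference
  hand              = factor(c("Left", "Right")),
  coffee            = factor(c("0", "1", "2+")),
  alcohol           = factor(c("0", "1", "2", "3+")),
  physical_activity = factor(c("0", "1", "2", "3+"))
)
nrow(subgroups)  # one row per subgroup – nearly a thousand combinations with these levels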
This is a LOT of subgroups! And it’s way too many subgroups to possibly sift through, at least without tipping off the statistical reviewers about multiplicity.
Caution: Multiplicity
Gone, alas, are the carefree days when we could run tons of tests, report the precious few that were significant, and quietly ignore the rest. Some meddlesome truth-seekers eventually caught on and began scolding us for our ingenuity, slapping the sinister label multiplicity onto our beloved golden-egg-laying goose. Now the reputable journals know to look for it. Our approach must become more subtle.
Here’s the key: turn to model selection. Write in the analysis plan:
We will use forward selection to determine which variables or interactions improve our model’s predictions, adding each candidate predictor in a stepwise fashion until the AIC indicates the model’s predictions can no longer be improved.
Sounds rigorous and honest, right?
This simple trick ensures your study will find a statistically significant result by the end, even when no true effect exists!
How about that?! ZERO RISK! (well, to us anyway).
As a bonus, the more interactions we include that are truly null, the more unnecessarily opaque our model becomes – talk about a win-win.
Virtual Trial 1
Let me illustrate via a simulation. Let’s virtually collect \(n = 100\) patients in our trial.
library(tidyverse)

set.seed(21)
n <- 100  # Sample size

# Simulate recruitment (y: outcome)
y <- rnorm(n)

# Simulate recruitment (assume each new person has random subgroup)
x_idx <- sample(1:nrow(subgroups), n, replace = TRUE)
X <- subgroups[x_idx, ]

simdata <- tibble(y = y, X)
simdata
# A tibble: 100 × 7
          y age   sex    hand  coffee alcohol physical_activity
      <dbl> <fct> <fct>  <fct> <fct>  <fct>   <fct>
 1  0.793   36-50 Female Left  0      2       1
 2  0.522   80+   Female Left  2+     1       3+
 3  1.75    36-50 Female Left  2+     1       0
 4 -1.27    66-80 Female Right 0      3+      0
 5  2.20    51-65 Male   Right 2+     0       2
 6  0.433   36-50 Male   Left  0      2       1
 7 -1.57    66-80 Male   Right 0      2       0
 8 -0.935   80+   Female Left  1      3+      3+
 9  0.0635  36-50 Female Left  2+     0       2
10 -0.00239 66-80 Male   Left  0      2       0
# ℹ 90 more rows
We need to create interactions to select from:
# Create model matrix w/ all interactions
X2 <- model.matrix(y ~ . * ., data = simdata)[, -1]

# Sanity check: make sure these columns represent main effects + interactions
sample(colnames(X2), 10)
# Combine into data set for model fitting
simdata2 <- data.frame(y = y, X2)
OK! We have 93 candidate predictors for model selection. Don’t worry, unless our paper’s reviewers are extremely thorough, we don’t have to report this number. Our final model will have many fewer.
Model selection is a simple task. The following code uses forward stepwise selection with AIC to build an optimally predicting model, per our analysis plan.
library(selectInferToolkit)

fit_aic <- select_stepwise_ic(y ~ ., data = simdata2, direction = "forward")
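(If you’d rather not rely on selectInferToolkit for the selection step, base R’s step() performs the same kind of forward AIC search – a rough sketch, and the model it lands on may differ in its details:)

# Base-R equivalent: forward stepwise search minimizing AIC (k = 2, the default penalty)
null_fit <- lm(y ~ 1, data = simdata2)   # start from the intercept-only model
full_fit <- lm(y ~ ., data = simdata2)   # defines the scope of candidate terms
fit_aic_base <- step(null_fit, scope = formula(full_fit),
                     direction = "forward", trace = 0)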
Caution
While other packages provide p-values after selection, most do so without being transparent, making it difficult to ensure they are truly UPSIs. If you aren’t careful with these less transparent tools, you might report p-values that are somehow diabolically adjusted towards insignificance. In contrast, the selectInferToolkit package, available on GitHub, helps you explicitly use UPSIs for inference:
infer_upsi(fit_aic, data = simdata2) %>%
  tidy() %>%
  filter(coef != 0)
term                             coef  ci_low  ci_high  p_value
(Intercept)                      0.07   -0.07     0.21    0.315
age36.50                         0.17   -0.06     0.41    0.156
age66.80                        -0.35   -0.55    -0.15   <0.001
handRight                        0.23   -0.02     0.47    0.076
age80..sexFemale                -0.29   -0.51    -0.07    0.013
age36.50.handRight              -0.21   -0.43     0.01    0.071
age80..handRight                -0.22   -0.44     0.00    0.053
age66.80.coffee1                 0.45    0.19     0.71   <0.001
age36.50.coffee2.                0.19   -0.01     0.40    0.070
age51.65.coffee2.                0.33    0.16     0.49   <0.001
age36.50.alcohol1               -0.13   -0.31     0.06    0.178
age66.80.alcohol1               -0.26   -0.50    -0.01    0.042
age66.80.alcohol2               -0.19   -0.40     0.02    0.081
age51.65.physical_activity1     -0.33   -0.50    -0.17   <0.001
age51.65.physical_activity2     -0.18   -0.35    -0.01    0.039
sexFemale.alcohol1               0.33    0.15     0.50   <0.001
sexFemale.alcohol2               0.19    0.02     0.37    0.036
sexFemale.alcohol3.             -0.17   -0.37     0.03    0.091
sexFemale.physical_activity2    -0.27   -0.44    -0.10    0.003
handRight.alcohol3.             -0.16   -0.36     0.04    0.123
handRight.physical_activity3.   -0.29   -0.48    -0.10    0.004
coffee2..alcohol3.               0.12   -0.05     0.29    0.172
coffee1.physical_activity1       0.17    0.01     0.33    0.040
alcohol3..physical_activity3.    0.25    0.08     0.42    0.006
And there you have it!
For many subgroups, treatment A worked great (increased pain scores, significant positive effects). Treatment A didn’t work so well for others (with negative significant effects), for whom it actually appears to decrease pain scores. These results indicate we could stand to maximize pain by giving certain patients A, and other patients B. I wouldn’t fret too much about how we included interactions without their constituent main effects; leave that subtlety to the media to figure out.
Caution
Sometimes, researchers want to actually decrease pain scores; you might want to check on this.
What’s with the high p-values?
I know what you’re thinking – won’t it be confusing to include all those results with high p-values?
An expert tip: if you want fewer discoveries with even lower p-values, simply use BIC instead of AIC. BIC penalizes each added term more heavily, so only the strongest-looking effects make it into the model – and you can say BIC is asymptotically consistent, which sounds every bit as rigorous as AIC being asymptotically efficient.
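(In base R, by the way, this is the same step() search with the penalty swapped from k = 2 to k = log(n) – a sketch, not necessarily what select_stepwise_ic does under the hood.)

# Same forward search as before, but with the BIC penalty k = log(n)
fit_bic_base <- step(lm(y ~ 1, data = simdata2),
                     scope = formula(lm(y ~ ., data = simdata2)),
                     direction = "forward", k = log(n), trace = 0)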
In this example, here’s the model selected via stepwise BIC:
fit_bic <- select_stepwise_ic(y ~ ., data = simdata2, direction = "forward", penalty = "BIC")

infer_upsi(fit_bic, data = simdata2) %>%
  tidy() %>%
  filter(coef != 0)
term                           coef  ci_low  ci_high  p_value
(Intercept)                    0.07   -0.09     0.24    0.393
age36.50                       0.27    0.10     0.45    0.003
age66.80                      -0.38   -0.57    -0.18   <0.001
age66.80.coffee1               0.23    0.04     0.42    0.019
age51.65.coffee2.              0.29    0.11     0.46    0.002
age51.65.physical_activity1   -0.24   -0.42    -0.07    0.007
sexFemale.physical_activity2  -0.32   -0.49    -0.15   <0.001
Voila – the headlines practically write themselves!
How UPSIs work so well
Let me let you in on a little secret. If you were paying attention, you’d have seen that the outcomes in the preceding example, \(Y_i\), were generated completely randomly. There was, by design, no true relationship to discover at all. Yet I was virtually guaranteed to find a significant result. This is due to multiplicity’s clever younger brother, selective inference.
Selective inference is a difficult problem to solve; in fact, some call it impossible. Others hold out hope. At any rate, UPSIs have inertia and incentives on their side.
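To see how little machinery is involved, here is what the unadjusted inference amounts to in plain R – a sketch using the terms the BIC search selected in Virtual Trial 1, and assuming (as the name suggests) that infer_upsi does essentially nothing more than this:

# UPSI, spelled out: refit the selected columns with ordinary lm() and report
# the usual p-values and confidence intervals, as if the model had been
# specified before ever looking at the data.
naive_fit <- lm(y ~ age36.50 + age66.80 + age66.80.coffee1 + age51.65.coffee2. +
                  age51.65.physical_activity1 + sexFemale.physical_activity2,
                data = simdata2)
summary(naive_fit)$coefficients  # unadjusted estimates and p-values
confint(naive_fit)               # unadjusted 95% confidence intervals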
Later posts on this blog will go into specific solutions to the selective inference problem and how to implement them in R. Fair warning, though: these solutions are often less promising and will certainly be less significant.
For now, let me finish convincing you that UPSIs really do resolve the “publish or perish” dilemma; with UPSIs, we can publish and thrive! It should suffice to repeat the simulation, as though it were performed in a parallel universe.
Virtual Trial 2
simdata2$y <- rnorm(n)

fit_bic <- select_stepwise_ic(y ~ ., data = simdata2, direction = "forward", penalty = "BIC")

infer_upsi(fit_bic, data = simdata2) %>%
  tidy() %>%
  filter(coef != 0)
term                        coef  ci_low  ci_high  p_value
(Intercept)                 0.03   -0.14     0.21    0.715
age66.80.sexFemale         -0.25   -0.43    -0.07    0.008
age80..coffee2.             0.30    0.11     0.50    0.003
age80..physical_activity1  -0.26   -0.46    -0.07    0.009
Indeed, we still discover “significant” relationships – and they involve completely different treatment effect modifiers. In this parallel universe, our scientist friends publish completely different effects and will no doubt pour their hard-won resources into what we know full well are wild goose chases.
This is not a fluke.
In case you are still skeptical, let’s repeat this 50 times, as though we ran the study in 50 parallel universes.
sim_results <- list()

for (s in 1:50) {
  simdata2$y <- rnorm(n)
  fit_bic <- select_stepwise_ic(y ~ ., data = simdata2, direction = "forward", penalty = "BIC")
  sim_results[[s]] <- infer_upsi(fit_bic, data = simdata2) %>%
    tidy() %>%
    filter(coef != 0)
}

all_selections <- bind_rows(sim_results, .id = "simulation")
Using BIC, we found at least one significant effect in 46 of 50 trials, and on average these trials found 2.6 significant effects. We achieved falsely significant results 92% of the time.
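(If you’d like to reproduce that tally from all_selections, a count along these lines works – a sketch, assuming tidy() returns a numeric p_value column, that “significant” means p < 0.05, and that the intercept doesn’t count as a discovery:)

# Tally the "discoveries" (p < 0.05, intercept excluded) across the 50 universes
all_selections %>%
  filter(term != "(Intercept)") %>%
  group_by(simulation) %>%
  summarize(n_signif = sum(p_value < 0.05)) %>%
  summarize(
    trials_with_a_discovery = sum(n_signif > 0),         # compare with the 46 of 50 quoted above
    mean_discoveries = mean(n_signif[n_signif > 0])      # compare with the 2.6 quoted above
  )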
If you’re thinking “hmm, my study still might have a chance of failing”, well, good news: AIC finds a significant result nearly 100% of the time. And we only considered pairwise interactions among 6 candidate predictors. To minimize the risk of a negative study, one could consider higher-order interactions or a greater number of subgroups, and a positive result quickly becomes a sure thing.
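(Expanding the candidate set really is a one-line change – a sketch: swapping the pairwise formula y ~ . * . for y ~ .^3 adds every three-way interaction as well.)

# All main effects plus every two- and three-way interaction
X3 <- model.matrix(y ~ .^3, data = simdata)[, -1]
ncol(X3)  # far more candidate columns than the 93 pairwise ones above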
As I’ve said, UPSIs have incentives and inertia on their side. The more widespread we make this practice, the better we can ensure that the publications keep flowing, regardless of the truth.