Building Propensity Score Models

Lucy D’Agostino McGowan

Observational Studies

Confounding

Propensity scores

Rosenbaum and Rubin showed in observational studies, conditioning on propensity scores can lead to unbiased estimates of the exposure effect

There are no unmeasured confounders
Every subject has a nonzero probability of receiving either exposure

Propensity scores

Fit a logistic regression predicting exposure using known covariates

\[Pr(exposure = 1) = \frac{1}{1+\exp(-X\beta)}\]

Each individuals’ predicted values are the propensity scores

Propensity scores

library(tidyverse)
library(broom)

Propensity scores

glm(exposure ~ confounder_1 + confounder_2 + confounder_3 + ..., 
    data = df,
    family = binomial())

Propensity scores

glm(exposure ~ confounder_1 + confounder_2 + confounder_3 + ..., 
    data = df,
    family = binomial()) |>
  augment(type.predict = "response", data = df)

Propensity scores

Example

Photo by Anna CC-BY-SA-4.0

Historically, guests who stayed in a Walt Disney World resort hotel were able to access the park during “Extra Magic Hours” during which the park was closed to all other guests.

These extra hours could be in the morning or evening.

The Seven Dwarfs Mine Train is a ride at Walt Disney World’s Magic Kingdom. Typically, each day Magic Kingdom may or may not be selected to have these “Extra Magic Hours”.

We are interested in examining the relationship between whether there were “Extra Magic Hours” in the morning and the average wait time for the Seven Dwarfs Mine Train the same day between 9am and 10am.

`Application Exercise`

10:00

Open appex-03.qmd
Using the confounders identified, fit a propensity score model for park_extra_magic_morning
Stretch: Create two histograms, one of the propensity scores for days with extra morning magic hours and one for those without

Propensity scores

Matching

Weighting

Stratification

Direct Adjustment

…

Estimands

Greifer, N., & Stuart, E. A. (2021). Choosing the estimand when matching or weighting in observational studies. arXiv preprint arXiv:2106.10577. See also Choosing Estimands in our book.

Propensity scores

Matching

Weighting

Stratification

Direct Adjustment

…

Target estimands

Average Treatment Effect (ATE)

\[\tau = E[Y(1) - Y(0)]\]

Target estimands

Average Treatment Effect among the Treated (ATT)

\[\tau = E[Y(1) - Y(0) | X = 1]\]

Matching in R (ATT)

library(MatchIt)
m <- matchit(qsmk ~ sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  data = nhefs_complete)
m

A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 1566 (original), 806 (matched)
 - target estimand: ATT
 - covariates: sex, race, age, I(age^2), education, smokeintensity, I(smokeintensity^2), smokeyrs, I(smokeyrs^2), exercise, active, wt71, I(wt71^2)

Matching in R (ATT)

matched_data <- get_matches(m, id = "i")
glimpse(matched_data)

Rows: 806
Columns: 71
$ i                 <chr> "11", "1220", "15", "1082", "18"…
$ subclass          <fct> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,…
$ weights           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ seqn              <dbl> 428, 23045, 446, 22294, 596, 140…
$ qsmk              <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,…
$ death             <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
$ yrdth             <dbl> NA, NA, 88, NA, NA, NA, NA, NA, …
$ modth             <dbl> NA, NA, 1, NA, NA, NA, NA, NA, N…
$ dadth             <dbl> NA, NA, 3, NA, NA, NA, NA, NA, N…
$ sbp               <dbl> 135, 159, 141, 113, 151, NA, 125…
$ dbp               <dbl> 89, 91, 79, 73, 80, NA, 71, 85, …
$ sex               <fct> 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,…
$ age               <dbl> 43, 49, 71, 36, 48, 51, 56, 40, …
$ race              <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ income            <dbl> 19, 22, 17, 21, 18, 22, 20, 18, …
$ marital           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ school            <dbl> 12, 12, 0, 12, 12, 9, 12, 10, 17…
$ education         <fct> 3, 3, 1, 3, 3, 2, 3, 2, 5, 5, 3,…
$ ht                <dbl> 176.5938, 160.2812, 147.0938, 17…
$ wt71              <dbl> 63.96, 47.29, 75.64, 68.38, 62.0…
$ wt82              <dbl> 79.83226, 53.07031, 56.69905, 73…
$ wt82_71           <dbl> 15.8722571, 5.7803073, -18.94095…
$ birthplace        <dbl> 42, 30, NA, 19, 36, 47, NA, 42, …
$ smokeintensity    <dbl> 30, 20, 40, 9, 2, 5, 20, 5, 30, …
$ smkintensity82_71 <dbl> -30, 0, -40, 1, -2, 1, -20, 0, -…
$ smokeyrs          <dbl> 24, 29, 41, 30, 30, 29, 11, 20, …
$ asthma            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bronch            <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,…
$ tb                <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
$ hf                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hbp               <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 1,…
$ pepticulcer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ colitis           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hepatitis         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ chroniccough      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
$ hayfever          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ diabetes          <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0,…
$ polio             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ tumor             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nervousbreak      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ alcoholpy         <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
$ alcoholfreq       <dbl> 3, 0, 4, 0, 1, 2, 3, 1, 3, 0, 2,…
$ alcoholtype       <dbl> 3, 3, 4, 1, 2, 3, 4, 1, 2, 3, 3,…
$ alcoholhowmuch    <dbl> 2, 2, NA, 6, 1, 3, NA, 5, 1, 3, …
$ pica              <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0,…
$ headache          <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
$ otherpain         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weakheart         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ allergies         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nerves            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ lackpep           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hbpmed            <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 1,…
$ boweltrouble      <dbl> 0, 2, 1, 2, 0, 0, 0, 0, 0, 2, 0,…
$ wtloss            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ infection         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ active            <fct> 1, 2, 1, 0, 1, 0, 0, 0, 0, 1, 2,…
$ exercise          <fct> 1, 1, 1, 0, 1, 0, 2, 0, 0, 2, 2,…
$ birthcontrol      <dbl> 2, 0, 0, 2, 0, 2, 0, 0, 2, 0, 2,…
$ pregnancies       <dbl> NA, 4, 15, NA, 3, NA, 4, 2, NA, …
$ cholesterol       <dbl> 173, 279, 229, 200, 225, 199, 23…
$ hightax82         <dbl> 0, 0, NA, 0, 0, 0, NA, 0, 0, 0, …
$ price71           <dbl> 2.346680, 2.104980, NA, 2.199707…
$ price82           <dbl> 1.797363, 1.698242, NA, 1.847900…
$ tax71             <dbl> 1.3649902, 1.0498047, NA, 1.1022…
$ tax82             <dbl> 0.5718994, 0.4399414, NA, 0.5718…
$ price71_82        <dbl> 0.54931641, 0.40686035, NA, 0.35…
$ tax71_82          <dbl> 0.7929688, 0.6099854, NA, 0.5303…
$ id                <int> 11, 1274, 15, 1135, 18, 564, 23,…
$ censored          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ older             <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,…
$ distance          <dbl> 0.2384597, 0.2381918, 0.2090935,…

Target estimands

Average Treatment Effect among the Controls (ATC)

\[\tau = E[Y(1) - Y(0) | X = 0]\]

Matching in R (ATC)

library(MatchIt)
m <- matchit(qsmk ~ sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  data = nhefs_complete,
  estimand = "ATC")
m

A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 1566 (original), 806 (matched)
 - target estimand: ATC
 - covariates: sex, race, age, I(age^2), education, smokeintensity, I(smokeintensity^2), smokeyrs, I(smokeyrs^2), exercise, active, wt71, I(wt71^2)

Target estimands

Average Treatment Effect among the Matched (ATM)

Matching in R (ATM)

library(MatchIt)
m <- matchit(qsmk ~ sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  data = nhefs_complete,
  link = "linear.logit", 
  caliper = 0.1) 
m

Observations with propensity scores (on the linear logit scale) within 0.1 standard errors (the caliper) will be discarded

Matching in R (ATM)

A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression and linearized
 - caliper: <distance> (0.063)
 - number of obs.: 1566 (original), 780 (matched)
 - target estimand: ATT
 - covariates: sex, race, age, I(age^2), education, smokeintensity, I(smokeintensity^2), smokeyrs, I(smokeyrs^2), exercise, active, wt71, I(wt71^2)

Matching in R (ATM)

matched_data <- get_matches(m, id = "i")
glimpse(matched_data)

Rows: 780
Columns: 71
$ i                 <chr> "11", "1220", "15", "1082", "18"…
$ subclass          <fct> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6,…
$ weights           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ seqn              <dbl> 428, 23045, 446, 22294, 596, 140…
$ qsmk              <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,…
$ death             <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
$ yrdth             <dbl> NA, NA, 88, NA, NA, NA, NA, NA, …
$ modth             <dbl> NA, NA, 1, NA, NA, NA, NA, NA, N…
$ dadth             <dbl> NA, NA, 3, NA, NA, NA, NA, NA, N…
$ sbp               <dbl> 135, 159, 141, 113, 151, NA, 125…
$ dbp               <dbl> 89, 91, 79, 73, 80, NA, 71, 85, …
$ sex               <fct> 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,…
$ age               <dbl> 43, 49, 71, 36, 48, 51, 56, 40, …
$ race              <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ income            <dbl> 19, 22, 17, 21, 18, 22, 20, 18, …
$ marital           <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,…
$ school            <dbl> 12, 12, 0, 12, 12, 9, 12, 10, 17…
$ education         <fct> 3, 3, 1, 3, 3, 2, 3, 2, 5, 5, 2,…
$ ht                <dbl> 176.5938, 160.2812, 147.0938, 17…
$ wt71              <dbl> 63.96, 47.29, 75.64, 68.38, 62.0…
$ wt82              <dbl> 79.83226, 53.07031, 56.69905, 73…
$ wt82_71           <dbl> 15.8722571, 5.7803073, -18.94095…
$ birthplace        <dbl> 42, 30, NA, 19, 36, 47, NA, 42, …
$ smokeintensity    <dbl> 30, 20, 40, 9, 2, 5, 20, 5, 30, …
$ smkintensity82_71 <dbl> -30, 0, -40, 1, -2, 1, -20, 0, -…
$ smokeyrs          <dbl> 24, 29, 41, 30, 30, 29, 11, 20, …
$ asthma            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ bronch            <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,…
$ tb                <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
$ hf                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hbp               <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0,…
$ pepticulcer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ colitis           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hepatitis         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ chroniccough      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
$ hayfever          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ diabetes          <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0,…
$ polio             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ tumor             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nervousbreak      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ alcoholpy         <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
$ alcoholfreq       <dbl> 3, 0, 4, 0, 1, 2, 3, 1, 3, 0, 1,…
$ alcoholtype       <dbl> 3, 3, 4, 1, 2, 3, 4, 1, 2, 3, 3,…
$ alcoholhowmuch    <dbl> 2, 2, NA, 6, 1, 3, NA, 5, 1, 3, …
$ pica              <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0,…
$ headache          <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,…
$ otherpain         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weakheart         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ allergies         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nerves            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ lackpep           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hbpmed            <dbl> 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 0,…
$ boweltrouble      <dbl> 0, 2, 1, 2, 0, 0, 0, 0, 0, 2, 0,…
$ wtloss            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ infection         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ active            <fct> 1, 2, 1, 0, 1, 0, 0, 0, 0, 1, 1,…
$ exercise          <fct> 1, 1, 1, 0, 1, 0, 2, 0, 0, 2, 2,…
$ birthcontrol      <dbl> 2, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0,…
$ pregnancies       <dbl> NA, 4, 15, NA, 3, NA, 4, 2, NA, …
$ cholesterol       <dbl> 173, 279, 229, 200, 225, 199, 23…
$ hightax82         <dbl> 0, 0, NA, 0, 0, 0, NA, 0, 0, 0, …
$ price71           <dbl> 2.346680, 2.104980, NA, 2.199707…
$ price82           <dbl> 1.797363, 1.698242, NA, 1.847900…
$ tax71             <dbl> 1.3649902, 1.0498047, NA, 1.1022…
$ tax82             <dbl> 0.5718994, 0.4399414, NA, 0.5718…
$ price71_82        <dbl> 0.54931641, 0.40686035, NA, 0.35…
$ tax71_82          <dbl> 0.7929688, 0.6099854, NA, 0.5303…
$ id                <int> 11, 1274, 15, 1135, 18, 564, 23,…
$ censored          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ older             <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,…
$ distance          <dbl> -1.1611429, -1.1626186, -1.33039…

Your Turn 1

10:00

Using the propensity scores you created in the previous exercise, create a “matched” data set using the ATM method with a caliper of 0.2.

Propensity scores

Matching

Weighting

Stratification

Direct Adjustment

…

Target estimands: ATE

Average Treatment Effect (ATE)

\[\Large w_{ATE} = \frac{X_i}{p_i} + \frac{1-X_i}{1 - p_i}\]

(X / p) + ((1 - X) / (1 - p))

Target estimands: ATT & ATC

Average Treatment Effect Among the Treated (ATT) \[\Large w_{ATT} = \frac{p_i X_i}{p_i} + \frac{p_i (1-X_i)}{1-p_i}\]

((p * X) / p) + ((p * (1 - X)) / (1 - p))

Target estimands: ATT & ATC

Average Treatment Effect Among the Controls (ATC) \[\Large w_{ATC} = \frac{(1-p_i)X_i}{p_i} + \frac{(1-p_i)(1-X_i)}{(1-p_i)}\]

(((1 - p) * X) / p) + (((1 - p) * (1 - X)) / (1 - p))

Target estimands: ATM & ATO

Average Treatment Effect Among the Evenly Matchable (ATM) \[\Large w_{ATM} = \frac{\min \{p_i, 1-p_i\}}{X_ip_i + (1-X_i)(1-p_i)}\]

pmin(p, 1 - p) / (X * p + (1 - X) * (1 - p))

Target estimands: ATM & ATO

Average Treatment Effect Among the Overlap Population \[\Large w_{ATO} = (1-p_i)X_i + p_i(1-X_i)\]

(1 - p) * X + p * (1 - X)

Histogram of propensity scores

ATE

ATT

ATC

ATM

ATO

ATE in R

Average Treatment Effect (ATE) \(w_{ATE} = \frac{X_i}{p_i} + \frac{1-X_i}{1 - p_i}\)

library(propensity)
df <- propensity_model |>
  augment(type.predict = "response", data = nhefs_complete) |>
  mutate(w_ate = wt_ate(.fitted, qsmk))

`Application Exercise`

10:00

Using the propensity scores you created in the previous exercise, add the ATE weights to your data frame
Stretch: Using the same propensity scores, create ATM weights