Causal Diagrams

Lucy D’Agostino McGowan

What causes spurious correlations?

  1. Random chance
  2. Confounders

How do we correct for these?

Random chance

  • Classic statistics!
  • Measures of uncertainty (e.g., confidence intervals)

library(tidyverse)

# Nicolas Cage film appearances per year vs. US pool drownings
d <- tibble(
  year = 1999:2009,
  nic_cage = c(2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4),
  drownings = c(109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102)
)
cor.test(~ nic_cage + drownings, data = d)

    Pearson's product-moment correlation

data:  nic_cage and drownings
t = 2.7, df = 9, p-value = 0.03
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1101 0.9045
sample estimates:
  cor 
0.666 
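
How could a correlation this strong be chance? A small simulation sketch (not from the original slides): compare the drownings series with many unrelated random series, and a handful will correlate just as strongly by luck alone.

# correlate drownings with 1,000 random series of the same length;
# a handful will look as "strong" as nic_cage purely by chance
set.seed(1)
random_series <- replicate(1000, rnorm(11))
chance_cors <- apply(random_series, 2, cor, y = d$drownings)
mean(abs(chance_cors) > 0.666)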

Confounders

  • Adjust for confounders

# deaths by bedsheet entanglement vs. per-capita cheese consumption (lbs)
d <- tibble(
  year = 2000:2009,
  bedsheets = c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717),
  cheese = c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
)
cor.test(~ bedsheets + cheese, data = d)

    Pearson's product-moment correlation

data:  bedsheets and cheese
t = 8.3, df = 8, p-value = 3e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7864 0.9877
sample estimates:
   cor 
0.9471 

Confounder: time

# remove the shared time trend by differencing:
# *_ind is the year-over-year change in each series
d <- tibble(
  year = 2000:2009,
  bedsheets = c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717),
  cheese = c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8),
  bedsheets_ind = bedsheets - lag(bedsheets),
  cheese_ind = cheese - lag(cheese)
)
cor.test(~ bedsheets_ind + cheese_ind, data = d)

    Pearson's product-moment correlation

data:  bedsheets_ind and cheese_ind
t = 0.94, df = 7, p-value = 0.4
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.424  0.817
sample estimates:
   cor 
0.3342 
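
Differencing is one way to remove a shared trend; another is to adjust for the confounder directly. A minimal sketch (not from the original slides), putting year into a regression:

# adjust for the time trend in a regression: conditional on year,
# how much does bedsheets still tell us about cheese?
lm(cheese ~ bedsheets + year, data = d) |>
  summary()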

Causal diagrams

  • Visual depiction of causal relationships
  • Shows variables (nodes) and relationships (edges)
  • Time goes left to right
  • An arrow from one variable to another indicates a direct causal effect

DAGs

  • Directed: each edge is an arrow pointing from cause to effect
  • Acyclic: no cycles, so a variable can never cause itself (even indirectly)
  • Graph: nodes (variables) connected by edges

Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?





Application Exercise

  1. Write down factors that you think would influence the exposure and outcome
  2. Turn to your neighbor and discuss their proposal

Step 1: Specify your DAG

library(ggdag)
# each formula reads "effect ~ its direct causes":
# mood, humor, and preparedness affect podcast listening;
# mood and preparedness affect the exam score
dagify(
  podcast ~ mood + humor + prepared,
  exam ~ mood + prepared
)
dag {
exam
humor
mood
podcast
prepared
humor -> podcast
mood -> exam
mood -> podcast
prepared -> exam
prepared -> podcast
}

Step 1: Specify your DAG

podcast_dag <- dagify(
  podcast ~ mood + humor + prepared,
  exam ~ mood + prepared,
  # arrange nodes left to right in time order
  coords = time_ordered_coords(),
  # tag the exposure and outcome of interest
  exposure = "podcast",
  outcome = "exam",
  labels = c(
    podcast = "podcast",
    exam = "exam score",
    mood = "mood",
    humor = "humor",
    prepared = "prepared"
  )
)
# plot the DAG, using the readable labels instead of raw node names
ggdag(podcast_dag, use_labels = "label", text = FALSE) +
  theme_dag()

Causal effects and backdoor paths

Ok, correlation != causation. But why not?

We want to know if x -> y

But other open paths between x and y also produce associations

ggdag_paths()

Identify “backdoor” paths: paths that start with an arrow into the exposure. Here there are two: podcast <- mood -> exam and podcast <- prepared -> exam.

ggdag_paths(podcast_dag)
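
The same paths can also be listed as text with the dagitty package (a small sketch, assuming the podcast_dag defined above):

library(dagitty)
# list the paths between the tagged exposure and outcome,
# marking which ones are open
paths(podcast_dag)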

Closing backdoor paths

We need to account for these open, non-causal paths

  • Randomization (by design: it removes the arrows into the exposure)
  • Stratification, adjustment, weighting, matching, etc. (in the analysis)

Identifying adjustment sets

ggdag_adjustment_set(podcast_dag)

Identifying adjustment sets

library(dagitty)
# the minimal set(s) of covariates that close every backdoor path
adjustmentSets(podcast_dag)
{ mood, prepared }

Let’s prove it!

set.seed(10)
# simulate a dataset consistent with the DAG's structure
sim_data <- podcast_dag |>
  simulate_data()

sim_data
# A tibble: 500 × 5
     exam  humor   mood podcast prepared
    <dbl>  <dbl>  <dbl>   <dbl>    <dbl>
 1 -0.435  0.263 -0.100  -0.630   1.07  
 2 -0.593  0.317  0.143  -1.55    0.0640
 3  0.786  1.97  -0.591  -0.318  -0.439 
 4 -0.103  2.86  -0.139   1.07    0.754 
 5 -0.614 -2.39   0.702   0.464   0.356 
 6  1.01   1.21   0.910   0.769   0.561 
 7  0.167 -1.37  -0.559  -0.866   0.214 
 8  1.16   0.164 -0.743   0.969  -1.67  
 9  0.650  0.215 -0.248   0.691  -0.303 
10  0.156  0.713  1.19   -1.02   -0.219 
# ℹ 490 more rows

Let’s prove it!
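
In this DAG there is no arrow from podcast to exam, so the true causal effect in the simulated data is zero. A minimal sketch of the check (the model code here is assumed, not from the original slides): compare an unadjusted model with one adjusted for {mood, prepared}.

# unadjusted: the open backdoor paths through mood and prepared
# can induce a spurious podcast-exam association
lm(exam ~ podcast, data = sim_data)

# adjusted for the adjustment set {mood, prepared}: the podcast
# coefficient should land near zero, the true effect, up to noise
lm(exam ~ podcast + mood + prepared, data = sim_data)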

Choosing what variables to include

Use adjustment sets together with domain knowledge

Conduct a sensitivity analysis if you couldn't measure something important

Common trip-ups

  • Using prediction metrics to select confounders
  • The 10% rule (the change-in-estimate criterion)
  • Choosing variables only because they predict the outcome or the exposure
  • Forgetting to consider time-ordering (something has to happen before something else to cause it!)
  • Selection bias and colliders (more later!)
  • Incorrect functional form for confounders (e.g., BMI is often non-linear; see the sketch below)
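
For the last point, a minimal sketch (the variable names here are hypothetical): model the confounder flexibly, for example with a natural spline, instead of forcing a single linear term.

library(splines)
# hypothetical variables: adjust for bmi with a natural cubic spline
# (3 degrees of freedom) rather than assuming a linear effect
lm(outcome ~ exposure + ns(bmi, df = 3), data = my_data)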