Causal Diagrams

Lucy D’Agostino McGowan

What causes spurious correlations?

  1. Random chance
  2. Confounders

How do we correct for these?

Random chance

  • Classic statistics!
  • Measures of uncertainty (e.g., confidence intervals)

library(tidyverse)

# Nicolas Cage film appearances per year vs. US pool drownings
d <- tibble(
  year = 1999:2009,
  nic_cage = c(2, 2, 2, 3, 1, 1, 2, 3, 4, 1, 4),
  drownings = c(109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102)
)
cor.test(~ nic_cage + drownings, data = d)

    Pearson's product-moment correlation

data:  nic_cage and drownings
t = 2.7, df = 9, p-value = 0.03
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1101 0.9045
sample estimates:
  cor 
0.666 
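
How could a correlation this strong be chance? A small simulation sketch (not from the original slides): compare the drownings series with many unrelated random series, and a handful will correlate just as strongly by luck alone.

# correlate drownings with 1,000 random series of the same length;
# a handful will look as "strong" as nic_cage purely by chance
set.seed(1)
random_series <- replicate(1000, rnorm(11))
chance_cors <- apply(random_series, 2, cor, y = d$drownings)
mean(abs(chance_cors) > 0.666)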

Confounders

  • Adjust for confounders

# deaths by bedsheet entanglement vs. per-capita cheese consumption (lbs)
d <- tibble(
  year = 2000:2009,
  bedsheets = c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717),
  cheese = c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
)
cor.test(~ bedsheets + cheese, data = d)

    Pearson's product-moment correlation

data:  bedsheets and cheese
t = 8.3, df = 8, p-value = 3e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7864 0.9877
sample estimates:
   cor 
0.9471 

Confounder: time

# remove the shared time trend by differencing:
# *_ind is the year-over-year change in each series
d <- tibble(
  year = 2000:2009,
  bedsheets = c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717),
  cheese = c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8),
  bedsheets_ind = bedsheets - lag(bedsheets),
  cheese_ind = cheese - lag(cheese)
)
cor.test(~ bedsheets_ind + cheese_ind, data = d)

    Pearson's product-moment correlation

data:  bedsheets_ind and cheese_ind
t = 0.94, df = 7, p-value = 0.4
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.424  0.817
sample estimates:
   cor 
0.3342 
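
Differencing is one way to remove a shared trend; another is to adjust for the confounder directly. A minimal sketch (not from the original slides), putting year into a regression:

# adjust for the time trend in a regression: conditional on year,
# how much does bedsheets still tell us about cheese?
lm(cheese ~ bedsheets + year, data = d) |>
  summary()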

Causal diagrams

  • Visual depiction of causal relationships
  • Shows variables (nodes) and relationships (edges)
  • Time goes left to right
  • An arrow from one variable to another indicates a direct causal effect

DAGs

  • Directed: each edge is an arrow pointing from cause to effect
  • Acyclic: no cycles, so a variable can never cause itself (even indirectly)
  • Graph: nodes (variables) connected by edges

Does listening to a comedy podcast the morning before an exam improve graduate students' test scores?





Application Exercise

  1. Write down factors that you think would influence the exposure and outcome
  2. Turn to your neighbor and discuss their proposal

Step 1: Specify your DAG

library(ggdag)
# each formula reads "effect ~ its direct causes":
# mood, humor, and preparedness affect podcast listening;
# mood and preparedness affect the exam score
dagify(
  podcast ~ mood + humor + prepared,
  exam ~ mood + prepared
)
dag {
exam
humor
mood
podcast
prepared
humor -> podcast
mood -> exam
mood -> podcast
prepared -> exam
prepared -> podcast
}

Step 1: Specify your DAG

podcast_dag <- dagify(
  podcast ~ mood + humor + prepared,
  exam ~ mood + prepared,
  # arrange nodes left to right in time order
  coords = time_ordered_coords(),
  # tag the exposure and outcome of interest
  exposure = "podcast",
  outcome = "exam",
  labels = c(
    podcast = "podcast",
    exam = "exam score",
    mood = "mood",
    humor = "humor",
    prepared = "prepared"
  )
)
# plot the DAG, using the readable labels instead of raw node names
ggdag(podcast_dag, use_labels = "label", text = FALSE) +
  theme_dag()

Causal effects and backdoor paths

Ok, correlation != causation. But why not?

We want to know if x -> y

But other open paths between x and y also produce associations

ggdag_paths()

Identify “backdoor” paths: paths that start with an arrow into the exposure. Here there are two: podcast <- mood -> exam and podcast <- prepared -> exam.

ggdag_paths(podcast_dag)
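
The same paths can also be listed as text with the dagitty package (a small sketch, assuming the podcast_dag defined above):

library(dagitty)
# list the paths between the tagged exposure and outcome,
# marking which ones are open
paths(podcast_dag)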

Closing backdoor paths

We need to account for these open, non-causal paths

  • Randomization (by design: it removes the arrows into the exposure)
  • Stratification, adjustment, weighting, matching, etc. (in the analysis)

Identifying adjustment sets

ggdag_adjustment_set(podcast_dag)

Identifying adjustment sets

library(dagitty)
# the minimal set(s) of covariates that close every backdoor path
adjustmentSets(podcast_dag)
{ mood, prepared }

Let’s prove it!

set.seed(10)
# simulate a dataset consistent with the DAG's structure
sim_data <- podcast_dag |>
  simulate_data()

sim_data
# A tibble: 500 × 5
     exam  humor   mood podcast prepared
    <dbl>  <dbl>  <dbl>   <dbl>    <dbl>
 1 -0.435  0.263 -0.100  -0.630   1.07  
 2 -0.593  0.317  0.143  -1.55    0.0640
 3  0.786  1.97  -0.591  -0.318  -0.439 
 4 -0.103  2.86  -0.139   1.07    0.754 
 5 -0.614 -2.39   0.702   0.464   0.356 
 6  1.01   1.21   0.910   0.769   0.561 
 7  0.167 -1.37  -0.559  -0.866   0.214 
 8  1.16   0.164 -0.743   0.969  -1.67  
 9  0.650  0.215 -0.248   0.691  -0.303 
10  0.156  0.713  1.19   -1.02   -0.219 
# ℹ 490 more rows

Let’s prove it!
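
In this DAG there is no arrow from podcast to exam, so the true causal effect in the simulated data is zero. A minimal sketch of the check (the model code here is assumed, not from the original slides): compare an unadjusted model with one adjusted for {mood, prepared}.

# unadjusted: the open backdoor paths through mood and prepared
# can induce a spurious podcast-exam association
lm(exam ~ podcast, data = sim_data)

# adjusted for the adjustment set {mood, prepared}: the podcast
# coefficient should land near zero, the true effect, up to noise
lm(exam ~ podcast + mood + prepared, data = sim_data)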

Choosing what variables to include

Use adjustment sets together with domain knowledge

Conduct a sensitivity analysis if you couldn't measure something important

Common trip-ups

  • Using prediction metrics to select confounders
  • The 10% rule (the change-in-estimate criterion)
  • Choosing variables only because they predict the outcome or the exposure
  • Forgetting to consider time-ordering (something has to happen before something else to cause it!)
  • Selection bias and colliders (more later!)
  • Incorrect functional form for confounders (e.g., BMI is often non-linear; see the sketch below)
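
For the last point, a minimal sketch (the variable names here are hypothetical): model the confounder flexibly, for example with a natural spline, instead of forcing a single linear term.

library(splines)
# hypothetical variables: adjust for bmi with a natural cubic spline
# (3 degrees of freedom) rather than assuming a linear effect
lm(outcome ~ exposure + ns(bmi, df = 3), data = my_data)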