Causation vs. Correlation

Thinking Like an Economist

Prof. Dr. Claudius Gräbner-Radkowitsch

2026-05-21

Today’s session

Where we are: You can specify, estimate, and diagnose regression models
Today’s question: What do those coefficients actually claim?
The shift: From estimation to interpretation — from how to what it means

Note

No new estimator today. The methodological move is entirely in interpretation.

A puzzle from hiring data

A consultancy analyses firm-level data and finds:

“Firms that hired more sales staff last year had lower revenue this year.”

Conclusion: Hiring sales staff reduces revenue. Cut the sales team.

Wait. What else might explain this pattern?

The confounder hiding in plain sight

Product quality drives both hiring decisions and revenue.

Firms with poor products hire more staff to compensate
Poor products also depress revenue

The correlation between hiring and revenue is real — but the causal story is backwards.

The identification question

Regression answers:

“How does Y vary with X, holding controls constant?”

That is not the same as:

“What is the effect of X on Y?”

The gap between the two is closed — or not — by an identification argument.

Important

Association vs. causation: A regression coefficient is associational by default. Causal interpretation requires an identification argument that lives outside the regression output.

DAGs: a language for causal structure

DAG (Directed Acyclic Graph): A diagram where nodes are variables and arrows represent direct causal effects. Acyclic means no variable causes itself.

We use DAGs to make our causal assumptions explicit and testable.

Three structural roles a variable can play:

Confounder
Mediator
Collider

Three structural roles

Confounder: Common cause of X and Y → must control
Mediator: On the causal path from X to Y → must not control (if you want the total effect)
Collider: Caused by both X and Y → must not control (conditioning opens a spurious path)

Application: the gender wage gap

We want to know: does gender affect wages?

Let’s draw the causal structure before running a single regression.

What factors are in play?

Direct effect: gender → wage (e.g. discrimination in pay setting)
Occupational sorting: gender → occupation → wage
Unobserved background: family background, networks → both occupation and wage

The gender wage gap DAG

Key question: What happens when we add occupation as a control variable?

The bad control

Controlling for occupation blocks the path: gender → occupation → wage

The coefficient on gender shrinks — but not because we found a better estimate
We are now estimating a different, narrower question: the direct pay gap within occupations
Whether that is interesting depends on the research question

Important

Bad control: A mediator or collider incorrectly added as a control variable. The result is not more precise — it answers a different question.

Live demo: three models, three questions

Using CPS earnings data (149,316 US workers, 2014):

Model	Specification	Question answered
1	`wage ~ gender`	Raw gap
2	`wage ~ gender + age + education`	Gap controlling for demographics
3	`wage ~ gender + age + education + occupation`	Gap within occupations

The coefficient on gender shrinks from Model 1 → 3.

This does not mean Model 3 is more correct. It means each model answers a different question.

What makes a coefficient causal?

Not the regression itself — the identification argument.

An identification argument claims: “The variation in X I am exploiting is as good as random with respect to Y.”

Two strategies you will learn:

Session 6 — Panel fixed effects: Control for all stable unobserved differences across units
Session 7 — Randomised experiments: Randomisation makes treatment assignment independent of everything else

Note

These strategies are not more credible because the math is different. They are more credible because they provide an identification argument.

In-class exercise

Task 1 — Reproduce and interpret: Run the three-model progression on cps-earnings. For each model, write 3–5 sentences: “What causal claim, if any, can this model support — and why?”

Task 2 — Draw the DAG: A firm wants to know whether advertising causes sales. Both are also driven by product quality, and advertising success may itself affect product investment.

Draw the DAG
Identify any confounders, mediators, or colliders
What would you need to control for — and what should you avoid controlling for?

Where we go next

Today: You know what an identification argument is, and why regression alone cannot provide one.

Session 6: Panel fixed effects — exploiting within-unit variation over time to control for all stable unobserved differences.

Session 7: A randomised experiment — where randomisation itself is the identification argument.

Now you know why those sessions matter.