Session 6 — Quantitative Data Analysis: Applied Data Science and Modern Econometrics
Europa-Universität Flensburg & Johannes Kepler University Linz
2026-05-28
Cross-sectional evidence: yes, clearly.
Figure 1
Richard Easterlin (1974): richer countries are happier, yet rising incomes within countries don’t always produce rising happiness.
How can both patterns be true at the same time?
This is not just a happiness puzzle
It is a fundamental question about what kind of comparison we are making — and it shows the value added of panel data.
Between-country
Do richer countries have happier populations than poorer ones?
→ Compares different countries at one point in time
Within-country
As a country's income rises, does its happiness rise too?
→ Tracks the same country across time
These are genuinely different empirical questions — and they can give different answers.
Japan: survey questions changed over time, making comparisons invalid. Once you restrict to consistent question periods, happiness and income track together.
USA: aggregate income grew, but median incomes stagnated due to rising inequality. Most people did not experience the income growth the average implied.
Key insight
The “paradox” disappears once you are careful about which units you compare and over what time period. Panel data methods force this discipline.
Even if the within-country correlation is real, a cross-sectional regression misleads us.
Richer countries also have:
These all correlate with both income and happiness — but they are not income.
Omitted variable bias
A regression of happiness on income across countries picks up all of these country-level differences and misattributes them to income. This is OVB — a familiar enemy.
The promise: if the confounders (\(\eta_i\)) are stable over time, panel methods can control for them — even if we never observe them directly.
The payoff: estimates that use only within-unit change over time, holding stable characteristics constant by construction.
This is what today’s session is about.
Cross-section
N units, one time point
obs indexed by \(i\)
e.g. happiness survey across 140 countries in 2019
Time series
1 unit, T time points
obs indexed by \(t\)
e.g. US life satisfaction, 2005–2023
Panel data ✓
N units, T time points
obs indexed by \((i, t)\)
e.g. ~140 countries surveyed annually 2005–2023
Today’s dataset: World Happiness Report panel — life satisfaction and GDP per capita for ~140 countries, 2005–2023.
Balanced panel: every unit observed at every time point — the data rectangle is complete.
Unbalanced panel: some units missing at some time points — the most common situation in practice.
Balanced
| country | 2010 | 2015 | 2020 |
|---|---|---|---|
| Germany | ✓ | ✓ | ✓ |
| Japan | ✓ | ✓ | ✓ |
| Brazil | ✓ | ✓ | ✓ |
Unbalanced
| country | 2010 | 2015 | 2020 |
|---|---|---|---|
| Germany | ✓ | ✓ | ✓ |
| Japan | ✓ | ✗ | ✓ |
| Brazil | ✗ | ✓ | ✓ |
The WHR panel is unbalanced: not every country is surveyed in every year. This is typical and not a problem for FE estimation, provided the gaps are not themselves driven by shocks to the outcome.
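A quick way to check whether a panel is balanced is to count observations per unit. The sketch below rebuilds the unbalanced example table as a toy data frame; the object names (`panel`, `obs_per_country`, `is_balanced`) are illustrative, and it assumes there are no duplicated country-year rows.

```r
# Toy panel mirroring the unbalanced example table (one row per observed country-year).
panel <- data.frame(
  country = c("Germany", "Germany", "Germany", "Japan", "Japan", "Brazil", "Brazil"),
  year    = c(2010, 2015, 2020, 2010, 2020, 2015, 2020)
)

# Number of observations per country, and number of distinct periods in the data.
obs_per_country <- table(panel$country)
n_periods       <- length(unique(panel$year))

# Balanced means every country is observed in every period.
is_balanced <- all(obs_per_country == n_periods)

print(obs_per_country)
print(is_balanced)
```

The same two lines applied to the real data show how many years each country contributes before any model is estimated.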
The simplest approach: treat all \((i,t)\) pairs as independent observations.
\[y_{it} = \alpha + \beta \, x_{it} + \epsilon_{it}\]
Applied to Easterlin: regress happiness on log GDP, pooling all countries and years.
The problem: observations from the same country across years are not independent. The error \(\epsilon_{it}\) is not truly random — it has two components:
\[\epsilon_{it} = \underbrace{\eta_i}_{\text{stable, country-specific}} + \underbrace{v_{it}}_{\text{idiosyncratic}}\]
The failure: if \(\eta_i\) correlates with \(x_{it}\), pooled OLS attributes variation that comes from country differences to income → OVB.
Scandinavian countries have high happiness and high GDP and strong institutions. Those institutions are not caused by income — they are \(\eta_i\).
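The bias can be made visible with simulated data. This is a minimal sketch under made-up assumptions: the true slope is set to 0.5, and \(x_{it}\) is built to correlate with \(\eta_i\). All names and parameter values are hypothetical and unrelated to the WHR data.

```r
set.seed(1)
n_units   <- 200
n_periods <- 10
id <- rep(seq_len(n_units), each = n_periods)

eta <- rnorm(n_units)[id]                    # stable unit effect eta_i, repeated over time
x   <- eta + rnorm(n_units * n_periods)      # regressor correlated with eta_i
v   <- rnorm(n_units * n_periods)            # idiosyncratic error v_it

beta_true <- 0.5
y <- beta_true * x + eta + v                 # eta_i ends up in the pooled error term

# Pooled OLS ignores eta_i and attributes its influence to x.
b_pooled <- unname(coef(lm(y ~ x))["x"])
print(c(true = beta_true, pooled = b_pooled))
```

In this design the pooled slope lands well above 0.5, because part of what it picks up is \(\eta_i\) rather than \(x_{it}\).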
The fix: \(\eta_i\) is stable over time, so subtract each country’s time-average:
\[\tilde{y}_{it} = y_{it} - \bar{y}_i \qquad \tilde{x}_{it} = x_{it} - \bar{x}_i\]
Because \(\eta_i - \bar{\eta}_i = 0\), the unobserved country effect vanishes entirely.
Preview
This within-transformation is the fixed effects estimator. The rest of the session develops the mechanics and how to apply it in R.
Goal: eliminate \(\eta_i\) without observing it.
Strategy: if \(\eta_i\) is time-invariant, subtract each unit’s time-average from its observations.
For unit \(i\), compute \(\bar{y}_i = T^{-1}\sum_t y_{it}\) and \(\bar{x}_i = T^{-1}\sum_t x_{it}\), then:
\[\tilde{y}_{it} = y_{it} - \bar{y}_i \qquad \tilde{x}_{it} = x_{it} - \bar{x}_i\]
Because \(\eta_i - \bar{\eta}_i = 0\), the individual effect vanishes entirely.
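Written out step by step, starting from the pooled model with the composite error and averaging over time for unit \(i\) (here \(\bar{v}_i = T^{-1}\sum_t v_{it}\) is notation introduced only for this derivation):

\[
\begin{aligned}
y_{it} &= \alpha + \beta \, x_{it} + \eta_i + v_{it} \\
\bar{y}_i &= \alpha + \beta \, \bar{x}_i + \eta_i + \bar{v}_i \\
\tilde{y}_{it} = y_{it} - \bar{y}_i &= \beta \, \tilde{x}_{it} + (v_{it} - \bar{v}_i)
\end{aligned}
\]

Both \(\alpha\) and \(\eta_i\) cancel in the subtraction, so only \(\beta\) and the demeaned idiosyncratic error remain.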
OLS on the demeaned data gives the Fixed Effects (FE) estimator, also called the within estimator.
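The within transformation can be carried out by hand in base R. The sketch below uses simulated data (all names and parameter values are hypothetical) and compares OLS on demeaned data with a regression that includes one dummy per unit.

```r
set.seed(42)
n_units   <- 50
n_periods <- 8
id <- rep(seq_len(n_units), each = n_periods)

eta <- rnorm(n_units)[id]                           # unobserved unit effect
x   <- 0.8 * eta + rnorm(n_units * n_periods)       # correlated with the unit effect
y   <- 0.5 * x + eta + rnorm(n_units * n_periods)   # true slope: 0.5

# Within transformation: subtract each unit's own time-average.
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)

# OLS on demeaned data versus OLS with a full set of unit dummies.
b_within <- unname(coef(lm(y_dm ~ x_dm))["x_dm"])
b_lsdv   <- unname(coef(lm(y ~ x + factor(id)))["x"])

print(c(within = b_within, dummies = b_lsdv))
```

The two slopes agree up to numerical precision. The standard errors from `lm()` on the demeaned data are not directly usable, though, because `lm()` does not account for the degrees of freedom used up by the unit means.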
The FE estimator uses only within-unit variation — how \(x\) and \(y\) move relative to each unit’s own average over time.
The question it answers:
When a country’s income rises above its own historical average, does its happiness tend to rise too?
This is much more credible for causal interpretation than pooled OLS, because all stable country-level confounders — observed or not — are controlled for by construction.
Back to our initial question
Does rising income within a country predict rising happiness? FE controls for stable country characteristics: culture, institutions, geography, political history — none need to be measured.
The within-transformation removes all time-invariant variation — including that of covariates.
Variables that do not change over time for a given unit are differenced out entirely:
Their coefficients cannot be estimated with FE.
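A small illustration of why this happens, using a made-up time-invariant indicator (`landlocked` is a hypothetical variable, not part of the WHR data): after demeaning it is identically zero, and a dummy-variable regression cannot separate it from the unit dummies.

```r
set.seed(7)
id <- rep(1:3, each = 4)            # 3 units, 4 periods each

# Hypothetical time-invariant covariate: constant within each unit.
landlocked <- c(0, 1, 0)[id]

# Demeaning removes it completely.
landlocked_dm <- landlocked - ave(landlocked, id)
print(landlocked_dm)

# With unit dummies in the model, the covariate is perfectly collinear;
# lm() reports NA for its coefficient.
y   <- rnorm(length(id))
x   <- rnorm(length(id))
fit <- lm(y ~ x + factor(id) + landlocked)
print(coef(fit)["landlocked"])
```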
Know the limitation
FE is powerful precisely because it discards between-unit information. But sometimes that information is exactly what you need. Knowing when FE is not appropriate is as important as knowing how to use it.
Between estimator
Collapses panel to country means, then runs OLS on averages
Question: do countries with higher average income have higher average happiness?
Dominated by stable cross-country differences, including \(\eta_i\)
Within estimator (FE)
Uses only deviations from country means over time
Question: when income rises above its own average, does happiness follow?
\(\eta_i\) controlled for by construction → consistent
Pooled OLS mixes both, typically dominated by the between component. Be deliberate about which variation you exploit.
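The three estimators can be compared on simulated data. This is a sketch under made-up assumptions: the true within effect is 0.2 and the unit effect is correlated with \(x\). All names and values are hypothetical.

```r
set.seed(3)
n_units   <- 100
n_periods <- 10
id <- rep(seq_len(n_units), each = n_periods)

eta <- rnorm(n_units)[id]
x   <- eta + rnorm(n_units * n_periods)
y   <- 0.2 * x + eta + rnorm(n_units * n_periods)   # true within effect: 0.2

# Between estimator: collapse to unit means, then run OLS on the means.
x_bar <- as.numeric(tapply(x, id, mean))
y_bar <- as.numeric(tapply(y, id, mean))
b_between <- unname(coef(lm(y_bar ~ x_bar))["x_bar"])

# Within estimator: OLS on deviations from unit means.
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
b_within <- unname(coef(lm(y_dm ~ x_dm))["x_dm"])

# Pooled OLS: uses both sources of variation at once.
b_pooled <- unname(coef(lm(y ~ x))["x"])

print(c(between = b_between, pooled = b_pooled, within = b_within))
```

In a balanced panel like this one, the pooled slope is a weighted average of the between and within slopes, so it always falls between the two.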
Unit FE controls for everything stable across time but varying across units — country culture, institutions, social trust.
But what about shocks that hit all countries in a given year?
These affect all countries in the same period, regardless of their individual characteristics.
If such shocks correlate with our explanatory variable, unit FE alone does not protect us.
Solution: add a separate intercept for each time period.
This controls for any factor that shifts the outcome equally across all units in a given year — observed or not.
The two-way fixed effects model:
\[y_{it} = \underbrace{\alpha_i}_{\text{country FE}} + \underbrace{\gamma_t}_{\text{year FE}} + \beta \, x_{it} + v_{it}\]
\(\hat\beta\) is now identified from variation that is neither explained by which country it is nor which year it is.
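The role of \(\gamma_t\) can be illustrated with a simulated balanced panel in which a common year shock moves both \(x\) and \(y\). This is a sketch with hypothetical names and parameter values; the double-demeaning shortcut at the end is exact only for balanced panels.

```r
set.seed(11)
n_units   <- 40
n_periods <- 12
id   <- rep(seq_len(n_units), each = n_periods)
year <- rep(seq_len(n_periods), times = n_units)

eta   <- rnorm(n_units)[id]                  # unit effect
gamma <- rnorm(n_periods, sd = 2)[year]      # common shock, identical for all units in a year
x <- eta + 0.7 * gamma + rnorm(n_units * n_periods)
y <- 0.5 * x + eta + gamma + rnorm(n_units * n_periods)   # true slope: 0.5

# Unit FE only: the year shocks still confound the slope.
b_unit <- unname(coef(lm(y ~ x + factor(id)))["x"])

# Two-way FE: unit and year dummies.
b_twoway <- unname(coef(lm(y ~ x + factor(id) + factor(year)))["x"])

# Balanced-panel shortcut: subtract unit means and year means, add back the grand mean.
x_dd <- x - ave(x, id) - ave(x, year) + mean(x)
y_dd <- y - ave(y, id) - ave(y, year) + mean(y)
b_dd <- unname(coef(lm(y_dd ~ x_dd))["x_dd"])

print(c(unit_fe = b_unit, twoway_dummies = b_twoway, twoway_demeaned = b_dd))
```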
Easterlin application
Year FE absorbs shocks that shifted happiness everywhere simultaneously — the global financial crisis in 2008, the pandemic in 2020. Without year FE, those shocks inflate or deflate the estimated income–happiness link.
The fixest package
fixest is the modern workhorse for fixed effects estimation in R: fast, flexible, and designed for high-dimensional FE.
Its main function is feols() (fixed effects OLS).
The | operator separates the regression formula from the fixed effects specification, as in happiness ~ log(GDP_pc) | country.
Why not lm() with dummies?
feols() uses the Frisch-Waugh theorem internally — same estimates, but far faster when \(N\) is large (no dummy matrix needed).
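The equivalence can be checked on a small simulated panel. This is a sketch: the data frame `dat_sim` and its columns are made up, and the `feols()` call is guarded so the example still runs where fixest is not installed.

```r
set.seed(5)
n_units   <- 30
n_periods <- 6
dat_sim <- data.frame(
  country = rep(paste0("c", seq_len(n_units)), each = n_periods),
  year    = rep(2001:2006, times = n_units)
)
eta <- rep(rnorm(n_units), each = n_periods)        # unit effect, constant within country
dat_sim$x <- eta + rnorm(nrow(dat_sim))
dat_sim$y <- 0.5 * dat_sim$x + eta + rnorm(nrow(dat_sim))

# Dummy-variable regression: one coefficient per country.
b_lm <- unname(coef(lm(y ~ x + factor(country), data = dat_sim))["x"])

# Same model with feols(): fixed effects go after the | operator.
if (requireNamespace("fixest", quietly = TRUE)) {
  m_fe <- fixest::feols(y ~ x | country, data = dat_sim)
  b_fe <- unname(coef(m_fe)["x"])
  print(c(lm_dummies = b_lm, feols = b_fe))
}
```

With 30 countries the speed difference is invisible; it becomes relevant when the number of units runs into the thousands and the dummy matrix no longer fits comfortably in memory.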
The problem: observations within the same country across years are likely correlated — \(v_{it}\) and \(v_{i,t+1}\) are not independent.
Conventional OLS standard errors assume independent observations. When errors within a country are positively correlated, they are typically too small, leading to overconfident inference.
The solution: cluster standard errors at the unit level.
Standard practice
In panel applications, always cluster at the unit level. Report which clustering you used alongside your results.
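To see what clustering does, the cluster-robust "sandwich" variance can be computed by hand in base R. This is a simplified sketch on simulated data with hypothetical names. It uses a pooled regression and applies no small-sample correction, so the numbers would differ slightly from what a package such as fixest reports.

```r
set.seed(9)
n_units   <- 60
n_periods <- 10
id <- rep(seq_len(n_units), each = n_periods)

# Both the regressor and the error contain a unit-level component,
# so observations within a unit are strongly correlated.
x <- rep(rnorm(n_units), each = n_periods) + 0.5 * rnorm(n_units * n_periods)
u <- rep(rnorm(n_units), each = n_periods) + 0.5 * rnorm(n_units * n_periods)
y <- 1 + 0.5 * x + u

fit <- lm(y ~ x)
X   <- model.matrix(fit)     # n x 2 design matrix
e   <- residuals(fit)

# Bread: (X'X)^{-1}
bread <- solve(crossprod(X))

# Meat: sum over clusters g of (X_g' e_g)(X_g' e_g)'
scores <- rowsum(X * e, group = id)   # one row per cluster, holding X_g' e_g
meat   <- crossprod(scores)

vcov_cluster <- bread %*% meat %*% bread

se_classic <- sqrt(vcov(fit)["x", "x"])
se_cluster <- sqrt(vcov_cluster["x", "x"])
print(c(classic = se_classic, clustered = se_cluster))
```

In this design the clustered standard error is substantially larger than the classical one, which is the overconfidence the slide warns about.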
FE models report an \(R^2\) computed on the demeaned data: how much of the within-country variation in happiness is explained by income.
This is the meaningful fit statistic for FE models.
The overall \(R^2\) can be misleadingly high: the country fixed effects alone absorb most of the total variation in happiness, regardless of how much income explains.
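Both fit statistics can be computed by hand on simulated data. This is a sketch with hypothetical names and values, in which the unit effects are deliberately large.

```r
set.seed(21)
n_units   <- 50
n_periods <- 10
id <- rep(seq_len(n_units), each = n_periods)

eta <- rep(rnorm(n_units, sd = 2), each = n_periods)   # large stable unit differences
x   <- rnorm(n_units * n_periods)
y   <- 0.5 * x + eta + rnorm(n_units * n_periods)

# Overall R2: the unit dummies count as explanatory variables.
r2_overall <- summary(lm(y ~ x + factor(id)))$r.squared

# Within R2: fit on the demeaned data only.
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
r2_within <- summary(lm(y_dm ~ x_dm))$r.squared

print(c(overall = r2_overall, within = r2_within))
```

Both regressions have the same residuals; only the benchmark variation differs, which is why the overall figure can never fall below the within figure.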
fixest reports:
Within R²: fit on demeaned data ← report this
R²: overall (includes between variation) ← less informative
Always show at least three models side by side:
library(fixest)        # feols()
library(modelsummary)  # modelsummary()

# dat: country-year panel with columns happiness, GDP_pc, country, year
m1 <- feols(happiness ~ log(GDP_pc), data = dat) # Pooled OLS
m2 <- feols(happiness ~ log(GDP_pc) | country, vcov = ~country, data = dat) # Country FE
m3 <- feols(happiness ~ log(GDP_pc) | country + year, vcov = ~country, data = dat) # Two-way FE
modelsummary(
  list("Pooled OLS" = m1, "Country FE" = m2, "Two-way FE" = m3),
  gof_map = c("nobs", "r.squared", "r2.within"), stars = TRUE
)

| | Pooled OLS | Country FE | Two-way FE |
|---|---|---|---|
| (Intercept) | -2.307*** | | |
| | (0.132) | | |
| log(GDP_pc) | 0.813*** | 1.259*** | 1.554*** |
| | (0.014) | (0.217) | (0.309) |
| Num.Obs. | 1729 | 1727 | 1727 |
| R2 | 0.672 | 0.928 | 0.930 |
| R2 Within | | 0.150 | 0.142 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Recall from Session 5: a regression coefficient is causal only if you have an identification argument — a claim that the variation in \(X\) being exploited is “as good as random” with respect to \(Y\).
Pooled OLS has no such argument. It mixes within- and between-country variation; stable country differences (institutions, culture, geography) contaminate \(\hat\beta\).
Fixed effects provides one:
“By comparing each country to itself over time, I remove all stable between-country differences — observed or not. The remaining income variation is not contaminated by time-invariant confounders.”
In DAG terms
Country FE is the panel equivalent of blocking all backdoor paths that run through \(\eta_i\) — not by measuring culture, institutions, or geography, but by asking each country to serve as its own counterfactual.
FE handles time-invariant confounders. Three threats remain:
1. Time-varying confounders
Variables that change over time and independently affect both income and happiness — e.g. a political reform that simultaneously boosts growth and improves governance. Partial remedies: year FE (common shocks), explicit controls (Gini, unemployment).
2. No feedback loops
FE requires strict exogeneity: past happiness must not affect future income. If dissatisfied workers reduce effort → lower income → lower happiness, the within-country coefficient is biased even after demeaning.
3. FE is one step on a longer ladder
| Design | Controls for |
|---|---|
| Pooled OLS | Nothing |
| Fixed effects | Time-invariant confounders |
| Difference-in-differences | + common time trends, under the parallel-trends assumption |
| Randomised experiment | Everything — by design |
Session 7
The working-from-home experiment is an RCT. Its coefficient can often claim a causal label without any FE. But this comes with its own costs…
feols(happiness ~ log(GDP_pc) | country + year) gives the within-country, year-adjusted association between income and happiness.
Already controlled for:
Not yet controlled for: time-varying factors that change differently across countries and independently affect both income and happiness.
What to control for?
Which variables you should control for depends on your research question. Are you interested in the total within-country association between income and happiness? In the share of that association that is due to income improving healthcare? Or in something else entirely?
Most WHR variables — social support, healthy life expectancy, freedom to choose — are likely mediators, not confounders.
Better time-varying controls for this analysis:
Live demo
We work through this in the demo: baseline FE → adding Gini and unemployment → unpacking where income works directly, indirectly, and not at all.
TBA
FE is the foundation — it handles most applied work in business and economics. Several important extensions exist:
| Method | When to use |
|---|---|
| Random Effects (RE) | \(\eta_i\) plausibly uncorrelated with \(x\); time-invariant covariates matter |
| First Differences (FD) | Research question is about change; errors may follow a random walk |
| Dynamic Panel / GMM | Lagged outcome \(y_{i,t-1}\) is a regressor; standard FE is biased |
| Long panel methods | Large \(T\): non-stationarity, cointegration |
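As one concrete case from the table, the first-difference estimator can be sketched in a few lines of base R. With exactly two periods per unit it reproduces the FE estimate; the simulated data and all names are hypothetical.

```r
set.seed(2)
n_units <- 100
id <- rep(seq_len(n_units), each = 2)        # two periods per unit, ordered by unit then period

eta <- rep(rnorm(n_units), each = 2)
x   <- eta + rnorm(2 * n_units)
y   <- 0.5 * x + eta + rnorm(2 * n_units)

# First differences: period 2 minus period 1 for each unit; eta_i drops out.
dx <- x[c(FALSE, TRUE)] - x[c(TRUE, FALSE)]
dy <- y[c(FALSE, TRUE)] - y[c(TRUE, FALSE)]
b_fd <- unname(coef(lm(dy ~ 0 + dx))["dx"])  # no intercept, matching the model without year effects

# Within (FE) estimator on the same data.
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
b_fe <- unname(coef(lm(y_dm ~ x_dm))["x_dm"])

print(c(first_difference = b_fd, fixed_effects = b_fe))
```

With more than two periods the two estimators generally differ, and the choice depends on the assumed behaviour of the errors, as noted in the table.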
Panel data structure: \(N\) units × \(T\) periods, indexed \((i,t)\); balanced vs. unbalanced; composite error \(\epsilon_{it} = \eta_i + v_{it}\).
Why pooled OLS fails: \(\eta_i\) correlated with \(x_{it}\) → OVB.
The FE estimator: within-transformation removes \(\eta_i\); identifies within-unit variation only; cannot estimate time-invariant covariates.
Between vs. within: two different questions, two different answers — be deliberate about which variation you use.
Two-way FE: adds year fixed effects \(\gamma_t\) to control for common period shocks (recessions, pandemics).
In R: fixest::feols(), cluster at country level, report within-\(R^2\), compare models with modelsummary.
When presenting panel results, always:
Report the within-\(R^2\); use gof_map in modelsummary to select it
Good practice
A footnote stating “Country and year fixed effects included in columns (2) and (3)” is always cleaner than listing them as coefficient rows.