What Can Go Wrong

Biases and Diagnostics · Session 4

Claudius Gräbner-Radkowitsch

2026-05-21

Today’s session

When can we trust OLS output? — the Gauss-Markov framework
Functional form and the linearity assumption
Heteroskedasticity and robust standard errors
Multicollinearity
Influential observations
Normality of residuals
What diagnostics cannot detect: omitted variable bias

A regression output we need to interrogate

We have estimated hotel price as a function of distance and star rating.

Are these coefficient estimates pointing in the right direction?
Are the standard errors — and the stars — reliable?
Is there anything the diagnostics cannot reveal?

	(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept)	-30.70***
	(5.15)
distance	-4.65***
	(0.48)
stars	57.13***
	(1.41)
Num.Obs.	9648
R2	0.156
R2 Adj.	0.155

Answering these questions requires both theoretical reasoning and formal diagnostics.

The OLS Contract

BLUE: a guarantee with conditions

Best Linear Unbiased Estimator — OLS is BLUE when its assumptions hold

Theorem about properties of the OLS sampling distribution
Unbiased: on average, the estimate equals the true value — \(E[\hat{\beta}] = \beta\)
Efficient (Best): among all linear unbiased estimators, OLS has the smallest variance
These are different properties, protected by different assumptions

Unbiasedness = pointing in the right direction on average
Efficiency = grouping tightly around the truth

Unbiasedness and efficiency: visualised

Blue: correct answer, precise. Gold: correct direction, imprecise. Red: wrong answer entirely.

The assumption checklist

#	Assumption	Formula	Plain English	Cost if violated
1	Linearity	\(y = X\beta + u\)	Correct functional form	Biased estimates
2	No perfect collinearity	\(\text{rank}(X) = k\)	No predictor is an exact linear combination of the others	Cannot estimate
3	Exogeneity	\(E[u \mid X] = 0\)	Errors unrelated to the predictors (e.g. no omitted confounders)	Biased estimates
4	Homoskedasticity	\(\text{Var}(u \mid X) = \sigma^2\)	Constant error variance	SEs are wrong
(5)	(Normality)	\(u \mid X \sim \mathcal{N}(0,\sigma^2)\)	Normal errors	Only exact small-sample inference is lost

Assumptions 1 & 3 (linearity, exogeneity) broken → wrong answer even in large samples
Assumption 4 (homoskedasticity) broken → right answer, wrong uncertainty

Why \(E[u \mid X] = 0\) fails: five routes

Omitted variable bias — a relevant determinant of \(y\) is left in \(u\) and correlated with \(X\) (e.g. ability in wage regressions)
Measurement error in regressors — if \(X\) is observed with error, the observed regressor is correlated with the composite error term; with classical (purely random) measurement error the estimate is attenuated towards zero
Simultaneity / reverse causality — \(X\) affects \(y\) and \(y\) affects \(X\); the regressor is jointly determined with the outcome (price–quantity, income–health)
Dynamic models with serially correlated errors — including \(y_{t-1}\) as a regressor while \(u_t\) is autocorrelated makes the lag endogenous
Functional form misspecification — wrong transformation or omitted interactions push systematic structure into \(u\), which then correlates with included regressors

All five routes share one consequence: \(\hat{\beta}\) is biased and inconsistent — wrong even in large samples.

Our diagnostic dashboard

One function covers five of the six checks:

performance::check_model(model_base)

Each panel targets one assumption — or one practical robustness concern (influential observations). We will go through them in order.

Our diagnostic dashboard — output

Functional Form

Linearity in parameters ≠ linearity in variables

OLS requires the model to be linear in the parameters — not in the raw variables.

Transformation	Interpretation of \(\hat{\beta}\)
`y ~ x` (level–level)	+1 unit of x → \(+\hat{\beta}\) units of y
`log(y) ~ x` (log–level)	+1 unit of x → approximately \(+(\hat{\beta} \times 100)\%\) change in y (for small \(\hat{\beta}\))
`y ~ log(x)` (level–log)	+1% change in x → \(+\hat{\beta}/100\) units of y
`log(y) ~ log(x)` (log–log)	+1% change in x → \(+\hat{\beta}\%\) change in y
`y ~ x + I(x^2)` (quadratic)	Relationship has a turning point

You already used log transformations in Task 1 — this session extends that logic.
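A short derivation may help make the log–level row precise. Assuming the model \(\log(y) = \beta_0 + \beta_1 x + u\), a one-unit increase in \(x\) multiplies \(y\) by \(e^{\beta_1}\), so the exact percentage change is:

\[\%\Delta y = 100 \times \left(e^{\hat{\beta}_1} - 1\right) \approx 100 \times \hat{\beta}_1 \quad \text{for small } |\hat{\beta}_1|\]

For \(\hat{\beta}_1 = 0.05\) the exact change is about 5.13%, close to the 5% shortcut. For \(\hat{\beta}_1 = 0.5\) it is about 64.9%, so the shortcut of 50% is no longer adequate.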

Residuals vs. fitted: the primary diagnostic

The red smoother should be flat. A systematic curve means something is missing.

Hotels-europe: linear vs. log-log

The RESET test

Formal test for functional form misspecification (lmtest::resettest()).

resettest(m_linear)


    RESET test

data:  m_linear
RESET = 7.1487, df1 = 2, df2 = 9644, p-value = 0.00079

Warning

A rejection tells you something is wrong — not what to change. Use the residual plot to guide the fix; don’t let the test prescribe it.
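What the test does under the hood can be sketched in base R. This is a minimal illustration on simulated data (not the hotels data), assuming the default variant of the test, which adds the squared and cubed fitted values and tests them jointly. All object names are made up for the example.

```r
set.seed(1)
n <- 500
x <- runif(n, 0.5, 10)
y <- 50 + 80 / x + rnorm(n, sd = 5)   # simulated: the true relationship is curved in x

m_lin <- lm(y ~ x)                    # restricted model: linear in x
yhat  <- fitted(m_lin)
m_aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))  # augmented with powers of the fitted values

# F-test of the restricted against the augmented model
reset_by_hand <- anova(m_lin, m_aug)
print(reset_by_hand)
p_val <- reset_by_hand$`Pr(>F)`[2]
```

The two added terms explain why the output above reports df1 = 2. A small p-value says the powers of the fitted values carry information the linear model missed.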

Fix 1 — Log transformation for a multiplicative relationship

lm(log(price) ~ log(distance + 0.1), data = hotels)

Interpretation: \(\hat{\beta}\) is an elasticity — a 1% increase in \(x\) is associated with a \(\hat{\beta}\)% change in \(y\). Example: log(price) ~ log(distance) with \(\hat{\beta} = -0.12\) → “10% further from centre → price 1.2% lower.”

Fix 2 — Quadratic term for a curved relationship

lm(price ~ distance + I(distance^2), data = hotels)

Interpretation: the marginal effect of \(x\) is \(\hat{\beta}_1 + 2\hat{\beta}_2 x\) — it varies with \(x\). Turning point at \(x^* = -\hat{\beta}_1 / (2\hat{\beta}_2)\). Note: with \(\hat{\beta}_1 > 0\) and \(\hat{\beta}_2 < 0\), the effect of \(x\) on \(y\) diminishes and eventually reverses.
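The two formulas can be applied directly to the fitted coefficients. The sketch below uses simulated data with a known turning point at \(x = 6\); the variable names and the helper `marginal_effect()` are hypothetical.

```r
set.seed(2)
n <- 400
x <- runif(n, 0, 10)
y <- 10 + 6 * x - 0.5 * x^2 + rnorm(n)   # simulated: true turning point at x = 6

m_quad <- lm(y ~ x + I(x^2))
b1 <- coef(m_quad)[["x"]]
b2 <- coef(m_quad)[["I(x^2)"]]

# Marginal effect of x evaluated at a chosen value x0
marginal_effect <- function(x0) b1 + 2 * b2 * x0
turning_point   <- -b1 / (2 * b2)

c(effect_at_0 = marginal_effect(0),
  effect_at_10 = marginal_effect(10),
  turning_point = turning_point)
```

Always check whether the turning point lies inside the observed range of \(x\). If it lies outside, the curve bends but never actually reverses in the data.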

Heteroskedasticity

Non-constant error variance

OLS assumes \(\text{Var}(u_i \mid X) = \sigma^2\) for all \(i\) — the same uncertainty everywhere.

What goes wrong when this fails:

OLS estimates are still unbiased — the direction and magnitude are correct on average
But the classical standard errors are inconsistent — they may be too small or too large, depending on how the error variance relates to the regressors
t-tests and confidence intervals mislead you about precision

Example: hotel prices in Istanbul cluster tightly around the regression line; London hotels scatter much more widely. A single SE cannot represent both groups honestly.

Scale-location plot: the visual test

The red smoother should be flat and horizontal.

Hotels-europe: heteroskedasticity in practice

The Breusch-Pagan test

bptest(model_base)


    studentized Breusch-Pagan test

data:  model_base
BP = 30.437, df = 2, p-value = 2.459e-07

Null: homoskedasticity (constant variance)
Rejection → heteroskedasticity present — standard errors are not reliable

Note

The test confirms what the plot already showed. Always look at the plot first — the test just formalises the diagnosis.
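The studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the predictors and multiply the \(R^2\) of that auxiliary regression by \(n\). A minimal sketch on simulated data, with illustrative object names:

```r
set.seed(3)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # simulated: error sd grows with x

m   <- lm(y ~ x)
aux <- lm(resid(m)^2 ~ x)                 # do the predictors explain the squared residuals?

bp_stat <- n * summary(aux)$r.squared     # LM statistic
bp_df   <- length(coef(aux)) - 1          # number of predictors in the auxiliary regression
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

The degrees of freedom equal the number of predictors, which is why the hotels output above shows df = 2 for distance and stars.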

The fix: Heteroskedasticity-robust (HC) standard errors

Replace the standard formula with a “sandwich” estimator.
Uses each observation’s actual squared residual to estimate uncertainty.
Result: observations the model fits poorly contribute more to the estimated uncertainty; robust SEs can be larger or smaller than the classical ones.
Estimates do not change.

modelsummary(list("Standard" = model_base, "Robust (HC1)" = model_base),
             vcov = list("classical", "HC1"), stars = TRUE)

	Standard	Robust (HC1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept)	-30.70***	-30.70***
	(5.15)	(5.69)
distance	-4.65***	-4.65***
	(0.48)	(0.31)
stars	57.13***	57.13***
	(1.41)	(1.84)
Num.Obs.	9648	9648
R2	0.156	0.156
R2 Adj.	0.155	0.155
Std.Errors	IID	HC1

Rule of thumb: with cross-sectional data, use robust SEs by default.
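To see what the sandwich estimator computes, the HC1 covariance matrix can be built from its pieces in base R. This is a sketch on simulated data; in practice one would rely on a tested implementation (for example `sandwich::vcovHC(m, type = "HC1")`) rather than on this hand-rolled version.

```r
set.seed(4)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # simulated heteroskedastic errors

m <- lm(y ~ x)
X <- model.matrix(m)
e <- resid(m)
k <- ncol(X)

bread <- solve(crossprod(X))              # (X'X)^{-1}
meat  <- crossprod(X * e)                 # X' diag(e^2) X: row i of X is scaled by e_i
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread   # HC1 small-sample correction

se_classical <- sqrt(diag(vcov(m)))
se_hc1       <- sqrt(diag(vcov_hc1))
rbind(classical = se_classical, HC1 = se_hc1)
```

The coefficients in `coef(m)` are untouched; only the covariance matrix used for inference changes.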

Multicollinearity

When predictors move together

Perfectly correlated predictors: OLS breaks down

Highly correlated predictors: isolating individual contributions becomes difficult

Consequence:

Estimates remain unbiased — this is an efficiency problem, not a bias problem
Standard errors inflate — imprecise, unstable estimates
Small changes to the sample can flip signs or change magnitudes substantially

Example: imagine adding both stars (hotel classification) and rating (guest satisfaction score) to the model. Both measure “quality” — how to check whether this is problematic?

VIF: the Variance Inflation Factor

VIF: the factor by which the variance of \(\hat{\beta}_j\) is inflated relative to a model in which predictor \(j\) is uncorrelated with the other predictors.

\[\text{VIF}_j = \frac{1}{1 - R^2_j}\]

\(R^2_j\): the \(R^2\) from regressing predictor \(j\) on all other predictors.
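The formula translates into a few lines of base R: regress each predictor on all the others and plug the resulting \(R^2_j\) into \(1/(1 - R^2_j)\). The helper `vif_by_hand()` and the simulated variables below are hypothetical and only mimic the hotels setting.

```r
set.seed(5)
n <- 500
stars     <- sample(1:5, n, replace = TRUE)   # simulated star ratings
distance  <- rexp(n, rate = 0.5)              # simulated distances
log_stars <- log(stars)

# VIF for every column of a data frame of predictors
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    f  <- reformulate(setdiff(names(data), v), response = v)
    r2 <- summary(lm(f, data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_ok  <- vif_by_hand(data.frame(distance, stars))
vif_bad <- vif_by_hand(data.frame(distance, stars, log_stars))
vif_ok
vif_bad
```

Adding `log_stars` leaves the VIF of `distance` close to 1 but pushes the VIFs of both star variables far above 10.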

VIF	Interpretation
1	No inflation — predictor is uncorrelated with others
1–5	Mild — generally acceptable
5–10	Moderate concern — inspect carefully
> 10	Serious — standard errors substantially inflated

Hotels-europe: low vs. problematic VIF

Base model — distance and stars measure different things: VIF is fine.

vif(model_base)   # price ~ distance + stars

distance    stars 
1.003385 1.003385

Redundant model — a researcher includes both stars and log(stars) to “try both functional forms.” They are near-identical (r ≈ 0.97): VIF explodes.

model_collinear <- lm(price ~ distance + stars + log(stars), data = hotels)
vif(model_collinear)

  distance      stars log(stars) 
  1.004235  20.543854  20.526349

Warning

Including a variable together with a monotonic transformation of it is a common source of severe collinearity. The model cannot separate their effects, so the SEs inflate and the individual coefficients become unstable. Drop one or the other.

What to do about multicollinearity

Do not drop variables mechanically based on VIF — think about what each variable represents
Dropping an important control to reduce VIF can introduce omitted variable bias (a far worse problem)
If two variables genuinely measure the same construct, keep only one — or combine them into a single index (e.g. a simple average, or the first principal component)
Sometimes collinearity is just a feature of reality — acknowledge it and note that individual coefficients should be interpreted with caution

Warning

Interaction terms (x * z) typically produce high VIF for their component variables when those variables are not centred around zero. This is expected and harmless — centering the variables reduces VIF cosmetically but does not change inference about the interaction.
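The claim about centering can be checked on simulated data. In the sketch below (all names illustrative), centering removes most of the correlation between a variable and its interaction term, yet the interaction estimate, its standard error and the fitted values are identical.

```r
set.seed(6)
n <- 500
x <- rnorm(n, mean = 10, sd = 2)
z <- rnorm(n, mean = 5, sd = 1)
y <- 1 + 0.5 * x + 0.8 * z + 0.3 * x * z + rnorm(n)

m_raw <- lm(y ~ x * z)
xc <- x - mean(x)
zc <- z - mean(z)
m_cen <- lm(y ~ xc * zc)

# Correlation between a component and the product term, before and after centering
c(raw = cor(x, x * z), centred = cor(xc, xc * zc))

# Estimate, SE, t and p for the interaction: same in both models
int_raw <- coef(summary(m_raw))["x:z", ]
int_cen <- coef(summary(m_cen))["xc:zc", ]
rbind(raw = int_raw, centred = int_cen)
```

The coefficients on `x` and `z` themselves do change, because after centering they describe the effect at the mean of the other variable rather than at zero.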

Influential Observations

Three types of unusual data points

Type	What makes it unusual	Does it distort results?
Outlier	Large residual — unusual y given x	Can do, but not always
High-leverage point	Extreme position in x space	Can do, but not always
Influential observation	Removing it substantially changes the estimates	Yes, by definition

The dangerous case: a point that is both an outlier and high-leverage.

Cook’s distance

\[D_i = \frac{(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})^\top (X^\top X) (\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})}{p \cdot \hat{\sigma}^2}\]

\(D_i\): how much the estimated coefficients shift when observation \(i\) is removed, scaled by the number of estimated coefficients \(p\) and the residual variance \(\hat{\sigma}^2\).

Rule of thumb: \(D_i > 4/n\) flags an observation worth investigating.
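The formula can be verified against R's built-in `cooks.distance()` by actually dropping one observation and refitting. A minimal sketch on simulated data with one planted high-leverage outlier (hypothetical values):

```r
set.seed(7)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$x[n] <- 6                      # planted point: extreme x ...
d$y[n] <- -5                     # ... and far from the true line

m  <- lm(y ~ x, data = d)
X  <- model.matrix(m)
p  <- ncol(X)                    # number of estimated coefficients
s2 <- summary(m)$sigma^2         # residual variance of the full model

m_loo  <- lm(y ~ x, data = d[-n, ])        # refit without observation n
d_beta <- coef(m) - coef(m_loo)
D_by_hand <- drop(t(d_beta) %*% crossprod(X) %*% d_beta) / (p * s2)

c(by_hand = D_by_hand, builtin = unname(cooks.distance(m)[n]))
flagged <- which(cooks.distance(m) > 4 / n)   # rule-of-thumb threshold
```

The built-in function uses an algebraic shortcut, so no refitting is needed in practice.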

Example: one influential point

The dashed line is the original fit. One point pulls the slope down substantially.

Hotels-europe: who are the influential hotels?

augment(model_base) |>
  bind_cols(hotels |> select(hotel_id, city)) |>
  arrange(desc(.cooksd)) |>
  select(hotel_id, city, stars, price, .cooksd) |>
  head(8)

# A tibble: 8 × 5
  hotel_id city      stars price .cooksd
     <dbl> <chr>     <dbl> <dbl>   <dbl>
1    14797 Prague        4  3714  0.0477
2    14797 Prague        4  3714  0.0477
3     1392 Barcelona     1  1213  0.0313
4     1392 Barcelona     1  1213  0.0313
5     8321 London        5  1848  0.0295
6    13394 Paris         5  1364  0.0139
7    14307 Paris         5  1265  0.0112
8    12644 Paris         5  1218  0.0101

Investigate — don’t delete

An influential observation is not automatically an error. It may be the most interesting case in the dataset.
Ask: why is this point influential? Data entry error, genuine extreme case, or a different sub-population?
Workflow: re-estimate without the flagged observations. Report both results and note the sensitivity.
Silently removing influential points without disclosure is a serious methodological error.
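The with/without workflow might look as follows. The data are simulated and the two aberrant rows are planted on purpose, so this is only a sketch of the reporting pattern, not an analysis of the hotels data.

```r
set.seed(8)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$x[1:2] <- c(5, 6)              # two hypothetical aberrant rows
d$y[1:2] <- c(-4, -6)

m_all  <- lm(y ~ x, data = d)
flag   <- cooks.distance(m_all) > 4 / n     # rule-of-thumb threshold
m_trim <- lm(y ~ x, data = d[!flag, ])

# Report both sets of estimates side by side, plus how many rows were flagged
rbind(all_obs = coef(m_all), without_flagged = coef(m_trim))
sum(flag)
```

Both rows of the comparison belong in the write-up, together with the reason the flagged observations were singled out.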

Normality of Residuals

The Q-Q plot: good vs. violated

A widespread misconception

What normality actually protects: Exact finite-sample inference — t- and F-statistics follow their named distributions exactly only when errors are normal.

Why it rarely matters in practice:

With moderate-to-large \(n\): the Central Limit Theorem ensures \(\hat{\beta}\) is approximately normally distributed regardless of the residual distribution
For our hotels dataset (\(n \approx 9{,}600\)): normality of residuals is essentially irrelevant for inference
When it does matter: very small samples (\(n < 30\)), exact prediction intervals
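A small simulation can make the CLT argument concrete. The sketch below draws strongly skewed errors, re-estimates the slope many times and looks at the distribution of the estimates. The sample size, number of replications and the helper `skew()` are arbitrary choices for illustration.

```r
set.seed(9)
n    <- 200
reps <- 2000

slopes <- replicate(reps, {
  x <- runif(n)
  u <- rexp(n) - 1               # right-skewed errors with mean zero
  y <- 1 + 2 * x + u             # true slope is 2
  coef(lm(y ~ x))[["x"]]
})

# Simple moment-based skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

c(mean_slope = mean(slopes), skewness_of_slopes = skew(slopes))
hist(slopes, breaks = 40, main = "Sampling distribution of the slope")
```

The errors have a skewness of 2, yet the histogram of slope estimates should look close to symmetric and bell-shaped.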

The Shapiro-Wilk trap

shapiro.test(resid(model_base)[1:5000])  # max n for shapiro.test is 5000


    Shapiro-Wilk normality test

data:  resid(model_base)[1:5000]
W = 0.80384, p-value < 2.2e-16

Warning

With large \(n\), Shapiro-Wilk almost always rejects — it detects trivially small deviations from normality that have no practical consequence for inference.

Do not interpret a significant p-value here as a problem. Check the Q-Q plot visually instead.

The Posterior Predictive Check

What the PPC shows

The PPC simulates new response values from the fitted model many times and compares their distribution to the observed data.

The PPC catches what the other five panels cannot: a fundamental mismatch between the model’s implied data-generating process and the actual distribution of \(y\).

PPC: level-level vs. log-log

Blue line: density of the observed outcome \(y\)
Green lines: densities of outcomes simulated from the fitted model
If green ≈ blue → the model generates plausible data → distributional assumptions are reasonable
If they diverge → the model produces values in the wrong range → mis-specified DGP
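The idea behind the panel can be imitated with base R's `simulate()`, which draws new responses from a fitted `lm`. This is a rough sketch on simulated, right-skewed "prices"; it is not how `check_model()` is implemented internally.

```r
set.seed(10)
n <- 1000
x <- runif(n, 0, 5)
price <- exp(4 - 0.2 * x + rnorm(n, sd = 0.6))   # simulated: skewed, strictly positive

m_level <- lm(price ~ x)
m_log   <- lm(log(price) ~ x)

sim_level <- simulate(m_level, nsim = 50)        # 50 simulated outcome vectors
sim_log   <- exp(simulate(m_log, nsim = 50))     # back-transformed to the price scale

# Share of simulated prices that are negative, which is impossible for real prices
share_negative <- c(level = mean(sim_level < 0), log = mean(sim_log < 0))
share_negative

# Overlay: observed density (thick) against simulated densities from the level model
plot(density(price), lwd = 2, main = "PPC by hand: level model")
for (j in seq_len(ncol(sim_level))) lines(density(sim_level[[j]]), col = "grey")
```

The level model generates negative prices, while the log model cannot, which is the kind of mismatch the PPC panel is meant to expose.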

Dashboard panels 1–2: functional form and variance

Linearity — residuals vs. fitted; the red smoother should be flat. Systematic curve → functional form problem
Homogeneity of variance — scale-location; the smoother should be horizontal. Remaining fan shape → use robust SEs even with the log model

Dashboard panels 3–4: collinearity and normality

Collinearity (VIF) — bars should be short (< 5). Both predictors are fine; recall the stars + log(stars) example where VIF ≈ 20
Normality of residuals (Q-Q) — much closer to the diagonal than model_base (skewness −0.1 vs. 7.8); with \(n > 9{,}000\) the remaining departure does not affect inference

Dashboard panels 5–6: influential observations and overall fit

Outliers / influential observations — Cook’s D bar and leverage–residuals scatter; points outside the contour lines warrant investigation (report with and without, do not silently delete)
Posterior predictive check — green lines now closely track the blue observed density. The log transformation resolved the distributional mismatch we saw with model_base.

What Diagnostics Cannot Detect

What residuals cannot tell you

All diagnostic panels show properties of the residuals — the part of \(y\) the model did not explain.

But the main routes to endogeneity (\(E[u \mid X] \neq 0\)) leave no trace in the residuals, because OLS makes the residuals uncorrelated with the included regressors by construction:

Omitted variable bias — a confounder is absorbed into \(\hat{\beta}\); the fit looks fine
Measurement error in regressors — attenuation bias hides behind well-behaved residuals
Simultaneity / reverse causality — the endogenous regressor leaves no obvious residual pattern

All three share the same logic: the bias is inside \(\hat{\beta}\), not visible in \(\hat{u}\).

We focus on OVB — the most common case in business and economics research.

The mechanism

Suppose the true model is:

\[y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon\]

but we estimate \(y = \beta_0 + \beta_1 x + u\) (omitting \(z\)), so the error term becomes \(u = \beta_2 z + \varepsilon\).

The estimated \(\hat{\beta}_1\) absorbs part of the effect of \(z\) — and is biased unless \(z\) is uncorrelated with \(x\).

Direction-of-bias rule (no algebra needed):

\[\text{sign(bias)} = \text{sign}(\text{effect of } z \text{ on } y) \times \text{sign}(\text{corr}(z, x))\]
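For readers who want the algebra behind the rule: using the true model above and writing \(\tilde{\beta}_1\) for the slope from the regression that omits \(z\) (notation introduced here), the standard result is

\[\text{plim}\,\tilde{\beta}_1 = \beta_1 + \beta_2\,\delta_1, \qquad \delta_1 = \frac{\text{Cov}(x, z)}{\text{Var}(x)}\]

The bias is \(\beta_2 \delta_1\). Since \(\text{Var}(x) > 0\), \(\delta_1\) has the same sign as \(\text{corr}(z, x)\), which gives the sign rule. Wage example: ability raises wages (\(\beta_2 > 0\)) and is positively correlated with education (\(\delta_1 > 0\)), so omitting ability biases the return to education upward.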

OVB in practice: what changes when we add city?

m_no_city   <- lm(log(price) ~ log(distance + 0.1), data = hotels)
m_with_city <- lm(log(price) ~ log(distance + 0.1) + city, data = hotels)

	Without city	With city controls
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
log(distance)	-0.064***	-0.214***
	(0.008)	(0.007)
Num.Obs.	9648	9648
R2	0.007	0.370
R2 Adj.	0.007	0.370
City FE	No	Yes

The distance coefficient changes substantially — because city drives both price level and how far hotels are from the centre.

Why the residuals looked fine

Run the full diagnostic suite on the biased model (no city controls):

Both panels pass \(\rightarrow\) residuals scatter randomly, variance looks stable. Yet we know from the previous slide that \(\hat{\beta}_\text{distance}\) is wrong.

Important

The bias was absorbed into \(\hat{\beta}\) when the model was fitted — the residuals have already been “cleaned” of it. OVB is invisible to any residual-based diagnostic. The solution is research design, not better diagnostics.
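Both points can be checked in a simulation where the truth is known. In the sketch below (simulated data, illustrative names) the short regression is badly biased, the bias equals the long-regression coefficient on \(z\) times the slope from regressing \(z\) on \(x\), and the residuals are nevertheless uncorrelated with \(x\).

```r
set.seed(11)
n <- 2000
z <- rnorm(n)                          # the confounder
x <- 0.7 * z + rnorm(n)                # x is correlated with z
y <- 1 + 1 * x + 2 * z + rnorm(n)      # true effect of x is 1

m_short <- lm(y ~ x)                   # omits z
m_long  <- lm(y ~ x + z)
delta   <- coef(lm(z ~ x))[["x"]]      # slope of z on x

b_short    <- coef(m_short)[["x"]]
b_rebuilt  <- coef(m_long)[["x"]] + coef(m_long)[["z"]] * delta
resid_corr <- cor(resid(m_short), x)

c(short = b_short, long_plus_bias = b_rebuilt, cor_resid_x = resid_corr)
```

The first two numbers coincide and sit far from 1, while the residual correlation is zero up to rounding error. No residual plot can flag a problem that OLS has removed by construction.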

The way forward

Panel data and fixed effects (panel data session):

If the omitted variable is time-invariant (e.g. which city), fixed effects eliminate it
One of the most powerful practical solutions to OVB
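A preview of the mechanics, on simulated data with an unobserved city-level effect: including city dummies gives the same slope as demeaning \(x\) and \(y\) within each city, and both remove the bias of the pooled regression. Names and parameter values are made up for the illustration.

```r
set.seed(12)
n_city <- 20
n_per  <- 50
city   <- factor(rep(seq_len(n_city), each = n_per))

city_effect <- rnorm(n_city, sd = 2)[as.integer(city)]   # unobserved, constant within city
x <- 0.5 * city_effect + rnorm(n_city * n_per)           # x is correlated with the city effect
y <- 1 - 0.2 * x + city_effect + rnorm(n_city * n_per)   # true slope is -0.2

b_pooled <- coef(lm(y ~ x))[["x"]]            # ignores city: biased
b_dummy  <- coef(lm(y ~ x + city))[["x"]]     # city dummies (fixed effects)

x_w <- x - ave(x, city)                       # within transformation: subtract city means
y_w <- y - ave(y, city)
b_within <- coef(lm(y_w ~ x_w))[["x_w"]]

c(pooled = b_pooled, dummies = b_dummy, within = b_within)
```

The point estimates from the dummy and within versions are identical. The standard errors from the naive within regression are not, because it does not account for the estimated city means.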

Causal identification (optional session on causality):

Randomisation, quasi-experiments
Address OVB at the design stage, not the estimation stage

The transition from describing patterns to making causal claims is one of the most important methodological steps in applied data science.

Summary

The diagnostic workflow

For every regression, ask:

Residuals vs. fitted — is the functional form correct? Is there a pattern?
Scale-location — is the variance constant? Do I need robust SEs?
Cook’s distance / leverage — are a few observations driving my results?
VIF — are my predictors too correlated to separate their effects?
Q-Q plot — are there serious departures from normality? (Large \(n\): rarely a concern)
Thinking — what important variable might I have left out?

performance::check_model(your_model)  # panels 1–5 in one call

The sixth question — OVB — requires subject-matter knowledge, not a function.