What Can Go Wrong

Biases and Diagnostics · Session 4

Claudius Gräbner-Radkowitsch

2026-05-21

Today’s session

  1. When can we trust OLS output? — the Gauss-Markov framework
  2. Functional form and the linearity assumption
  3. Heteroskedasticity and robust standard errors
  4. Multicollinearity
  5. Influential observations
  6. Normality of residuals
  7. What diagnostics cannot detect: omitted variable bias

A regression output we need to interrogate

We have estimated hotel price as a function of distance and star rating.

  • Are these coefficient estimates pointing in the right direction?
  • Are the standard errors — and the stars — reliable?
  • Is there anything the diagnostics cannot reveal?
  • This requires theoretical considerations and formal diagnostics

                 (1)
(Intercept)    -30.70***
               (5.15)
distance       -4.65***
               (0.48)
stars          57.13***
               (1.41)
Num.Obs.       9648
R2             0.156
R2 Adj.        0.155
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

The OLS Contract

BLUE: a guarantee with conditions

Best Linear Unbiased Estimator — OLS is BLUE when its assumptions hold

  • Theorem about properties of the OLS sampling distribution
  • Unbiased: on average, the estimate equals the true value — \(E[\hat{\beta}] = \beta\)
  • Efficient (Best): among all linear unbiased estimators, OLS has the smallest variance
  • These are different properties, protected by different assumptions

Unbiasedness = pointing in the right direction on average
Efficiency = grouping tightly around the truth

Unbiasedness and efficiency: visualised

Blue: correct answer, precise. Gold: correct direction, imprecise. Red: wrong answer entirely.
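The two properties can be checked by simulation. The sketch below is an illustration in base R under an assumed data-generating process (true slope 2; all object names are hypothetical). It compares OLS on the full sample with a second linear unbiased estimator that throws away half of the data.

```r
set.seed(1)
true_slope <- 2

# One simulated sample, two estimators of the same slope
one_draw <- function(n = 100) {
  x   <- runif(n, 0, 10)
  y   <- 1 + true_slope * x + rnorm(n, sd = 3)
  idx <- seq_len(n / 2)                       # first half of the sample only
  c(ols  = unname(coef(lm(y ~ x))[2]),
    half = unname(coef(lm(y[idx] ~ x[idx]))[2]))
}

draws <- replicate(1000, one_draw())          # 2 x 1000 matrix of estimates
rowMeans(draws)       # both averages sit close to the true slope: unbiased
apply(draws, 1, sd)   # the full-sample OLS estimates are grouped more tightly
```

Both estimators point in the right direction on average; only the spread differs, which is what efficiency refers to.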

The assumption checklist

#   Assumption                Formula                                      Plain English                                              Cost if violated
1   Linearity                 \(y = X\beta + u\)                           Correct functional form                                    Biased estimates
2   No perfect collinearity   \(\text{rank}(X) = k\)                       No predictor is an exact linear combination of the others  Cannot estimate
3   Exogeneity                \(E[u \mid X] = 0\)                          No omitted confounders                                     Biased estimates
4   Homoskedasticity          \(\text{Var}(u \mid X) = \sigma^2\)          Constant error variance                                    SEs are wrong
5   (Normality)               \(u \mid X \sim \mathcal{N}(0,\sigma^2)\)    Normal errors                                              Exact inference only

Assumptions 1 & 3 broken → wrong answer even in large samples
Assumption 4 broken → right answer, wrong uncertainty

Why \(E[u \mid X] = 0\) fails: five routes

  • Omitted variable bias: a relevant determinant of \(y\) is left in \(u\) and correlated with \(X\) (e.g. ability in wage regressions)
  • Measurement error in regressors: if \(X\) is observed with noise, the noise ends up in \(u\) and is correlated with the observed regressor; with classical (purely random) measurement error the estimate is attenuated, i.e. biased towards zero
  • Simultaneity / reverse causality: \(X\) affects \(y\) and \(y\) affects \(X\); the regressor is jointly determined with the outcome (price–quantity, income–health)
  • Dynamic models with serially correlated errors: including \(y_{t-1}\) as a regressor while \(u_t\) is autocorrelated makes the lag endogenous
  • Functional form misspecification: wrong transformation or omitted interactions push systematic structure into \(u\), which then correlates with included regressors

All five routes share one consequence: \(\hat{\beta}\) is biased and inconsistent — wrong even in large samples.
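One of these routes, measurement error, is easy to reproduce. The following is a minimal sketch in base R with made-up data: the true slope is 2, and the regressor is observed with noise that has the same variance as the regressor itself. Under these assumptions the slope should shrink by the factor Var(true) / (Var(true) + Var(noise)) = 0.5.

```r
set.seed(42)
n      <- 20000
x_true <- rnorm(n)                     # true regressor, variance 1
y      <- 1 + 2 * x_true + rnorm(n)    # true slope is 2
x_obs  <- x_true + rnorm(n)            # what we observe: true value plus noise

b_true <- coef(lm(y ~ x_true))[["x_true"]]
b_obs  <- coef(lm(y ~ x_obs))[["x_obs"]]
c(true_regressor = b_true, noisy_regressor = b_obs)
```

The slope on the noisy regressor lands near 1 rather than 2, and a sample of 20,000 does not help: more data only makes the wrong answer more precise.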

Our diagnostic dashboard

One function covers five of the six checks:

performance::check_model(model_base)

Each panel targets one assumption — or one practical robustness concern (influential observations). We will go through them in order.

Our diagnostic dashboard — output

Functional Form

Linearity in parameters ≠ linearity in variables

OLS requires the model to be linear in the parameters — not in the raw variables.

Transformation                 Interpretation of \(\hat{\beta}\)
y ~ x (level–level)            +1 unit of x → \(+\hat{\beta}\) units of y
log(y) ~ x (log–level)         +1 unit of x → \(+(\hat{\beta} \times 100)\%\) change in y
y ~ log(x) (level–log)         +1% change in x → \(+\hat{\beta}/100\) units of y
log(y) ~ log(x) (log–log)      +1% change in x → \(+\hat{\beta}\%\) change in y
y ~ x + I(x^2) (quadratic)     Relationship has a turning point

You already used log transformations in Task 1 — this session extends that logic.
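The percentage readings in the table are approximations that work well for small coefficients. A quick sketch with hypothetical log-level coefficients compares the rule of thumb with the exact percentage change, \(100 \times (e^{\hat{\beta}} - 1)\):

```r
beta       <- c(0.02, 0.10, 0.50)        # hypothetical log-level coefficients
approx_pct <- 100 * beta                 # rule of thumb from the table
exact_pct  <- 100 * (exp(beta) - 1)      # exact percentage change in y
round(cbind(beta, approx_pct, exact_pct), 2)
```

For a coefficient of 0.02 the two readings are practically identical; for 0.50 the exact change is about 65% rather than 50%, so large coefficients are better reported with the exact formula.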

Residuals vs. fitted: the primary diagnostic

The red smoother should be flat. A systematic curve means something is missing.

Hotels-europe: linear vs. log-log

The RESET test

Formal test for functional form misspecification (lmtest::resettest()).

resettest(m_linear)

    RESET test

data:  m_linear
RESET = 7.1487, df1 = 2, df2 = 9644, p-value = 0.00079

Warning

A rejection tells you something is wrong — not what to change. Use the residual plot to guide the fix; don’t let the test prescribe it.
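To see what the test does mechanically, here is a sketch of the RESET idea in base R on simulated data (the data and object names are made up for illustration). It adds the squared and cubed fitted values to the model and tests them jointly; to my understanding this mirrors the default of lmtest::resettest(), which is consistent with df1 = 2 in the output above.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 5 + 3 * log(x) + rnorm(n, sd = 0.5)      # the truth is curved in x

m_lin <- lm(y ~ x)                            # misspecified linear fit
yhat  <- fitted(m_lin)
m_aug <- lm(y ~ x + I(yhat^2) + I(yhat^3))    # add powers of the fitted values

reset   <- anova(m_lin, m_aug)                # joint F-test on the two added terms
p_reset <- reset[["Pr(>F)"]][2]
reset
```

If the linear form were adequate, powers of the fitted values would carry no extra information. Here they do, so the test rejects, but it still does not say whether a log, a quadratic or something else is the right fix.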

Fix 1 — Log transformation for a multiplicative relationship

lm(log(price) ~ log(distance + 0.1), data = hotels)

Interpretation: \(\hat{\beta}\) is an elasticity — a 1% increase in \(x\) is associated with a \(\hat{\beta}\)% change in \(y\). Example: log(price) ~ log(distance) with \(\hat{\beta} = -0.12\) → “10% further from centre → price 1.2% lower.”

Fix 2 — Quadratic term for a curved relationship

lm(price ~ distance + I(distance^2), data = hotels)

Interpretation: the marginal effect of \(x\) is \(\hat{\beta}_1 + 2\hat{\beta}_2 x\), so it varies with \(x\). Turning point at \(x^* = -\hat{\beta}_1 / (2\hat{\beta}_2)\). Note: with \(\hat{\beta}_1 > 0\) and \(\hat{\beta}_2 < 0\), the positive effect of \(x\) on \(y\) diminishes and reverses beyond \(x^*\).
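The two formulas can be applied directly to a fitted model. A minimal sketch on simulated data, where the true turning point is at x = 4 (the data and the helper marginal_effect() are hypothetical):

```r
set.seed(3)
n <- 1000
x <- runif(n, 0, 10)
y <- 2 + 4 * x - 0.5 * x^2 + rnorm(n)        # true turning point: 4 / (2 * 0.5) = 4

m_quad <- lm(y ~ x + I(x^2))
b      <- coef(m_quad)

# Marginal effect of x at a given value x0: beta_1 + 2 * beta_2 * x0
marginal_effect <- function(x0) b[["x"]] + 2 * b[["I(x^2)"]] * x0
turning_point   <- -b[["x"]] / (2 * b[["I(x^2)"]])

marginal_effect(c(1, 4, 8))   # positive, close to zero, negative
turning_point                 # close to 4
```

It is worth checking that the estimated turning point lies inside the observed range of x; if it does not, the "reversal" is an extrapolation rather than a feature of the data.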

Heteroskedasticity

Non-constant error variance

OLS assumes \(\text{Var}(u_i \mid X) = \sigma^2\) for all \(i\): the same uncertainty everywhere.

What goes wrong when this fails:

  • OLS estimates are still unbiased — the direction and magnitude are correct on average
  • But the classical standard errors are biased and inconsistent: they can be too small or too large, depending on how the error variance varies with the predictors
  • t-tests and confidence intervals mislead you about precision

Example: hotel prices in Istanbul cluster tightly around the regression line; London hotels scatter much more widely. A single error variance \(\sigma^2\) cannot represent both groups honestly.

Scale-location plot: the visual test

The red smoother should be flat and horizontal.

Hotels-europe: heteroskedasticity in practice

The Breusch-Pagan test

bptest(model_base)

    studentized Breusch-Pagan test

data:  model_base
BP = 30.437, df = 2, p-value = 2.459e-07
  • Null: homoskedasticity (constant variance)
  • Rejection → heteroskedasticity present — standard errors are not reliable

Note

The test confirms what the plot already showed. Always look at the plot first — the test just formalises the diagnosis.
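The logic of the test fits in a few lines. Below is a sketch of the studentized (Koenker) version in base R on simulated data, under the assumption that this is what bptest() computes by default: regress the squared residuals on the predictors and use \(n \cdot R^2\) of that auxiliary regression as the test statistic.

```r
set.seed(11)
n <- 1000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)           # error spread grows with x

fit  <- lm(y ~ x)
aux  <- lm(resid(fit)^2 ~ x)                # can x explain the squared residuals?
bp   <- n * summary(aux)$r.squared          # test statistic
p_bp <- pchisq(bp, df = 1, lower.tail = FALSE)   # df = number of predictors in aux
c(BP = bp, p_value = p_bp)
```

If the predictors explain the size of the squared residuals, the variance is not constant. This is the same information as the scale-location plot, condensed into one number.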

The fix: Heteroskedasticity-robust (HC) standard errors

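  • Replace the standard variance formula with a “sandwich” estimator
  • Uses each observation’s actual squared residual, instead of one common \(\hat{\sigma}^2\), to estimate uncertainty.
  • Result: SEs that remain valid when the error variance is not constant. They can be larger or smaller than the classical ones (below: larger for stars, smaller for distance).
    • Estimates do not change.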
modelsummary(list("Standard" = model_base, "Robust (HC1)" = model_base),
             vcov = list("classical", "HC1"), stars = TRUE)
               Standard     Robust (HC1)
(Intercept)    -30.70***    -30.70***
               (5.15)       (5.69)
distance       -4.65***     -4.65***
               (0.48)       (0.31)
stars          57.13***     57.13***
               (1.41)       (1.84)
Num.Obs.       9648         9648
R2             0.156        0.156
R2 Adj.        0.155        0.155
Std.Errors     IID          HC1
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Rule of thumb: with cross-sectional data, use robust SEs by default.
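What the sandwich estimator does can be written out by hand. The sketch below uses base R and simulated data (all names are illustrative); in practice a package such as sandwich, e.g. vcovHC(fit, type = "HC1"), would be the usual route. The classical formula multiplies \((X^\top X)^{-1}\) by a single \(\hat{\sigma}^2\); HC1 replaces that single number with the individual squared residuals.

```r
set.seed(5)
n <- 5000
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.2 * x^2)    # variance rises steeply with x

fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- resid(fit)
k   <- ncol(X)

bread    <- solve(crossprod(X))                      # (X'X)^-1
vc_class <- sum(e^2) / (n - k) * bread               # classical: one sigma^2 for all
meat     <- crossprod(X * e)                         # X' diag(e^2) X
vc_hc1   <- n / (n - k) * bread %*% meat %*% bread   # sandwich with HC1 scaling

se <- cbind(classical = sqrt(diag(vc_class)), HC1 = sqrt(diag(vc_hc1)))
se
```

The coefficients are untouched; only the variance matrix changes. In this simulated example the robust SE of the slope is larger than the classical one, but as the hotel table shows, it can also go the other way.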

Multicollinearity

When predictors move together

Perfectly correlated predictors: OLS breaks down

Highly correlated predictors: isolating individual contributions becomes difficult

Consequence:

  • Estimates remain unbiased — this is an efficiency problem, not a bias problem
  • Standard errors inflate — imprecise, unstable estimates
  • Small changes to the sample can flip signs or change magnitudes substantially

Example: imagine adding both stars (hotel classification) and rating (guest satisfaction score) to the model. Both measure “quality” — how to check whether this is problematic?

VIF: the Variance Inflation Factor

VIF: the factor by which the variance of \(\hat{\beta}_j\) is inflated relative to a model in which predictor \(j\) is uncorrelated with the other predictors.

\[\text{VIF}_j = \frac{1}{1 - R^2_j}\]

\(R^2_j\): the \(R^2\) from regressing predictor \(j\) on all other predictors.
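The definition translates directly into code. A minimal sketch in base R with three simulated predictors (names hypothetical), two of which are strongly correlated: each predictor is regressed on the others and the formula above is applied to the resulting \(R^2_j\).

```r
set.seed(9)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)   # strongly correlated with x1
x3 <- rnorm(n)                                  # unrelated to both
X  <- data.frame(x1, x2, x3)

vif_by_hand <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2_j   <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2_j)                                # VIF_j = 1 / (1 - R^2_j)
})
vif_by_hand
```

The same numbers appear on the diagonal of the inverse correlation matrix, diag(solve(cor(X))), which is a handy cross-check. Note that the outcome variable plays no role: VIF is a property of the predictors alone.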

VIF Interpretation
1 No inflation — predictor is uncorrelated with others
1–5 Mild — generally acceptable
5–10 Moderate concern — inspect carefully
> 10 Serious — standard errors substantially inflated

Hotels-europe: low vs. problematic VIF

Base model — distance and stars measure different things: VIF is fine.

vif(model_base)   # price ~ distance + stars
distance    stars 
1.003385 1.003385 

Redundant model — a researcher includes both stars and log(stars) to “try both functional forms.” They are near-identical (r ≈ 0.97): VIF explodes.

model_collinear <- lm(price ~ distance + stars + log(stars), data = hotels)
vif(model_collinear)
  distance      stars log(stars) 
  1.004235  20.543854  20.526349 

Warning

Including a variable and a monotonic transformation of it simultaneously is a common source of severe collinearity. The model can barely separate their effects, so the SEs become very large and the individual coefficients unstable. Drop one or the other.

What to do about multicollinearity

  • Do not drop variables mechanically based on VIF — think about what each variable represents
  • Dropping an important control to reduce VIF can introduce omitted variable bias (a far worse problem)
  • If two variables genuinely measure the same construct, keep only one — or combine them into a single index (e.g. a simple average, or the first principal component)
  • Sometimes collinearity is just a feature of reality — acknowledge it and note that individual coefficients should be interpreted with caution

Warning

Interaction terms (x * z) typically produce high VIF for their component variables. This is expected and harmless: centering the variables reduces VIF cosmetically but leaves the interaction coefficient, its standard error and the model fit unchanged.
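That centering is purely cosmetic can be verified directly. A sketch on simulated data (names are illustrative): the same interaction model is fitted once with raw and once with mean-centered predictors.

```r
set.seed(21)
n <- 400
x <- rnorm(n, mean = 10)
z <- rnorm(n, mean = 5)
y <- 1 + 0.5 * x + 0.3 * z + 0.2 * x * z + rnorm(n)

xc <- x - mean(x)                     # centered copies of the predictors
zc <- z - mean(z)

m_raw      <- lm(y ~ x * z)
m_centered <- lm(y ~ xc * zc)

row_raw      <- coef(summary(m_raw))["x:z", ]
row_centered <- coef(summary(m_centered))["xc:zc", ]
rbind(raw = row_raw, centered = row_centered)
```

Estimate, standard error, t-value and p-value of the interaction term are identical in both fits, and so are the fitted values. What does change is the meaning of the main effects: after centering they describe the effect of one variable at the mean of the other instead of at zero.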

Influential Observations

Three types of unusual data points

Type                       What makes it unusual                              Does it distort results?
Outlier                    Large residual: unusual y given x                  Can do, but not always
High-leverage point        Extreme position in x space                        Can do, but not always
Influential observation    Removing it substantially changes the estimates    Yes, by definition

The dangerous case: a point that is both an outlier and high-leverage.

Cook’s distance

\[D_i = \frac{(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})^\top (X^\top X) (\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})}{p \cdot \hat{\sigma}^2}\]

\(D_i\): how much the estimated coefficients shift when observation \(i\) is removed, scaled by the number of estimated coefficients \(p\) and the residual variance \(\hat{\sigma}^2\).

Rule of thumb: \(D_i > 4/n\) flags an observation worth investigating.
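The formula can be taken literally: drop each observation in turn, refit, and measure how far the coefficients move. A sketch in base R on simulated data with one planted problem point (data and names are made up), compared against the built-in cooks.distance():

```r
set.seed(13)
n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6; y[n] <- 0                   # planted point: high leverage and large residual

fit <- lm(y ~ x)
X   <- model.matrix(fit)
p   <- ncol(X)                         # number of estimated coefficients
s2  <- summary(fit)$sigma^2            # sigma-hat squared from the full fit

cooks_by_hand <- sapply(seq_len(n), function(i) {
  d <- coef(fit) - coef(lm(y[-i] ~ x[-i]))      # coefficient shift without obs i
  drop(t(d) %*% crossprod(X) %*% d) / (p * s2)
})

all.equal(cooks_by_hand, unname(cooks.distance(fit)))
which(cooks_by_hand > 4 / n)           # observations worth a closer look
```

The leave-one-out loop is only for illustration. The built-in function obtains the same values from a single fit, which matters with thousands of observations.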

Example: one influential point

The dashed line is the original fit. One point pulls the slope down substantially.

Hotels-europe: who are the influential hotels?

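library(broom)   # augment()
library(dplyr)   # bind_cols(), arrange(), select()

# Attach hotel ids to the per-observation diagnostics and list the largest Cook's distances.
# bind_cols() matches by position, so this assumes lm() dropped no rows from hotels.
augment(model_base) |>
  bind_cols(hotels |> select(hotel_id, city)) |>
  arrange(desc(.cooksd)) |>
  select(hotel_id, city, stars, price, .cooksd) |>
  head(8)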
# A tibble: 8 × 5
  hotel_id city      stars price .cooksd
     <dbl> <chr>     <dbl> <dbl>   <dbl>
1    14797 Prague        4  3714  0.0477
2    14797 Prague        4  3714  0.0477
3     1392 Barcelona     1  1213  0.0313
4     1392 Barcelona     1  1213  0.0313
5     8321 London        5  1848  0.0295
6    13394 Paris         5  1364  0.0139
7    14307 Paris         5  1265  0.0112
8    12644 Paris         5  1218  0.0101

Investigate — don’t delete

  • An influential observation is not automatically an error. It may be the most interesting case in the dataset.
  • Ask: why is this point influential? Data entry error, genuine extreme case, or a different sub-population?
  • Workflow: re-estimate without the flagged observations. Report both results and note the sensitivity.
  • Silently removing influential points without disclosure is a serious methodological error.

Normality of Residuals

The Q-Q plot: good vs. violated

A widespread misconception

What normality actually protects: Exact finite-sample inference — t- and F-statistics follow their named distributions exactly only when errors are normal.

Why it rarely matters in practice:

  • With moderate-to-large \(n\): the Central Limit Theorem ensures \(\hat{\beta}\) is approximately normally distributed regardless of the error distribution (as long as its variance is finite)
  • For our hotels dataset (\(n \approx 9{,}600\)): normality of residuals is essentially irrelevant for inference
  • When it does matter: very small samples (\(n < 30\)), exact prediction intervals
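A small simulation illustrates the point. The sketch below (base R, made-up data, hypothetical helper covers()) draws clearly non-normal, right-skewed errors and records how often the usual 95% confidence interval for the slope contains the true value.

```r
set.seed(2)

# Does the 95% CI for the slope contain the true value 2 in one simulated sample?
covers <- function(n) {
  x  <- runif(n)
  y  <- 1 + 2 * x + (rexp(n) - 1)      # skewed errors with mean zero
  ci <- confint(lm(y ~ x))["x", ]
  ci[1] <= 2 && 2 <= ci[2]
}

coverage <- sapply(c(small = 10, large = 500),
                   function(n) mean(replicate(2000, covers(n))))
coverage
```

With 500 observations the coverage should sit very close to the nominal 95% although the errors are far from normal. The small-sample figure is shown for comparison; how far it departs depends on the error distribution.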

The Shapiro-Wilk trap

shapiro.test(resid(model_base)[1:5000])  # max n for shapiro.test is 5000

    Shapiro-Wilk normality test

data:  resid(model_base)[1:5000]
W = 0.80384, p-value < 2.2e-16

Warning

With large \(n\), Shapiro-Wilk almost always rejects — it detects trivially small deviations from normality that have no practical consequence for inference.

Do not interpret a significant p-value here as a problem. Check the Q-Q plot visually instead.
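The sample-size dependence is easy to demonstrate. In this sketch (base R, simulated data, hypothetical helper mild_skew()) the same mildly skewed distribution is tested once with 50 and once with 5,000 draws:

```r
set.seed(8)
mild_skew <- function(n) rgamma(n, shape = 20)   # skewness about 0.45, roughly bell-shaped

p_small <- shapiro.test(mild_skew(50))$p.value
p_large <- shapiro.test(mild_skew(5000))$p.value
c(n_50 = p_small, n_5000 = p_large)
```

The distribution is the same in both cases; only the power of the test differs. With 5,000 observations the mild skew is detected with near certainty, so the p-value says more about \(n\) than about whether the deviation matters.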

The Posterior Predictive Check

What the PPC shows

The PPC simulates new response values from the fitted model many times and compares their distribution to the observed data.

The PPC catches what the other five panels cannot: a fundamental mismatch between the model’s implied data-generating process and the actual distribution of \(y\).

PPC: level-level vs. log-log

  • Blue line: density of the observed outcome \(y\)
  • Green lines: densities of outcomes simulated from the fitted model
  • If green ≈ blue → the model generates plausible data → distributional assumptions are reasonable
  • If they diverge → the model produces values in the wrong range → mis-specified DGP
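The same comparison can be sketched without the dashboard. The example below uses base R's simulate() on made-up, strictly positive and right-skewed data; it draws new outcomes from the fitted model using the point estimates, so it is a simplified analogue of the check rather than the exact procedure behind the panel.

```r
set.seed(4)
n <- 2000
x <- runif(n, 0, 3)
y <- exp(4 - 0.5 * x + rnorm(n, sd = 0.8))       # strictly positive, right-skewed outcome

m_level <- lm(y ~ x)
m_log   <- lm(log(y) ~ x)

sim_level <- as.matrix(simulate(m_level, nsim = 50))     # 50 simulated outcome vectors
sim_log   <- exp(as.matrix(simulate(m_log, nsim = 50)))  # back-transformed to the y scale

c(observed = min(y), level = min(sim_level), log = min(sim_log))
```

The level model simulates negative outcomes although the observed outcome is never negative, whereas the log model stays in the plausible range. Plotting density(y) against the densities of a few simulated columns reproduces the blue and green lines.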

Dashboard panels 1–2: functional form and variance

  • Linearity — residuals vs. fitted; the red smoother should be flat. Systematic curve → functional form problem
  • Homogeneity of variance — scale-location; the smoother should be horizontal. Remaining fan shape → use robust SEs even with the log model

Dashboard panels 3–4: collinearity and normality

  • Collinearity (VIF) — bars should be short (< 5). Both predictors are fine; recall the stars + log(stars) example where VIF ≈ 20
  • Normality of residuals (Q-Q) — much closer to the diagonal than model_base (skewness −0.1 vs. 7.8); with \(n > 9{,}000\) the remaining departure does not affect inference

Dashboard panels 5–6: influential observations and overall fit

  • Outliers / influential observations — Cook’s D bar and leverage–residuals scatter; points outside the contour lines warrant investigation (report with and without, do not silently delete)
  • Posterior predictive check — green lines now closely track the blue observed density. The log transformation resolved the distributional mismatch we saw with model_base.

What Diagnostics Cannot Detect

What residuals cannot tell you

All diagnostic panels show properties of the residuals — the part of \(y\) the model did not explain.

But the most consequential routes to endogeneity (\(E[u \mid X] \neq 0\)) leave no trace in the residuals:

  • Omitted variable bias — a confounder is absorbed into \(\hat{\beta}\); the fit looks fine
  • Measurement error in regressors — attenuation bias hides behind well-behaved residuals
  • Simultaneity / reverse causality — the endogenous regressor leaves no obvious residual pattern

All three share the same logic: the bias is inside \(\hat{\beta}\), not visible in \(\hat{u}\).

We focus on OVB — the most common case in business and economics research.

The mechanism

Suppose the true model is:

\[y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon\]

but we estimate \(y = \beta_0 + \beta_1 x + v\), omitting \(z\), so that the error term becomes \(v = \beta_2 z + \varepsilon\).

The estimated \(\hat{\beta}_1\) absorbs part of the effect of \(z\) — and is biased unless \(z\) is uncorrelated with \(x\).

Direction-of-bias rule (no algebra needed):

\[\text{sign(bias)} = \text{sign}(\text{effect of } z \text{ on } y) \times \text{sign}(\text{corr}(z, x))\]
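For those who want to see where the rule comes from, here is a short derivation. It assumes that \(\varepsilon\) is uncorrelated with both \(x\) and \(z\); the auxiliary coefficients \(\delta_0, \delta_1\) and the error \(w\) are notation introduced here for illustration. Write the relation between the omitted and the included variable as a regression:

\[z = \delta_0 + \delta_1 x + w, \qquad \delta_1 = \frac{\text{Cov}(x, z)}{\text{Var}(x)}\]

Substituting this into the true model gives

\[y = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1)\, x + (\beta_2 w + \varepsilon)\]

so the regression that omits \(z\) estimates \(\beta_1 + \beta_2 \delta_1\) rather than \(\beta_1\):

\[\text{bias} = \beta_2 \, \delta_1\]

The sign of the bias is the sign of \(\beta_2\) times the sign of \(\text{Cov}(x, z)\), and the bias vanishes if either \(\beta_2 = 0\) or \(x\) and \(z\) are uncorrelated.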

OVB in practice: what changes when we add city?

m_no_city   <- lm(log(price) ~ log(distance + 0.1), data = hotels)
m_with_city <- lm(log(price) ~ log(distance + 0.1) + city, data = hotels)
                 Without city    With city controls
log(distance)    -0.064***       -0.214***
                 (0.008)         (0.007)
Num.Obs.         9648            9648
R2               0.007           0.370
R2 Adj.          0.007           0.370
City FE          No              Yes
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
The distance coefficient more than triples in magnitude (from -0.064 to -0.214), because city drives both the price level and how far hotels are from the centre.
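The same mechanism can be reproduced in a controlled setting where the truth is known. The sketch below uses simulated data, not the hotels data, and all names are hypothetical: z plays the role of the confounder, and the true effect of x is -0.2.

```r
set.seed(6)
n <- 5000
z <- rnorm(n)                           # confounder (think: city price level)
x <- 0.6 * z + rnorm(n)                 # regressor correlated with z
y <- 1 - 0.2 * x + 0.8 * z + rnorm(n)   # true effect of x is -0.2

m_short <- lm(y ~ x)                    # omits z
m_long  <- lm(y ~ x + z)                # controls for z
delta   <- coef(lm(z ~ x))[["x"]]       # slope from regressing z on x

b_short <- coef(m_short)[["x"]]
b_long  <- coef(m_long)
c(short          = b_short,
  long           = b_long[["x"]],
  long_plus_bias = b_long[["x"]] + b_long[["z"]] * delta)
```

The short regression returns a positive coefficient although the true effect is negative. Its estimate equals the long-regression coefficient plus (coefficient on z) × (slope of z on x), exactly and in every sample, which is the direction-of-bias rule expressed in numbers.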

Why the residuals looked fine

Run the full diagnostic suite on the biased model (no city controls):

Both panels pass \(\rightarrow\) residuals scatter randomly, variance looks stable. Yet we know from the previous slide that \(\hat{\beta}_\text{distance}\) is wrong.

Important

The bias was absorbed into \(\hat{\beta}\) when the model was fitted: OLS residuals are uncorrelated with the included regressors by construction, so they have already been “cleaned” of it. OVB is invisible to any residual-based diagnostic tool. The solution is research design, not better diagnostics.
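A short simulation shows why no residual plot can help. In the sketch below (simulated data, hypothetical names) the model omits a confounder z, and the residuals are then correlated with the included regressor and with the omitted one.

```r
set.seed(6)
n <- 5000
z <- rnorm(n)                           # omitted confounder
x <- 0.6 * z + rnorm(n)
y <- 1 - 0.2 * x + 0.8 * z + rnorm(n)

u_hat <- resid(lm(y ~ x))               # residuals of the model that omits z

c(cor_with_x = cor(u_hat, x),           # zero by construction of OLS
  cor_with_z = cor(u_hat, z))           # clearly non-zero
```

The correlation with x is zero up to rounding error, whatever the bias. The problem only becomes visible in the correlation with z, and that requires having measured z in the first place.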

The way forward

Panel data and fixed effects (panel data session):

  • If the omitted variable is time-invariant (e.g. which city), fixed effects eliminate it
  • One of the most powerful practical solutions to OVB

Causal identification (optional session on causality):

  • Randomisation, quasi-experiments
  • Address OVB at the design stage, not the estimation stage

The transition from describing patterns to making causal claims is one of the most important methodological steps in applied data science.

Summary

The diagnostic workflow

For every regression, ask:

  1. Residuals vs. fitted — is the functional form correct? Is there a pattern?
  2. Scale-location — is the variance constant? Do I need robust SEs?
  3. Cook’s distance / leverage — are a few observations driving my results?
  4. VIF — are my predictors too correlated to separate their effects?
  5. Q-Q plot — are there serious departures from normality? (Large \(n\): rarely a concern)
  6. Thinking — what important variable might I have left out?
performance::check_model(your_model)  # panels 1–5 in one call

The sixth question — OVB — requires subject-matter knowledge, not a function.