Biases and Diagnostics · Session 4
2026-05-21
We have estimated hotel price as a function of distance and star rating.
| | (1) |
|---|---|
| (Intercept) | -30.70*** |
| | (5.15) |
| distance | -4.65*** |
| | (0.48) |
| stars | 57.13*** |
| | (1.41) |
| Num.Obs. | 9648 |
| R2 | 0.156 |
| R2 Adj. | 0.155 |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. Standard errors in parentheses.
Best Linear Unbiased Estimator — OLS is BLUE when its assumptions hold
Unbiasedness = pointing in the right direction on average
Efficiency = grouping tightly around the truth
Blue: correct answer, precise. Gold: correct direction, imprecise. Red: wrong answer entirely.
| Assumption | Formula | Plain English | Cost if violated |
|---|---|---|---|
| Linearity | \(y = X\beta + u\) | Correct functional form | Biased estimates |
| No perfect collinearity | \(\text{rank}(X) = k\) | Predictors not identical | Cannot estimate |
| Exogeneity | \(E[u \mid X] = 0\) | No omitted confounders | Biased estimates |
| Homoskedasticity | \(\text{Var}(u \mid X) = \sigma^2\) | Constant error variance | SEs are wrong |
| (Normality) | \(u \mid X \sim \mathcal{N}(0,\sigma^2)\) | Normal errors | Exact inference only |
Assumptions 1 & 3 broken → wrong answer even in large samples
Assumption 4 broken → right answer, wrong uncertainty
All five routes share one consequence: \(\hat{\beta}\) is biased and inconsistent — wrong even in large samples.
One function covers five of the six checks:
Each panel targets one assumption — or one practical robustness concern (influential observations). We will go through them in order.
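The function itself is not named here. One plausible candidate, assumed rather than confirmed by the slides, is check_model() from the performance package, which draws a grid of diagnostic panels including a posterior predictive check. A minimal sketch on simulated data (the data frame d, its values, and the name model_demo are made up for illustration):

```r
# Simulated stand-in for the hotel data (hypothetical values, not the course dataset)
set.seed(123)
n <- 300
d <- data.frame(
  distance = runif(n, 0, 10),
  stars    = sample(1:5, n, replace = TRUE)
)
d$price <- 50 - 5 * d$distance + 55 * d$stars + rnorm(n, sd = 40)

model_demo <- lm(price ~ distance + stars, data = d)

# Assumption: the diagnostic function is performance::check_model().
# It needs the 'performance' and 'see' packages, so guard the call.
if (requireNamespace("performance", quietly = TRUE) &&
    requireNamespace("see", quietly = TRUE)) {
  print(performance::check_model(model_demo))
} else {
  # Base-R fallback: the four classic lm diagnostic plots
  par(mfrow = c(2, 2))
  plot(model_demo)
}
```

The base-R fallback covers fewer checks (no collinearity or predictive-check panel), so it is a rough substitute only.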
OLS requires the model to be linear in the parameters — not in the raw variables.
| Transformation | Interpretation of \(\hat{\beta}\) |
|---|---|
| y ~ x (level–level) | +1 unit of x → \(+\hat{\beta}\) units of y |
| log(y) ~ x (log–level) | +1 unit of x → \(+(\hat{\beta} \times 100)\%\) change in y |
| y ~ log(x) (level–log) | +1% change in x → \(+\hat{\beta}/100\) units of y |
| log(y) ~ log(x) (log–log) | +1% change in x → \(+\hat{\beta}\%\) change in y |
| y ~ x + I(x^2) (quadratic) | Relationship has a turning point |
You already used log transformations in Task 1 — this session extends that logic.
The red smoother should be flat. A systematic curve means something is missing.
Formal test for functional form misspecification (lmtest::resettest()).
RESET test
data: m_linear
RESET = 7.1487, df1 = 2, df2 = 9644, p-value = 0.00079
Warning
A rejection tells you something is wrong — not what to change. Use the residual plot to guide the fix; don’t let the test prescribe it.
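A rough sketch of what the test does: with its default settings, lmtest::resettest() adds the squared and cubed fitted values to the model and runs an F-test on them, which matches df1 = 2 in the output above. The version below reproduces that logic in base R. The simulated data and all variable names are illustrative assumptions, not the hotel data.

```r
set.seed(1)
n <- 500
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2 + 0.5 * d$x^2 + rnorm(n)   # true relationship is curved

m_lin <- lm(y ~ x, data = d)        # misspecified: linear in x

# Augment the model with powers of the fitted values
d$yhat2 <- fitted(m_lin)^2
d$yhat3 <- fitted(m_lin)^3
m_aug <- lm(y ~ x + yhat2 + yhat3, data = d)

# F-test of the two added terms (df1 = 2)
reset_table <- anova(m_lin, m_aug)
p_reset <- reset_table[["Pr(>F)"]][2]
p_reset
```

A small p-value says the powers of the fitted values still explain something, so the linear form is missing curvature. It does not say which transformation to use.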
Interpretation: \(\hat{\beta}\) is an elasticity — a 1% increase in \(x\) is associated with a \(\hat{\beta}\)% change in \(y\). Example: log(price) ~ log(distance) with \(\hat{\beta} = -0.12\) → “10% further from centre → price 1.2% lower.”
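The "percent per percent" reading is a first-order approximation. A short check, using the value −0.12 from the example above, compares it with the exact change implied by a log–log model:

```r
beta_hat <- -0.12          # elasticity from the example above
pct_increase <- 10         # distance rises by 10%

# Approximation: beta_hat times the percentage change in x
approx_pct <- beta_hat * pct_increase

# Exact change implied by log(y) = a + beta_hat * log(x)
exact_pct <- ((1 + pct_increase / 100)^beta_hat - 1) * 100

c(approx = approx_pct, exact = exact_pct)
```

The two agree closely for small changes and drift apart as the percentage change in x grows.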
Interpretation: the marginal effect of \(x\) is \(\hat{\beta}_1 + 2\hat{\beta}_2 x\), so it varies with \(x\). Turning point at \(x^* = -\hat{\beta}_1 / (2\hat{\beta}_2)\). Note: with \(\hat{\beta}_1 > 0\) and \(\hat{\beta}_2 < 0\), the effect of \(x\) on \(y\) diminishes as \(x\) grows and reverses sign beyond \(x^*\).
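A minimal sketch of the marginal-effect and turning-point formulas, on simulated data where the true turning point is known to be at x = 4 (the data and names are illustrative assumptions):

```r
set.seed(2)
n <- 500
x <- runif(n, 0, 8)
y <- 1 + 4 * x - 0.5 * x^2 + rnorm(n)   # true turning point at x = 4

m_quad <- lm(y ~ x + I(x^2))
b1 <- coef(m_quad)[["x"]]
b2 <- coef(m_quad)[["I(x^2)"]]

# Marginal effect of x at a few values: b1 + 2 * b2 * x
x_grid <- c(1, 4, 7)
marginal_effect <- b1 + 2 * b2 * x_grid

# Turning point: where the marginal effect is zero
x_star <- -b1 / (2 * b2)

data.frame(x = x_grid, marginal_effect = marginal_effect)
x_star
```

The marginal effect is positive below the turning point and negative above it, so reporting a single "effect of x" would be misleading here.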
OLS assumes \(\text{Var}(u_i \mid X) = \sigma^2\) for all \(i\): the same uncertainty everywhere.
What goes wrong when this fails:
Example: hotel prices in Istanbul cluster tightly around the regression line; London hotels scatter much more widely. A single SE cannot represent both groups honestly.
The red smoother should be flat and horizontal.
studentized Breusch-Pagan test
data: model_base
BP = 30.437, df = 2, p-value = 2.459e-07
Note
The test confirms what the plot already showed. Always look at the plot first — the test just formalises the diagnosis.
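For intuition, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the predictors and compare \(n \cdot R^2\) from that auxiliary regression with a chi-squared distribution. The sketch below uses simulated data with illustrative names. In practice lmtest::bptest() does this in one call.

```r
set.seed(42)
n <- 500
d <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
# Error spread grows with x1, so the variance is not constant
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n, sd = 0.5 + 0.5 * d$x1)

m <- lm(y ~ x1 + x2, data = d)

# Auxiliary regression of squared residuals on the predictors
d$u2 <- resid(m)^2
aux <- lm(u2 ~ x1 + x2, data = d)

bp_stat <- n * summary(aux)$r.squared          # LM statistic
bp_df   <- length(coef(aux)) - 1               # predictors, excluding intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

With two predictors the test has df = 2, the same as in the output above.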
| | Standard | Robust (HC1) |
|---|---|---|
| (Intercept) | -30.70*** | -30.70*** |
| | (5.15) | (5.69) |
| distance | -4.65*** | -4.65*** |
| | (0.48) | (0.31) |
| stars | 57.13*** | 57.13*** |
| | (1.41) | (1.84) |
| Num.Obs. | 9648 | 9648 |
| R2 | 0.156 | 0.156 |
| R2 Adj. | 0.155 | 0.155 |
| Std.Errors | IID | HC1 |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. Standard errors in parentheses.
Rule of thumb: with cross-sectional data, use robust SEs by default.
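A sketch of where HC1 standard errors come from, written in base R on simulated data (names and data are illustrative assumptions). The coefficients stay the same; only the variance estimate changes to the "sandwich" form with a small-sample correction. In applied work sandwich::vcovHC(m, type = "HC1") is the usual route.

```r
set.seed(42)
n <- 500
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 0.5 * x)   # heteroskedastic errors

m <- lm(y ~ x)
X <- model.matrix(m)      # n x k design matrix
u <- resid(m)
k <- ncol(X)

# Sandwich: (X'X)^-1  X' diag(u^2) X  (X'X)^-1, scaled by n / (n - k) for HC1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * u)          # each row of X multiplied by its residual
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_iid <- sqrt(diag(vcov(m)))
se_hc1 <- sqrt(diag(vcov_hc1))

rbind(IID = se_iid, HC1 = se_hc1)
```

Robust standard errors can come out larger or smaller than the IID ones, as the table above shows for stars and distance.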
Perfectly correlated predictors: OLS breaks down
Highly correlated predictors: isolating individual contributions becomes difficult
Consequence:
Example: imagine adding both stars (hotel classification) and rating (guest satisfaction score) to the model. Both measure “quality”, so how do we check whether this is a problem?
VIF: the factor by which the variance of \(\hat{\beta}_j\) is inflated relative to a model in which predictor \(j\) is uncorrelated with the other predictors.
\[\text{VIF}_j = \frac{1}{1 - R^2_j}\]
\(R^2_j\): the \(R^2\) from regressing predictor \(j\) on all other predictors.
| VIF | Interpretation |
|---|---|
| 1 | No inflation — predictor is uncorrelated with others |
| 1–5 | Mild — generally acceptable |
| 5–10 | Moderate concern — inspect carefully |
| > 10 | Serious — standard errors substantially inflated |
Base model — distance and stars measure different things: VIF is fine.
Redundant model — a researcher includes both stars and log(stars) to “try both functional forms.” They are near-identical (r ≈ 0.97): VIF explodes.
Warning
Including a variable together with a monotonic transformation of it is a common source of severe collinearity. The model cannot separate their effects, so the standard errors become very large. Drop one or the other.
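The VIF formula can be applied directly. The sketch below uses simulated stars and distance values (not the hotel data) and a hypothetical helper vif_manual() that follows the definition: regress each predictor on the others and compute \(1 / (1 - R^2_j)\). In practice car::vif() is the usual tool.

```r
set.seed(10)
n <- 1000
d <- data.frame(
  distance = runif(n, 0, 10),
  stars    = sample(1:5, n, replace = TRUE)
)
d$log_stars <- log(d$stars)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_base      <- vif_manual(d, c("distance", "stars"))
vif_redundant <- vif_manual(d, c("distance", "stars", "log_stars"))

vif_base
vif_redundant
```

In the redundant specification only stars and log_stars are affected; the VIF for distance stays close to 1 because it is unrelated to either.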
Warning
Interaction terms (x * z) typically produce high VIF for their component variables. This is expected and harmless: centering the variables reduces VIF cosmetically but does not change inference on the interaction term.
| Type | What makes it unusual | Does it distort results? |
|---|---|---|
| Outlier | Large residual — unusual y given x | Can do, but not always |
| High-leverage point | Extreme position in x space | Can do, but not always |
| Influential observation | Removing it substantially changes the estimates | Yes, by definition |
The dangerous case: a point that is both an outlier and high-leverage.
\[D_i = \frac{(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})^\top (X^\top X) (\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})}{p \cdot \hat{\sigma}^2}\]
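\(D_i\): how much the estimated coefficients shift when observation \(i\) is removed, where \(p\) is the number of estimated coefficients and \(\hat{\sigma}^2\) is the estimated error variance.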
Rule of thumb: \(D_i > 4/n\) flags an observation worth investigating.
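A minimal sketch of the rule of thumb, using simulated data with one planted point that is both an outlier and high-leverage (all names and values are illustrative). It also checks the built-in cooks.distance() against the algebraically equivalent leverage form \(D_i = e_i^2 h_i / (p \, \hat{\sigma}^2 (1 - h_i)^2)\).

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Plant one point that is high-leverage (extreme x) and an outlier (y far off the line)
x[n] <- 6
y[n] <- -10

m <- lm(y ~ x)
cooks_d <- cooks.distance(m)

# Rule of thumb: flag observations with D_i > 4 / n
flagged <- which(cooks_d > 4 / n)

# Equivalent leverage form: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2)
h  <- hatvalues(m)
p  <- length(coef(m))
s2 <- summary(m)$sigma^2
cooks_manual <- resid(m)^2 * h / (p * s2 * (1 - h)^2)

head(sort(cooks_d, decreasing = TRUE), 3)
```

A flagged observation is a prompt to inspect the data, not an instruction to delete it.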
The dashed line is the original fit. One point pulls the slope down substantially.
# A tibble: 8 × 5
hotel_id city stars price .cooksd
<dbl> <chr> <dbl> <dbl> <dbl>
1 14797 Prague 4 3714 0.0477
2 14797 Prague 4 3714 0.0477
3 1392 Barcelona 1 1213 0.0313
4 1392 Barcelona 1 1213 0.0313
5 8321 London 5 1848 0.0295
6 13394 Paris 5 1364 0.0139
7 14307 Paris 5 1265 0.0112
8 12644 Paris 5 1218 0.0101
What normality actually protects: Exact finite-sample inference — t- and F-statistics follow their named distributions exactly only when errors are normal.
Why it rarely matters in practice:
Shapiro-Wilk normality test
data: resid(model_base)[1:5000]
W = 0.80384, p-value < 2.2e-16
Warning
With large \(n\), Shapiro-Wilk almost always rejects — it detects trivially small deviations from normality that have no practical consequence for inference.
Do not interpret a significant p-value here as a problem. Check the Q-Q plot visually instead.
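A short sketch of the visual check on simulated, mildly heavy-tailed "residuals" (the vector u is an illustrative assumption). Note that shapiro.test() only accepts between 3 and 5000 observations, which is presumably why the output above uses the first 5000 residuals.

```r
set.seed(3)
n <- 9648                      # same sample size as the hotel regression
u <- rt(n, df = 8)             # simulated residuals with slightly heavy tails

# shapiro.test() rejects inputs longer than 5000, so test a subset
sw <- shapiro.test(u[1:5000])
sw$p.value

# Visual check: points close to the line indicate approximate normality
qqnorm(u)
qqline(u)
```

Departures in the Q-Q plot matter mainly when they are large and the sample is small.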
The PPC simulates new response values from the fitted model many times and compares their distribution to the observed data.
The PPC catches what the other five panels cannot: a fundamental mismatch between the model’s implied data-generating process and the actual distribution of \(y\).
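The idea can be sketched in base R with simulate(), which draws new response vectors from a fitted lm. The example fits a level model to a right-skewed simulated outcome and compares skewness between observed and simulated data. The data, the skewness helper, and the choice of 50 draws are illustrative assumptions.

```r
set.seed(5)
n <- 1000
x <- runif(n)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.8))   # right-skewed outcome

m_level <- lm(y ~ x)                          # level model ignores the skew

# Draw 50 simulated response vectors from the fitted model
sims <- simulate(m_level, nsim = 50)          # n rows, 50 columns

skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

obs_skew <- skewness(y)
sim_skew <- sapply(sims, skewness)

obs_skew
range(sim_skew)
```

If the observed statistic lies far outside the range of the simulated ones, the model's implied distribution of \(y\) does not match the data. Here a log transformation of y would be the natural next step.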
[…] stars + log(stars) example where VIF ≈ 20
[…] model_base (skewness −0.1 vs. 7.8); with \(n > 9{,}000\) the remaining departure does not affect inference
[…] model_base.
All diagnostic panels show properties of the residuals: the part of \(y\) the model did not explain.
But every route to endogeneity (\(E[u \mid X] \neq 0\)) is invisible in the residuals:
All three share the same logic: the bias is inside \(\hat{\beta}\), not visible in \(\hat{u}\).
We focus on OVB — the most common case in business and economics research.
Suppose the true model is:
\[y = \beta_0 + \beta_1 x + \beta_2 z + u\]
but we estimate \(y = \beta_0 + \beta_1 x + u\) (omitting \(z\)).
The estimated \(\hat{\beta}_1\) absorbs part of the effect of \(z\) — and is biased unless \(z\) is uncorrelated with \(x\).
Direction-of-bias rule (no algebra needed):
\[\text{sign(bias)} = \text{sign}(\text{effect of } z \text{ on } y) \times \text{sign}(\text{corr}(z, x))\]
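The rule can be checked on simulated data. In any sample, the short-regression slope equals the long-regression slope plus (coefficient on \(z\)) × (slope from regressing \(z\) on \(x\)). The sketch below uses made-up coefficients and names. Here \(z\) raises \(y\) and is positively correlated with \(x\), so the bias is positive.

```r
set.seed(7)
n <- 2000
z <- rnorm(n)
x <- 0.6 * z + rnorm(n)             # x and z positively correlated
y <- 1 + 2 * x + 3 * z + rnorm(n)   # z has a positive effect on y

long  <- lm(y ~ x + z)              # correct model
short <- lm(y ~ x)                  # z omitted
aux   <- lm(z ~ x)                  # how z moves with x

bias_observed <- coef(short)[["x"]] - coef(long)[["x"]]
bias_formula  <- coef(long)[["z"]] * coef(aux)[["x"]]

c(observed = bias_observed, formula = bias_formula)
```

The two numbers coincide exactly, which is the omitted-variable bias formula at work: positive × positive gives an upward-biased slope.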
| | Without city | With city controls |
|---|---|---|
| log(distance) | -0.064*** | -0.214*** |
| | (0.008) | (0.007) |
| Num.Obs. | 9648 | 9648 |
| R2 | 0.007 | 0.370 |
| R2 Adj. | 0.007 | 0.370 |
| City FE | No | Yes |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. Standard errors in parentheses.
The distance coefficient changes substantially — because city drives both price level and how far hotels are from the centre.
Run the full diagnostic suite on the biased model (no city controls):
Both panels pass \(\rightarrow\) residuals scatter randomly, variance looks stable. Yet we know from the previous slide that \(\hat{\beta}_\text{distance}\) is wrong.
Important
The bias was absorbed into \(\hat{\beta}\) when the model was fitted, so the residuals have already been “cleaned” of it. OVB is invisible to any residual-based diagnostic. The solution is research design, not better diagnostics.
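A small simulation, with made-up coefficients and names, shows why: OLS residuals are uncorrelated with every included regressor by construction, so a residual plot against \(x\) looks clean even when the slope is badly biased.

```r
set.seed(7)
n <- 2000
z <- rnorm(n)                       # confounder, treated as unobserved
x <- 0.6 * z + rnorm(n)
y <- 1 + 2 * x + 3 * z + rnorm(n)   # true effect of x is 2

short <- lm(y ~ x)                  # z omitted
coef(short)[["x"]]                  # biased away from 2

# Zero (up to rounding) by construction, whatever was omitted
cor(resid(short), x)

# The problem only shows up if z can be observed
cor(resid(short), z)
```

The second correlation is only computable because the simulation keeps \(z\). With real data the omitted variable is usually unavailable, which is why design has to do the work.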
Panel data and fixed effects (panel data session):
Causal identification (optional session on causality):
The transition from describing patterns to making causal claims is one of the most important methodological steps in applied data science.
For every regression, ask: