Multiple Regression: Going Beyond the Basics

Session 2 — 09 April 2026

Prof. Dr. Claudius Gräbner-Radkowitsch

Europa-Universität Flensburg


Today’s plan


Time Activity
16:00–16:10 Debriefing of take-home task 1
16:10–16:30 Input lecture: categorical variables and interactions
16:30–17:00 Live coding: expanding take-home 1
17:00–17:15 Break
17:15–18:15 In-session exercise: hotel prices in Vienna
18:15–18:20 Short break
18:20–18:45 Debriefing the exercise
18:45–19:00 Introduce Take-home Task 2, Q&A

Debriefing Take Home 1

Take-home task 1: what you did

Dataset: CPS 2014 — market research analysts and marketing specialists

What you built:

  • Model 1: earnings ~ age (linear)
  • Model 2: earnings ~ age + age² (quadratic — better residual pattern)
  • Model 3: earnings ~ age + age² + hours + college (with controls)

Take-home task 1: the results


M1: Age M2: Quadratic M3: Full model
Age 14.365*** 110.171*** 85.030**
(3.728) (29.541) (25.737)
Age² -1.142** -0.830**
(0.349) (0.305)
Weekly hours 37.572***
(4.631)
College degree 313.155**
(97.216)
Constant 724.712*** -1133.072+ -2497.224***
(151.674) (587.592) (530.026)
Num.Obs. 252 252 252
R2 0.056 0.095 0.327
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
CPS 2014. Market research analysts, age 24–64, ≥ 20 hrs/week.

Be aware: Adding uhours and college shifted the age coefficients!

Common issues: interpretation

Too vague: “The college coefficient is 313.2.”

Too mechanical: “When college equals one, earnings increase by 313.2, holding other variables constant.”

Good: “Analysts with at least a college degree earn, on average, approximately $313 more per week than those without, holding age and weekly hours constant.”

Tip

Always answer: compared to whom? By how much? In what units? Holding what constant?

Two new tools for Session 2

From Task 1 onwards you explored results using raw R output.

Session 2 introduces two tools that turn raw output into presentation-ready tables:

Tool Purpose
modelsummary Regression tables — one function for all models
flextable Data summaries — any tibble, cleanly formatted

Both render in HTML and PDF without any manual formatting. You will use both in Take-home Task 2.

Raw output using summary()

summary(m1)

Call:
lm(formula = earnwke ~ age, data = cps)

Residuals:
    Min      1Q  Median      3Q     Max 
-1317.7  -465.2  -107.9   289.4  1772.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  724.712    151.674   4.778 3.02e-06 ***
age           14.365      3.728   3.854 0.000148 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 660.1 on 250 degrees of freedom
Multiple R-squared:  0.05608,   Adjusted R-squared:  0.0523 
F-statistic: 14.85 on 1 and 250 DF,  p-value: 0.0001479

Not nice to look at, and not straightforward when you want to compare several models.

modelsummary: before

modelsummary(list(m1, m2, m3), gof_map = c("nobs", "r.squared"))
(1) (2) (3)
(Intercept) 724.712 -1133.072 -2497.224
(151.674) (587.592) (530.026)
age 14.365 110.171 85.030
(3.728) (29.541) (25.737)
age_sq -1.142 -0.830
(0.349) (0.305)
uhours 37.572
(4.631)
college 313.155
(97.216)
Num.Obs. 252 252 252
R2 0.056 0.095 0.327

Column headers auto-named; row labels are raw R names: age_sq, (Intercept), college

modelsummary: after

modelsummary(
  list("M1: Age" = m1, "M2: Quadratic" = m2, "M3: Full model" = m3),
  coef_map = c(
    "age"="Age",  "age_sq"="Age²", "uhours" = "Weekly hours", 
    "college" = "College degree", "(Intercept)" = "Constant"),
  gof_map  = c("nobs", "r.squared"),
  notes    = "CPS 2014. Market research analysts, age 24–64."
)
M1: Age M2: Quadratic M3: Full model
Age 14.365 110.171 85.030
(3.728) (29.541) (25.737)
Age² -1.142 -0.830
(0.349) (0.305)
Weekly hours 37.572
(4.631)
College degree 313.155
(97.216)
Constant 724.712 -1133.072 -2497.224
(151.674) (587.592) (530.026)
Num.Obs. 252 252 252
R2 0.056 0.095 0.327
CPS 2014. Market research analysts, age 24–64.

flextable: before

Without flextable — raw tibble printed to the console:

cps |>
  group_by(edu_cat) |>
  summarise(mean_earn = mean(earnwke))
# A tibble: 3 × 2
  edu_cat    mean_earn
  <fct>          <dbl>
1 No degree      1014.
2 Bachelor's     1321.
3 Graduate       1972.

→ R’s default formatting: variable names as headers, too many decimals, no units.

flextable: after

cps |>
  group_by(edu_cat) |>
  summarise(`Mean weekly earnings (USD)` = round(mean(earnwke), 1)) |>
  rename(`Education level` = edu_cat) |>
  flextable() |>
  theme_vanilla() |>
  autofit()

Education level Mean weekly earnings (USD)
No degree 1,014.1
Bachelor's 1,321.2
Graduate 1,972.4

  • rename() — readable column headers before passing to flextable
  • round() — control decimal places in the summary step
  • theme_vanilla() — clean, publication-style formatting
  • autofit() — adjusts column widths to content automatically

The answer to Reflection Question 1

“Your analysis left out one important individual characteristic that is systematically linked to earnings. Which variable is this?”

Gender.

tidy(cor.test(cps$female, cps$earnwke)) |>
  select(estimate, p.value, conf.low, conf.high) |>
  mutate(across(everything(), \(x) round(x, 3))) |> 
  flextable() |> 
  theme_vanilla()

estimate p.value conf.low conf.high
-0.158 0.012 -0.276 -0.035

Today we learn how to properly include categorical variables such as gender.

Categorical variables: introduction

What is a categorical variable?

A variable that takes a finite set of named values — not a continuous scale.

Variable Values
Gender Male, Female
Education No degree, Bachelor’s, Graduate
Hotel star rating 3-star, 4-star, 5-star
Country Germany, France, Denmark, …

The problem: You cannot pass the text label "Female" to lm().

The solution: Dummy variables — convert each category to a 0/1 indicator.

How R handles categories

R picks one category as the reference group and creates one binary (0/1) indicator for each remaining category.

Example: Gender (reference = Male)

Person Sex female
Anna Female 1
Ben Male 0
Claudia Female 1
cps <- cps |> mutate(female = ifelse(sex == 2, 1, 0))
# OR: use factor(sex) and let R create the dummies automatically

Tip

Using factor() with the levels argument lets you set the reference group manually.

The first element in the levels vector will be the reference group (otherwise: defaults to alphabetical order).

Interpreting a dummy coefficient

Model: earnwke ~ age + uhours + female

Term Estimate Std. Error p-value
(Intercept) -853.89 250.56 0.00
age 13.01 3.28 0.00
uhours 39.99 4.84 0.00
female -69.09 76.55 0.37

→ Female analysts earn, on average, approximately $69 per week less than male analysts of the same age and hours; note, however, that with p = 0.37 this difference is not statistically significant.

Where do dummy coefficients come from?

In a simple regression of earnings on gender only, the coefficient equals the raw mean difference:

cps |>
  group_by(female) |>
  summarise(mean_earn = round(mean(earnwke), 1)) |>
  mutate(diff_vs_male = round(mean_earn - mean_earn[female == 0], 1)) |> 
  flextable() |> 
  theme_vanilla()

female mean_earn diff_vs_male
0 1,417.7 0.0
1 1,199.3 -218.4

coef(lm(earnwke ~ female, data = cps))["female"] |> round(1)
female 
-218.4 

Important

With controls (age, uhours), the coefficient changes — it is the difference holding other variables constant, not the raw group gap.
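A quick sketch of this raw-vs-adjusted comparison, assuming the cps data frame with earnwke, female, age, and uhours as used throughout the lecture:

```r
# Raw gap: a regression on female alone reproduces the raw mean difference
raw <- lm(earnwke ~ female, data = cps)

# Adjusted gap: the same comparison holding age and weekly hours constant
adj <- lm(earnwke ~ female + age + uhours, data = cps)

coef(raw)["female"]  # raw group gap (about -218 per the slides)
coef(adj)["female"]  # typically shrinks once controls are added
```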

Multi-category variables

What if we want education as three levels, not a binary college/no-college split?

  • Pick a reference category (e.g. “No degree”)
  • R creates k − 1 dummy variables for k categories
  • Each coefficient = difference from the reference group
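As a sketch of the mechanics, you can let lm() build the k − 1 dummies from a factor (this assumes edu_cat is already a factor with "No degree" as its first level):

```r
# lm() automatically expands the 3-level factor into 2 dummy variables
m_edu <- lm(earnwke ~ edu_cat, data = cps)
coef(m_edu)
# The coefficients edu_catBachelor's and edu_catGraduate give each group's
# difference in mean weekly earnings relative to the "No degree" reference
```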
cps |>
  group_by(edu_cat) |>
  summarise(mean_earn = round(mean(earnwke), 1)) |>
  mutate(vs_no_degree = round(mean_earn - mean_earn[edu_cat == "No degree"], 1)) |>
  flextable() |>
  theme_vanilla() |> 
  autofit()

edu_cat mean_earn vs_no_degree
No degree 1,014.1 0.0
Bachelor's 1,321.2 307.1
Graduate 1,972.4 958.3

Why the reference category matters

Same model, different reference — different numbers, same underlying story.

  • Reference = No degree: Bachelor +$307, Graduate +$958
  • Reference = Bachelor: No degree −$307, Graduate +$651
  • Reference = Graduate: No degree −$958, Bachelor −$651

How does R choose the reference by default?

R orders factor levels alphabetically — the first level alphabetically becomes the reference.

factor(c("Graduate", "No degree", "Bachelor's"))
# Levels: Bachelor's  Graduate  No degree   ← Bachelor's is reference (B < G < N)

Easiest fix: set levels manually

factor(c("Graduate", "No degree", "Bachelor's"), levels = c("No degree", "Bachelor's", "Graduate"))
# Levels: No degree  Bachelor's  Graduate   ← No degree is reference

Ex-post override: change the reference after creating the factor with relevel():

relevel(factor(edu), ref = "No degree")
# Now "No degree" is the reference

Interaction terms

Interaction terms: motivation

A regression without interactions assumes: The effect of age on earnings is the same for men and women.

But is that realistic?

  • Career interruptions are not equally distributed across genders
  • Seniority and promotion patterns differ
  • The age–earnings profile may be steeper for men

An interaction term lets the slope of age differ by gender.
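In R's formula syntax this is a one-character change; a sketch, using the cps data from the lecture:

```r
# age * female is shorthand for age + female + age:female
m_int <- lm(earnwke ~ age * female, data = cps)

# Equivalent, spelled out term by term:
m_int2 <- lm(earnwke ~ age + female + age:female, data = cps)
```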

Visualising an interaction

cps |>
  mutate(gender = ifelse(female == 1, "Female", "Male")) |>
  ggplot(aes(x = age, y = earnwke, colour = gender)) +
  geom_point(alpha = 0.2, size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_manual(values = c("Male" = "#00395B", "Female" = "#69AACD")) +
  labs(x = "Age", y = "Weekly earnings (USD)", colour = NULL,
       title = "Are the age–earnings slopes the same for men and women?") +
  theme_minimal()

But do the slopes really differ significantly, or is it just a level difference plus noise? 🤔

Interpreting an interaction coefficient

Model: earnwke ~ age * female (shorthand for: age + female + age:female)

Term Estimate Std. Error p-value
(Intercept) 689.545 235.542 0.004
age 18.543 5.766 0.001
female 68.082 305.564 0.824
age:female -7.229 7.500 0.336
  • age: slope of age for men — each additional year adds about $18.54 to male weekly earnings
  • female: level difference at age = 0 — not directly interpretable
  • age:female: extra slope for women — how much the age slope differs for women

The marginal effect of age

With interactions, “the effect of age” depends on the group.

Term Estimate Std. Error p-value
(Intercept) 689.545 235.542 0.004
age 18.543 5.766 0.001
female 68.082 305.564 0.824
age:female -7.229 7.500 0.336
b <- coef(lm(earnwke ~ age * female, data = cps))
  • Age slope for men: round(b["age"], 2) = 18.54
  • Age slope for women: round(b["age"] + b["age:female"], 2) = 11.31
  • Difference: round(b["age:female"], 2) = -7.23

\[\frac{\partial \text{earnwke}}{\partial \text{age}} = \hat{\beta}_\text{age} + \hat{\beta}_{\text{age:female}} \times \text{female}\]

→ Always report group-specific slopes — the interaction coefficient alone is difficult to communicate.
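One way to sanity-check the group-specific slopes is via predict(): increase age by one year in each group and look at the change in fitted earnings. A sketch, with the same cps data and model as above:

```r
m <- lm(earnwke ~ age * female, data = cps)

# Fitted earnings at age 40 and 41, separately for men and women
nd <- data.frame(age = c(40, 41, 40, 41), female = c(0, 0, 1, 1))
p  <- predict(m, newdata = nd)

p[2] - p[1]  # age slope for men   (beta_age, about 18.5)
p[4] - p[3]  # age slope for women (beta_age + beta_age:female, about 11.3)
```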

Live coding demo

Live coding demo

We return to the CPS data and extend the take-home analysis:

  1. Add female back from the raw CPS file
  2. Fit Model 3 from the take-home, now including gender
  3. Recode grade92 from binary to three education categories
  4. Visualise: do male and female age-earnings profiles look parallel?
  5. Add an age × female interaction
  6. Compare all models in one clean modelsummary table

Note

The full code and output are available as lecture notes on the course website. You do not need to take notes — focus on understanding the steps.

Exercise

Exercise: hotel prices in Vienna

Business question: What drives hotel prices — and does the price penalty for being far from the city centre differ between budget and luxury hotels?

Dataset: 428 hotels in Vienna (Békés & Kézdi 2021). Download from OSF: osf.io/4e6d8

Your tasks:

  1. Explore the data — understand what one observation represents
  2. Model price as a function of distance and stars (numeric, then as a factor)
  3. Add an interaction: does the distance effect vary by star category?
  4. Present all models in a clean modelsummary table
  5. Write a plain-language paragraph for a non-technical manager
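A possible skeleton for tasks 2–4; the variable names price, distance, and stars are assumptions, so check the codebook after downloading:

```r
# Models of increasing complexity for the Vienna hotel data
m1 <- lm(price ~ distance, data = hotels)                  # distance only
m2 <- lm(price ~ distance + stars, data = hotels)          # stars as numeric
m3 <- lm(price ~ distance + factor(stars), data = hotels)  # stars as categories
m4 <- lm(price ~ distance * factor(stars), data = hotels)  # distance slope per star category

modelsummary(list(m1, m2, m3, m4), gof_map = c("nobs", "r.squared"))
```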

Take-Home Task 2

Take-Home Task 2

Assigned today — due 30 April 2026, 08:00

Dataset: Hotels Europe — prices and features across 46 European cities (Békés & Kézdi 2021)

  • Build four regression models of increasing complexity (distance → city type → stars → interaction)
  • Use modelsummary for regression tables and flextable for descriptive statistics
  • Visualise the price–distance relationship before interpreting the model
  • Reflect critically on what your model leaves out

Full task description distributed via GitHub Classroom (link on the course website).

See you on 30 April

Next session: Thu, 30 April — Modelling binary and categorical outcomes

Questions during the break? GitHub Discussions

Task 2 deadline: 30 April 2026, 08:00

The lecture notes extending Task 1 are available now on the course website.