Multiple Regression: Going Beyond the Basics

Session 2 — 09 April 2026

Prof. Dr. Claudius Gräbner-Radkowitsch

Europa-Universität Flensburg


Today’s plan


Time Activity
16:00–16:10 Debriefing of take-home task 1
16:10–16:30 Input lecture: categorical variables and interactions
16:30–17:00 Live coding: expanding take-home 1
17:00–17:15 Break
17:15–18:15 In-session exercise: hotel prices in Vienna
18:15–18:20 Short break
18:20–18:45 Debriefing the exercise
18:45–19:00 Introduce Take-home Task 2, Q&A

Debriefing Take Home 1

Take-home task 1: what you did

Dataset: CPS 2014 — market research analysts and marketing specialists

What you built:

  • Model 1: earnings ~ age (linear)
  • Model 2: earnings ~ age + age² (quadratic — better residual pattern)
  • Model 3: earnings ~ age + age² + hours + college (with controls)

Take-home task 1: the results


M1: Age M2: Quadratic M3: Full model
Age 14.365*** 110.171*** 85.030**
(3.728) (29.541) (25.737)
Age² -1.142** -0.830**
(0.349) (0.305)
Weekly hours 37.572***
(4.631)
College degree 313.155**
(97.216)
Constant 724.712*** -1133.072+ -2497.224***
(151.674) (587.592) (530.026)
Num.Obs. 252 252 252
R2 0.056 0.095 0.327
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
CPS 2014. Market research analysts, age 24–64, ≥ 20 hrs/week.

Be aware: Adding uhours and college shifted the age coefficients!

Common issues: interpretation

Too vague: “The college coefficient is 313.2.”

Too mechanical: “When college equals one, earnings increase by 313.2, holding other variables constant.”

Good: “Analysts with at least a college degree earn, on average, approximately $313 more per week than those without, holding age and weekly hours constant.”

Tip

Always answer: compared to whom? By how much? In what units? Holding what constant?

Two new tools for Session 2

From Task 1 onwards you explored results using raw R output.

Session 2 introduces two tools that turn raw output into presentation-ready tables:

Tool Purpose
modelsummary Regression tables — one function for all models
flextable Data summaries — any tibble, cleanly formatted

Both render in HTML and PDF without any manual formatting. You will use both in Take-home Task 2.

Raw output using summary()

summary(m1)

Call:
lm(formula = earnwke ~ age, data = cps)

Residuals:
    Min      1Q  Median      3Q     Max 
-1317.7  -465.2  -107.9   289.4  1772.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  724.712    151.674   4.778 3.02e-06 ***
age           14.365      3.728   3.854 0.000148 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 660.1 on 250 degrees of freedom
Multiple R-squared:  0.05608,   Adjusted R-squared:  0.0523 
F-statistic: 14.85 on 1 and 250 DF,  p-value: 0.0001479

Not nice to look at, and not straightforward when you want to compare several models.

modelsummary: before

modelsummary(list(m1, m2, m3), gof_map = c("nobs", "r.squared"))
(1) (2) (3)
(Intercept) 724.712 -1133.072 -2497.224
(151.674) (587.592) (530.026)
age 14.365 110.171 85.030
(3.728) (29.541) (25.737)
age_sq -1.142 -0.830
(0.349) (0.305)
uhours 37.572
(4.631)
college 313.155
(97.216)
Num.Obs. 252 252 252
R2 0.056 0.095 0.327

Column headers auto-named; row labels are raw R names: age_sq, (Intercept), college

modelsummary: after

modelsummary(
  list("M1: Age" = m1, "M2: Quadratic" = m2, "M3: Full model" = m3),
  coef_map = c(
    "age"="Age",  "age_sq"="Age²", "uhours" = "Weekly hours", 
    "college" = "College degree", "(Intercept)" = "Constant"),
  gof_map  = c("nobs", "r.squared"),
  notes    = "CPS 2014. Market research analysts, age 24–64."
)
M1: Age M2: Quadratic M3: Full model
Age 14.365 110.171 85.030
(3.728) (29.541) (25.737)
Age² -1.142 -0.830
(0.349) (0.305)
Weekly hours 37.572
(4.631)
College degree 313.155
(97.216)
Constant 724.712 -1133.072 -2497.224
(151.674) (587.592) (530.026)
Num.Obs. 252 252 252
R2 0.056 0.095 0.327
CPS 2014. Market research analysts, age 24–64.

flextable: before

Without flextable — raw tibble printed to the console:

cps |>
  group_by(edu_cat) |>
  summarise(mean_earn = mean(earnwke))
# A tibble: 3 × 2
  edu_cat    mean_earn
  <fct>          <dbl>
1 No degree      1014.
2 Bachelor's     1321.
3 Graduate       1972.

→ R’s default formatting: variable names as headers, too many decimals, no units.

flextable: after

cps |>
  group_by(edu_cat) |>
  summarise(`Mean weekly earnings (USD)` = round(mean(earnwke), 1)) |>
  rename(`Education level` = edu_cat) |>
  flextable() |>
  theme_vanilla() |>
  autofit()

Education level Mean weekly earnings (USD)
No degree 1,014.1
Bachelor's 1,321.2
Graduate 1,972.4

  • rename() — readable column headers before passing to flextable
  • round() — control decimal places in the summary step
  • theme_vanilla() — clean, publication-style formatting
  • autofit() — adjusts column widths to content automatically

The answer to Reflection Question 1

“Your analysis left out one important individual characteristic that is systematically linked to earnings. Which variable is this?”

Gender.

tidy(cor.test(cps$female, cps$earnwke)) |>
  select(estimate, p.value, conf.low, conf.high) |>
  mutate(across(everything(), \(x) round(x, 3))) |> 
  flextable() |> 
  theme_vanilla()

estimate p.value conf.low conf.high
-0.158 0.012 -0.276 -0.035

Today we learn how to properly include categorical variables such as gender.

Categorical variables: introduction

What is a categorical variable?

A variable that takes a finite set of named values — not a continuous scale.

Variable Values
Gender Male, Female
Education No degree, Bachelor’s, Graduate
Hotel star rating 3-star, 4-star, 5-star
Country Germany, France, Denmark, …

The problem: You cannot pass the text label "Female" to lm().

The solution: Dummy variables — convert each category to a 0/1 indicator.

How R handles categories

R picks one category as the reference group and creates one binary (0/1) indicator for each remaining category.

Example: Gender (reference = Male)

Person Sex female
Anna Female 1
Ben Male 0
Claudia Female 1
cps <- cps |> mutate(female = ifelse(sex == 2, 1, 0))
# OR: use factor(sex) and let R create the dummies automatically

Tip

Using factor() with the levels argument lets you set the reference group manually.

The first element in the levels vector will be the reference group (otherwise: defaults to alphabetical order).

Interpreting a dummy coefficient

Model: earnwke ~ age + uhours + female

Term Estimate Std. Error p-value
(Intercept) -853.89 250.56 0.00
age 13.01 3.28 0.00
uhours 39.99 4.84 0.00
female -69.09 76.55 0.37

→ Female analysts earn, on average, approximately $69 per week less than male analysts of the same age and hours; note, however, that with p = 0.37 this difference is not statistically significant.

Where do dummy coefficients come from?

In a simple regression of earnings on gender only, the coefficient equals the raw mean difference:

cps |>
  group_by(female) |>
  summarise(mean_earn = round(mean(earnwke), 1)) |>
  mutate(diff_vs_male = round(mean_earn - mean_earn[female == 0], 1)) |> 
  flextable() |> 
  theme_vanilla()

female mean_earn diff_vs_male
0 1,417.7 0.0
1 1,199.3 -218.4

coef(lm(earnwke ~ female, data = cps))["female"] |> round(1)
female 
-218.4 

Important

With controls (age, uhours), the coefficient changes — it is the difference holding other variables constant, not the raw group gap.
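A quick sketch of this raw-vs-adjusted comparison, assuming the cps data frame with earnwke, female, age, and uhours as used throughout the lecture:

```r
# Raw gap: a regression on female alone reproduces the raw mean difference
raw <- lm(earnwke ~ female, data = cps)

# Adjusted gap: the same comparison holding age and weekly hours constant
adj <- lm(earnwke ~ female + age + uhours, data = cps)

coef(raw)["female"]  # raw group gap (about -218 per the slides)
coef(adj)["female"]  # typically shrinks once controls are added
```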

Multi-category variables

What if we want education as three levels, not a binary college/no-college split?

  • Pick a reference category (e.g. “No degree”)
  • R creates k − 1 dummy variables for k categories
  • Each coefficient = difference from the reference group
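As a sketch of the mechanics, you can let lm() build the k − 1 dummies from a factor (this assumes edu_cat is already a factor with "No degree" as its first level):

```r
# lm() automatically expands the 3-level factor into 2 dummy variables
m_edu <- lm(earnwke ~ edu_cat, data = cps)
coef(m_edu)
# The coefficients edu_catBachelor's and edu_catGraduate give each group's
# difference in mean weekly earnings relative to the "No degree" reference
```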
cps |>
  group_by(edu_cat) |>
  summarise(mean_earn = round(mean(earnwke), 1)) |>
  mutate(vs_no_degree = round(mean_earn - mean_earn[edu_cat == "No degree"], 1)) |>
  flextable() |>
  theme_vanilla() |> 
  autofit()

edu_cat mean_earn vs_no_degree
No degree 1,014.1 0.0
Bachelor's 1,321.2 307.1
Graduate 1,972.4 958.3

Why the reference category matters

Same model, different reference — different numbers, same underlying story.

  • Reference = No degree: Bachelor +$307, Graduate +$958
  • Reference = Bachelor: No degree −$307, Graduate +$651
  • Reference = Graduate: No degree −$958, Bachelor −$651

How does R choose the reference by default?

R orders factor levels alphabetically — the first level alphabetically becomes the reference.

factor(c("Graduate", "No degree", "Bachelor's"))
# Levels: Bachelor's  Graduate  No degree   ← Bachelor's is reference (B < G < N)

Easiest fix: set levels manually

factor(c("Graduate", "No degree", "Bachelor's"), levels = c("No degree", "Bachelor's", "Graduate"))
# Levels: No degree  Bachelor's  Graduate   ← No degree is reference

Ex-post override: change the reference after creating the factor with relevel():

relevel(factor(edu), ref = "No degree")
# Now "No degree" is the reference

Interaction terms

Interaction terms: motivation

A regression without interactions assumes: The effect of age on earnings is the same for men and women.

But is that realistic?

  • Career interruptions are not equally distributed across genders
  • Seniority and promotion patterns differ
  • The age–earnings profile may be steeper for men

An interaction term lets the slope of age differ by gender.
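In R's formula syntax this is a one-character change; a sketch, using the cps data from the lecture:

```r
# age * female is shorthand for age + female + age:female
m_int <- lm(earnwke ~ age * female, data = cps)

# Equivalent, spelled out term by term:
m_int2 <- lm(earnwke ~ age + female + age:female, data = cps)
```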

Visualising an interaction

cps |>
  mutate(gender = ifelse(female == 1, "Female", "Male")) |>
  ggplot(aes(x = age, y = earnwke, colour = gender)) +
  geom_point(alpha = 0.2, size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_manual(values = c("Male" = "#00395B", "Female" = "#69AACD")) +
  labs(x = "Age", y = "Weekly earnings (USD)", colour = NULL,
       title = "Are the age–earnings slopes the same for men and women?") +
  theme_minimal()

But do the slopes really differ significantly, or is it just a level difference plus noise? 🤔

Interpreting an interaction coefficient

Model: earnwke ~ age * female (shorthand for: age + female + age:female)

Term Estimate Std. Error p-value
(Intercept) 689.545 235.542 0.004
age 18.543 5.766 0.001
female 68.082 305.564 0.824
age:female -7.229 7.500 0.336
  • age: slope of age for men — each additional year adds about $18.54 to male weekly earnings
  • female: level difference at age = 0 — not directly interpretable
  • age:female: extra slope for women — how much the age slope differs for women

The marginal effect of age

With interactions, “the effect of age” depends on the group.

Term Estimate Std. Error p-value
(Intercept) 689.545 235.542 0.004
age 18.543 5.766 0.001
female 68.082 305.564 0.824
age:female -7.229 7.500 0.336
b <- coef(lm(earnwke ~ age * female, data = cps))
  • Age slope for men: round(b["age"], 2) = 18.54
  • Age slope for women: round(b["age"] + b["age:female"], 2) = 11.31
  • Difference: round(b["age:female"], 2) = -7.23

\[\frac{\partial \text{earnwke}}{\partial \text{age}} = \hat{\beta}_\text{age} + \hat{\beta}_{\text{age:female}} \times \text{female}\]

→ Always report group-specific slopes — the interaction coefficient alone is difficult to communicate.
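One way to sanity-check the group-specific slopes is via predict(): increase age by one year in each group and look at the change in fitted earnings. A sketch, with the same cps data and model as above:

```r
m <- lm(earnwke ~ age * female, data = cps)

# Fitted earnings at age 40 and 41, separately for men and women
nd <- data.frame(age = c(40, 41, 40, 41), female = c(0, 0, 1, 1))
p  <- predict(m, newdata = nd)

p[2] - p[1]  # age slope for men   (beta_age, about 18.5)
p[4] - p[3]  # age slope for women (beta_age + beta_age:female, about 11.3)
```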

Live coding demo

Live coding demo

We return to the CPS data and extend the take-home analysis:

  1. Add female back from the raw CPS file
  2. Fit Model 3 from the take-home, now including gender
  3. Recode grade92 from binary to three education categories
  4. Visualise: do male and female age-earnings profiles look parallel?
  5. Add an age × female interaction
  6. Compare all models in one clean modelsummary table

Note

The full code and output are available as lecture notes on the course website. You do not need to take notes — focus on understanding the steps.

Exercise

Exercise: hotel prices in Vienna

Business question: What drives hotel prices — and does the price penalty for being far from the city centre differ between budget and luxury hotels?

Dataset: 428 hotels in Vienna (Békés & Kézdi 2021). Download from OSF: osf.io/4e6d8

Your tasks:

  1. Explore the data — understand what one observation represents
  2. Model price as a function of distance and stars (numeric, then as a factor)
  3. Add an interaction: does the distance effect vary by star category?
  4. Present all models in a clean modelsummary table
  5. Write a plain-language paragraph for a non-technical manager
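A possible skeleton for tasks 2–4; the variable names price, distance, and stars are assumptions, so check the codebook after downloading:

```r
# Models of increasing complexity for the Vienna hotel data
m1 <- lm(price ~ distance, data = hotels)                  # distance only
m2 <- lm(price ~ distance + stars, data = hotels)          # stars as numeric
m3 <- lm(price ~ distance + factor(stars), data = hotels)  # stars as categories
m4 <- lm(price ~ distance * factor(stars), data = hotels)  # distance slope per star category

modelsummary(list(m1, m2, m3, m4), gof_map = c("nobs", "r.squared"))
```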

Take-Home Task 2

Take-Home Task 2

Assigned today — due 30 April 2026, 08:00

Dataset: Hotels Europe — prices and features across 46 European cities (Békés & Kézdi 2021)

  • Build four regression models of increasing complexity (distance → city type → stars → interaction)
  • Use modelsummary for regression tables and flextable for descriptive statistics
  • Visualise the price–distance relationship before interpreting the model
  • Reflect critically on what your model leaves out

Full task description distributed via GitHub Classroom (link on the course website).

See you on 30 April

Next session: Thu, 30 April — Modelling binary and categorical outcomes

Questions during the break? GitHub Discussions

Task 2 deadline: 30 April 2026, 08:00

The lecture notes extending Task 1 are available now on the course website.