Regression
Chapter 5.2 - 5.4

Today’s goals


  1. Regression model with a single categorical explanatory variable
  2. Understand what an offset is

Simple Categorical Linear Regression

  • Definition: Simple Categorical Linear Regression models the relationship between a response (y) variable and one categorical explanatory (x) variable.
  • The model now gives us the difference for each category relative to a baseline category, which serves as the point of comparison.
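
To make the idea of "differences relative to a baseline" concrete, here is a minimal sketch on a made-up data set. The data frame `toy`, its columns `group` and `y`, and the values in it are hypothetical and are not taken from the penguins data. It relies only on base R and on R's default treatment of factors in `lm()`.

```r
# Hypothetical toy data: three groups with two observations each
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  y     = c(1, 3, 4, 6, 10, 12)
)

# Fit a regression with one categorical explanatory variable
fit <- lm(y ~ group, data = toy)

# Group means: a = 2, b = 5, c = 11
group_means <- tapply(toy$y, toy$group, mean)
group_means

# The intercept is the mean of the baseline group "a" (2);
# the other coefficients are differences from it (5 - 2 = 3 and 11 - 2 = 9)
coef(fit)
```

The intercept reproduces the mean of the first factor level, and every other coefficient is that group's mean minus the baseline mean.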

Example

Consider the Palmer Penguins dataset.

We want to use species to predict body_mass_g. In other words, we think penguins' body mass varies by species.

Recall our dataset has 3 species: Adelie, Gentoo, Chinstrap

library(palmerpenguins)  # provides the penguins data
library(skimr)           # provides skim()

# Build a model
model_species <- lm(body_mass_g ~ species, data = penguins)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

summary(model_species)

Call:
lm(formula = body_mass_g ~ species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max 
-1126.02  -333.09   -33.09   316.91  1223.98 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3700.66      37.62   98.37   <2e-16 ***
speciesChinstrap    32.43      67.51    0.48    0.631    
speciesGentoo     1375.35      56.15   24.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697,    Adjusted R-squared:  0.6677 
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16

Regression Equation

Below is the condensed summary output.

summary(model_species)$coefficients
                   Estimate Std. Error    t value      Pr(>|t|)
(Intercept)      3700.66225   37.61935 98.3712321 2.488024e-251
speciesChinstrap   32.42598   67.51168  0.4803018  6.313226e-01
speciesGentoo    1375.35401   56.14797 24.4951686  5.420612e-77

What is the baseline species?


What is the regression equation?

Indicator function

An indicator function takes the value 1 if its condition is true and 0 otherwise.

\[\mathbb{1}_{gentoo}(x) = \begin{cases} 1 & \text{if } x \text{ is Gentoo}\\ 0 & \text{if } x \text{ is NOT Gentoo} \end{cases}\]
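
As a rough sketch, the indicator function can be written in R as a comparison converted to an integer. The helper name `indicator_gentoo` is hypothetical and used only for illustration.

```r
# Hypothetical helper: returns 1 where the species is Gentoo, 0 otherwise
indicator_gentoo <- function(x) as.integer(x == "Gentoo")

# One value per species name
gentoo_flags <- indicator_gentoo(c("Adelie", "Chinstrap", "Gentoo"))
gentoo_flags
#> [1] 0 0 1
```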

Indicator function

Recall our model:

\[\widehat{bodymass} = b_0 + b_1*\mathbb{1}_{chinstrap}(x)+ b_2*\mathbb{1}_{gentoo}(x)\]

Consider the 3 randomly sampled penguins below.

# A tibble: 3 × 7
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe              41.7          14.7               210        4700
2 Adelie  Torgersen           45.8          18.9               197        4150
3 Gentoo  Biscoe              45.1          14.4               210        4400
# ℹ 1 more variable: sex <fct>

When using the model equation, for which of the above penguins would you set \(\mathbb{1}_{gentoo}(x) = 1\)? What about \(\mathbb{1}_{chinstrap}(x) = 1\)?

Interpretation

\[\widehat{bodymass} = b_0 + b_1*\mathbb{1}_{chinstrap}(x)+ b_2*\mathbb{1}_{gentoo}(x)\]

  • Interpretation of \(b_0\): expected value of y for the baseline; the expected body mass for the Adelie species is 3700.7 g.

  • Interpretation of \(b_1\): offset in intercept; the expected body mass for the Chinstrap species is 32.4 g more on average than for the Adelie species.

  • Interpretation of \(b_2\): offset in intercept; the expected body mass for the Gentoo species is 1375.4 g more on average than for the Adelie species.

  • Which penguin species weighs the least on average?

  • Which penguin species weighs the most on average?
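
The interpretations above can be checked by plugging each species into the model equation. The sketch below copies the coefficient estimates from the summary output for model_species. The helper `predict_mass` is a hypothetical name, and the calculation is plain arithmetic in base R rather than a call to the fitted model.

```r
# Coefficient estimates copied from the summary output for model_species
b0 <- 3700.66225  # intercept: expected body mass (g) for the baseline, Adelie
b1 <- 32.42598    # offset for Chinstrap
b2 <- 1375.35401  # offset for Gentoo

# Hypothetical helper: the model equation written with indicator terms.
# A logical comparison is treated as 1 (TRUE) or 0 (FALSE) in arithmetic.
predict_mass <- function(species) {
  b0 + b1 * (species == "Chinstrap") + b2 * (species == "Gentoo")
}

species_names <- c("Adelie", "Chinstrap", "Gentoo")
expected_mass <- setNames(predict_mass(species_names), species_names)
round(expected_mass, 1)
#>    Adelie Chinstrap    Gentoo 
#>    3700.7    3733.1    5076.0 
```

For an Adelie penguin both indicators are 0, so only the intercept remains. For the other two species exactly one offset is added.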

Extra information

  • There will always be \(n-1\) coefficients for the categorical variable, where \(n\) is the number of categories.

    • The penguins had 3 species categories so there were 2 species coefficients.
  • Categorical variable coefficients will always be an offset to either the intercept or the slope, because you can't multiply a number by a category label: 32.43 * Chinstrap has no meaning!
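
One way to see both points is to look at the design matrix R builds for a factor. This is a sketch on a hypothetical three-value factor that mirrors the species names, not on the penguins data itself. It assumes R's default treatment contrasts, under which the first factor level acts as the baseline.

```r
# Hypothetical three-level factor mirroring the penguin species
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))

# Levels are sorted alphabetically; the first one is the default baseline
levels(species)

# The design matrix replaces the labels with 0/1 indicator columns
X <- model.matrix(~ species)
colnames(X)
#> [1] "(Intercept)"      "speciesChinstrap" "speciesGentoo"

# Number of indicator columns: 3 categories - 1 = 2
ncol(X) - 1
```

The baseline level gets no column of its own, which is why a variable with three categories produces only two coefficients.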