# Build a model
model_species <- lm(body_mass_g ~ species, data = penguins)
Consider the Palmer Penguins dataset.
We want to use species to predict body_mass_g. In other words, we think penguin body mass varies by species.
Recall that our dataset has 3 species: Adelie, Gentoo, and Chinstrap.
| | |
|---|---|
| Name | penguins |
| Number of rows | 344 |
| Number of columns | 8 |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Call:
lm(formula = body_mass_g ~ species, data = penguins)

Residuals:
     Min       1Q   Median       3Q      Max
-1126.02  -333.09   -33.09   316.91  1223.98

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       3700.66      37.62   98.37   <2e-16 ***
speciesChinstrap    32.43      67.51    0.48    0.631
speciesGentoo     1375.35      56.15   24.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697, Adjusted R-squared:  0.6677
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16
Below is the condensed summary output.
                   Estimate Std. Error    t value      Pr(>|t|)
(Intercept)      3700.66225   37.61935 98.3712321 2.488024e-251
speciesChinstrap   32.42598   67.51168  0.4803018  6.313226e-01
speciesGentoo    1375.35401   56.14797 24.4951686  5.420612e-77
What is the baseline species?
What is the regression equation?
An indicator function takes the value “1” or “0”, depending on whether its condition is true.
\[\mathbb{1}_{gentoo}(x) = \begin{cases} 1 & \text{if } x \text{ is Gentoo}\\ 0 & \text{if } x \text{ is NOT Gentoo} \end{cases}\]
Recall our model:
\[\widehat{bodymass} = b_0 + b_1*\mathbb{1}_{chinstrap}(x)+ b_2*\mathbb{1}_{gentoo}(x)\]
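As a worked instance, plugging the estimates from the condensed summary output (rounded to two decimals) into the model gives the fitted equation below. Adelie has no indicator of its own, so it is the baseline species that the intercept describes.
\[\widehat{bodymass} = 3700.66 + 32.43*\mathbb{1}_{chinstrap}(x) + 1375.35*\mathbb{1}_{gentoo}(x)\]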
Consider the 3 randomly sampled penguins below.
# A tibble: 3 × 7
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe              41.7          14.7               210        4700
2 Adelie  Torgersen           45.8          18.9               197        4150
3 Gentoo  Biscoe              45.1          14.4               210        4400
# ℹ 1 more variable: sex <fct>
When using the model equation, for which of the above penguins would you set \(\mathbb{1}_{gentoo}(x) = 1\)? What about \(\mathbb{1}_{chinstrap}(x) = 1\)?
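One way to check the indicator values is to let R build them. The sketch below re-enters only the species and body mass of the three sampled penguins by hand (the `sampled` data frame is a hypothetical name used for illustration) and calls base R's `model.matrix()` to show the 0/1 columns that a formula such as `~ species` produces.

```r
# The three sampled penguins from the tibble above (species and body mass only).
# All three species are listed as factor levels so that the Chinstrap column
# appears even though no Chinstrap penguin was sampled.
sampled <- data.frame(
  species = factor(c("Gentoo", "Adelie", "Gentoo"),
                   levels = c("Adelie", "Chinstrap", "Gentoo")),
  body_mass_g = c(4700L, 4150L, 4400L)
)

# model.matrix() builds the indicator columns for the formula ~ species.
indicators <- model.matrix(~ species, data = sampled)
print(indicators)
```

The `speciesGentoo` column is 1 for penguins 1 and 3 and 0 for penguin 2, while `speciesChinstrap` is 0 for all three, because none of the sampled penguins is a Chinstrap.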
\[\widehat{bodymass} = b_0 + b_1*\mathbb{1}_{chinstrap}(x)+ b_2*\mathbb{1}_{gentoo}(x)\]
Interpretation of \(b_0\): the expected value of y for the baseline category; the expected body mass of an Adelie penguin is 3700.7 g.
Interpretation of \(b_1\): an offset to the intercept; Chinstrap penguins are expected to weigh 32.4 g more on average than Adelie penguins.
Interpretation of \(b_2\): an offset to the intercept; Gentoo penguins are expected to weigh 1375.4 g more on average than Adelie penguins.
Which penguin species weighs the least on average?
Which penguin species weighs the most on average?
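A quick way to answer both questions is to plug each species into the model equation. The sketch below hard-codes the estimates from the condensed summary output; `predict_mass` is a hypothetical helper written for this example, not a function from any package.

```r
# Coefficient estimates copied from the condensed summary output above.
b0 <- 3700.66225  # intercept: expected body mass (g) for Adelie, the baseline
b1 <- 32.42598    # offset for Chinstrap
b2 <- 1375.35401  # offset for Gentoo

# Hypothetical helper: evaluates the model equation for a species name.
# The comparisons return TRUE/FALSE, which R treats as 1/0 when multiplying.
predict_mass <- function(species) {
  b0 + b1 * (species == "Chinstrap") + b2 * (species == "Gentoo")
}

species <- c("Adelie", "Chinstrap", "Gentoo")
species_means <- setNames(predict_mass(species), species)
print(round(species_means, 1))

# Species with the smallest and largest expected body mass.
lightest <- names(which.min(species_means))
heaviest <- names(which.max(species_means))
```

The expected body masses come out to about 3700.7 g for Adelie, 3733.1 g for Chinstrap, and 5076.0 g for Gentoo.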
In a model with an intercept, there will always be \(n-1\) coefficients for the categorical variable, where \(n\) is the number of categories; the remaining category is the baseline.
Categorical variable coefficients will always be an offset to either the intercept or the slope, because you can’t multiply a number by a category name: 32.43*Chinstrap is meaningless!
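The \(n-1\) rule can be seen in how R codes a factor by default. This is a small sketch using base R's `contrasts()` on a three-level factor built by hand (the object names `species` and `coding` are assumptions for illustration); it does not use the penguins data itself.

```r
# A factor with the three species as its levels (Adelie comes first alphabetically).
species <- factor(c("Adelie", "Chinstrap", "Gentoo"))

# contrasts() shows the default treatment coding: one row per level,
# one 0/1 indicator column per non-baseline level.
coding <- contrasts(species)
print(coding)

# With n = 3 levels there are n - 1 = 2 indicator columns.
ncol(coding) == nlevels(species) - 1
```

The Adelie row is all zeros, which is what makes it the baseline: its expected value is carried by the intercept alone.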
