penguins %>%
group_by(species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))# A tibble: 3 × 2
species mean_mass
<fct> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
Artwork by @allison_horst
In conjunction with past tools (filter() and summarize()) we will use:
group_by()mutate()arrange()select()Rarely do we only use one of these tools in isolation. We “pipe” them together to wrangle data.
group_by()group_by(variable) takes an existing data frame and converts it into a grouped data frame
group_by()Calculate the average body mass of penguins by species.
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
Calculate the average body mass of male and female penguins for each species. (ie: calculate the average body mass of penguins by species and sex.)
# A tibble: 8 × 3
# Groups: species [3]
species sex mean_mass
<fct> <fct> <dbl>
1 Adelie female 3369.
2 Adelie male 4043.
3 Adelie <NA> NA
4 Chinstrap female 3527.
5 Chinstrap male 3939.
6 Gentoo female 4680.
7 Gentoo male 5485.
8 Gentoo <NA> NA
mutate()mutate( ) create a new variable (in the same dataset)
Example:
Create a new variable body_mass_lb where we calculate the penguins weight in pounds. Hint: \(pounds \approx grams/453.6\)
Rows: 344
Columns: 9
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_lb <dbl> 8.267196, 8.377425, 7.164903, NA, 7.605820, 8.046737…
Notice we overwrote the penguins dataset. You can ONLY overwrite the original data if you are not changing any of the original information. Otherwise store it as something NEW.
arrange()arrange(variable) sorts data in ascending (low to high; A to Z) order by default.arrange(desc(variable)) sorts data in descending (high to low; Z to A) order.Example:
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Chinstrap Dream 58 17.8 181 3700
3 Gentoo Biscoe 55.9 17 228 5600
4 Chinstrap Dream 55.8 19.8 207 4000
5 Gentoo Biscoe 55.1 16 230 5850
6 Gentoo Biscoe 54.3 15.7 231 5650
7 Chinstrap Dream 54.2 20.8 201 4300
8 Chinstrap Dream 53.5 19.9 205 4500
9 Gentoo Biscoe 53.4 15.8 219 5500
10 Chinstrap Dream 52.8 20 205 4550
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_lb <dbl>
select() select() subsets the dataset based on named columns
Sometimes datasets contain a LOT of information that you do not need for your analysis. select() will help us simplify the dataset and make it easier to use.
In the penguins dataset, say we are only interested in the body measurement columns.
count(variable) : shortcut to group_by %>% summarize(n())slice_max(variable, n = 1) : return the largest “n” valuesslice_min(variable, n = 1) : return the smallest “n” values![]()
What if we want to start with our penguins dataset, then keep only observations that are not missing their body_mass_g, and then calculate the count by species, and then order the dataset from low to high.
a)
b)
c)
d)
