Data Wrangling
Chapter 3.4 - 3.9

Today’s goals


  1. More piping!
  2. group data based on variables
  3. create a new variable using mutate
  4. subset a dataset based on columns
  5. arrange data in both ascending and descending order

Artwork by @allison_horst

Data Wrangling

In conjunction with past tools (filter() and summarize()) we will use:

  • group_by()
  • mutate()
  • arrange()
  • select()
  • and a few others.

Rarely do we only use one of these tools in isolation. We “pipe” them together to wrangle data.

group_by()

group_by(variable) takes an existing data frame and converts it into a grouped data frame

  • operations are performed by group
  • must be used in conjunction with another operator, Ex: summarize()
  • can group by multiple criteria but must be included in the same function call

Example: group_by()

Calculate the average body mass of penguins by species.

penguins %>% 
  group_by(species) %>% 
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  species   mean_mass
  <fct>         <dbl>
1 Adelie        3701.
2 Chinstrap     3733.
3 Gentoo        5076.
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Calculate the average body mass of male and female penguins for each species. (ie: calculate the average body mass of penguins by species and sex.)

penguins %>% 
  group_by(species, sex) %>% 
  summarize(mean_mass = mean(body_mass_g))
# A tibble: 8 × 3
# Groups:   species [3]
  species   sex    mean_mass
  <fct>     <fct>      <dbl>
1 Adelie    female     3369.
2 Adelie    male       4043.
3 Adelie    <NA>         NA 
4 Chinstrap female     3527.
5 Chinstrap male       3939.
6 Gentoo    female     4680.
7 Gentoo    male       5485.
8 Gentoo    <NA>         NA 

mutate()

mutate( ) create a new variable (in the same dataset)

Example:

Create a new variable body_mass_lb where we calculate the penguins weight in pounds. Hint: \(pounds \approx grams/453.6\)

penguins <- penguins %>%
  mutate(body_mass_lb = body_mass_g/453.6)
glimpse(penguins)
Rows: 344
Columns: 9
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
$ body_mass_lb      <dbl> 8.267196, 8.377425, 7.164903, NA, 7.605820, 8.046737…

Notice we overwrote the penguins dataset. You can ONLY overwrite the original data if you are not changing any of the original information. Otherwise store it as something NEW.

arrange()

  • arrange(variable) sorts data in ascending (low to high; A to Z) order by default.
  • arrange(desc(variable)) sorts data in descending (high to low; Z to A) order.

Example:

penguins %>%
  arrange(desc(bill_length_mm))
# A tibble: 344 × 9
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo    Biscoe           59.6          17                 230        6050
 2 Chinstrap Dream            58            17.8               181        3700
 3 Gentoo    Biscoe           55.9          17                 228        5600
 4 Chinstrap Dream            55.8          19.8               207        4000
 5 Gentoo    Biscoe           55.1          16                 230        5850
 6 Gentoo    Biscoe           54.3          15.7               231        5650
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Gentoo    Biscoe           53.4          15.8               219        5500
10 Chinstrap Dream            52.8          20                 205        4550
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_lb <dbl>

select()

select() subsets the dataset based on named columns

Sometimes datasets contain a LOT of information that you do not need for your analysis. select() will help us simplify the dataset and make it easier to use.

In the penguins dataset, say we are only interested in the body measurement columns.

penguins_body <- penguins %>% 
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)
# specify variables to remove
penguins_body <- penguins %>% 
  select(-species, -island, -sex, -year)
# same thing different notation
penguins_body <- penguins %>% 
  select(-c(species, island, sex, year) )
head(penguins_body)
# A tibble: 6 × 5
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g body_mass_lb
           <dbl>         <dbl>             <int>       <int>        <dbl>
1           39.1          18.7               181        3750         8.27
2           39.5          17.4               186        3800         8.38
3           40.3          18                 195        3250         7.16
4           NA            NA                  NA          NA        NA   
5           36.7          19.3               193        3450         7.61
6           39.3          20.6               190        3650         8.05

Other Useful Data Wrangling Functions

  • count(variable) : shortcut to group_by %>% summarize(n())
  • slice_max(variable, n = 1) : return the largest “n” values
  • slice_min(variable, n = 1) : return the smallest “n” values

Data Wrangling Example

What if we want to start with our penguins dataset, then keep only observations that are not missing their body_mass_g, and then calculate the count by species, and then order the dataset from low to high.

a)
penguins %>% 
  filter(!is.na(body_mass_g)) %>% 
  group_by(species) %>% 
  summarize(count = n()) %>% 
  arrange(count)
b)
penguins %>% 
  select(body_mass_g) %>% 
  group_by(species) %>% 
  summarize(count = n()) %>% 
  arrange(count)
c)
penguins %>% 
  filter(!is.na(body_mass_g)) %>% 
  group_by(species) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
d)
penguins %>% 
  select(!is.na(body_mass_g)) %>% 
  summarize(count = n()) %>% 
  group_by(species) %>% 
  arrange(count)