Data Wrangling
Chapter 3.0 - 3.3

Today’s goals


  1. Understand the pipe operator
  2. Subset rows of a dataset based on conditions
  3. Calculate summary statistics of a variable

Artwork by @allison_horst

Data Wrangling

In real data applications it is rare that data will be in the “perfect” form for us to use. We will typically need to manipulate or wrangle data into a form that is useful using the dplyr library.

The ‘d’ stands for data frames, and the ‘plyr’ is the name of another package that the R developers called pliers. Think of it as a set of metaphorical pliers used for working with data sets and data frames.

  • The pipe operator %>% : takes the output of one function and passes it into another function as an argument.
data %>% 
  step 1 %>% 
  step 2

filter()

  • filter() is used to subset a data frame, retaining all rows that satisfy your conditions.

Operators for the filter Function

Operator Definition
== equal
> greater than
>= greater than or equal to
Operator Definition
!= not equal
< less than
<= less than or equal to

Example 1

Consider the penguins dataset.

Subset the dataset to only include ‘male’ penguins.

penguins %>% 
  filter(            )
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Example 1

Consider the penguins dataset.

Subset the dataset to only include ‘male’ penguins.

penguins %>% 
  filter(sex == "male")
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


Remember to store your data with a new meaningful name.

Example 1

Consider the penguins dataset.

Subset the dataset to only include ‘male’ penguins.

penguins_male <- penguins %>% 
  filter(sex == "male")
glimpse(penguins_male)
Rows: 168
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.3, 39.2, 38.6, 34.6, 42.5, 46.0, 37.7, 38.2…
$ bill_depth_mm     <dbl> 18.7, 20.6, 19.6, 21.2, 21.1, 20.7, 21.5, 18.7, 18.1…
$ flipper_length_mm <int> 181, 190, 195, 191, 198, 197, 194, 180, 185, 180, 18…
$ body_mass_g       <int> 3750, 3650, 4675, 3800, 4400, 4500, 4200, 3600, 3950…
$ sex               <fct> male, male, male, male, male, male, male, male, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


Remember to store your data with a new meaningful name.

Example 2

Subset the dataset to only include ‘male’ penguins that weigh more than 4000 grams.

penguins_male <- penguins %>% 
  filter(sex == "male")
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Example 2

Subset the dataset to only include ‘male’ penguins that weigh more than 4000 grams.

penguins_heavy_m <- penguins %>% 
  filter(sex == "male", body_mass_g > 4000)

and can be represented by either a comma , or the & symbol

penguins_heavy_m <- penguins %>% 
  filter(sex == "male" & body_mass_g > 4000)


Note: Since we already created an only males dataset you could have started with that and proceeded to filter based on weight.

penguins_heavy_m <- penguins_male %>% 
  filter(body_mass_g > 4000)
glimpse(penguins_heavy_m)
Rows: 109
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Dream, D…
$ bill_length_mm    <dbl> 39.2, 34.6, 42.5, 46.0, 39.2, 39.8, 44.1, 39.6, 42.3…
$ bill_depth_mm     <dbl> 19.6, 21.1, 20.7, 21.5, 21.1, 19.1, 19.7, 18.8, 21.2…
$ flipper_length_mm <int> 195, 198, 197, 194, 196, 184, 196, 190, 191, 188, 20…
$ body_mass_g       <int> 4675, 4400, 4500, 4200, 4150, 4650, 4400, 4600, 4150…
$ sex               <fct> male, male, male, male, male, male, male, male, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


ALWAYS inspect your new dataset to make sure everything is as expected.

summarize()

summarize() function reduces a data frame to a summary of just the specified variable.


Useful summary statistics:

data_sumamry <- data %>% 
  summarize(mean = mean(variable),
            median = median(variable),
            sd = sd(variable),
            iqr = IQR(variable),
            count = n())

Now we can add calculations to describe the center and spread of our distribution plots!

Understanding summary statistics

Which of the following summary statistics are invariant to outliers?

a) mean     
b) median   
c) sd    
d) IQR

How would the standard deviation of a distribution change if the maximum value was doubled?

a) increase
b) decrease
c) stay the same
d) not enough information

Example 3

Consider the penguins dataset.

Summarize the average flipper_length_mm and count for penguins that are either “Adelie” species or “Gentoo” species. Which of the following is correct?

# a)
penguins_summary_a <- penguins %>% 
  filter(species == "Adelie" | "Gentoo") %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() )
# b)
penguins_summary_b <- penguins %>% 
  filter(species == "Adelie" | species == "Gentoo") %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() )
#c)
penguins_summary_c <- penguins %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() ) %>% 
  filter(species == "Adelie" & species == "Gentoo")
#d)
penguins_summary_d <- penguins %>% 
  filter(species == "Adelie" & species == "Gentoo") %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n(flipper_length_mm) )


Example 4

What happens if a variable has missing data?

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
flipper_length_mm 2 0.99 200.92 14.06 172 190 197 213 231 ▂▇▃▅▂
penguins_summary <- penguins %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() )
penguins_summary
# A tibble: 1 × 2
   mean count
  <dbl> <int>
1    NA   344

Example 4 (continued)

Either set na.rm = TRUE OR filter out the missing data before summarizing.

penguins_summary <- penguins %>% 
  summarize(mean = mean(flipper_length_mm, na.rm = TRUE),
            count = n() )
penguins_summary
# A tibble: 1 × 2
   mean count
  <dbl> <int>
1  201.   344
penguins_summary <- penguins %>% 
  filter(!is.na(flipper_length_mm)) %>%
  summarize(mean = mean(flipper_length_mm),
            count = n() )
penguins_summary
# A tibble: 1 × 2
   mean count
  <dbl> <int>
1  201.   342

Important information

  • Order of pipe operations matters!
  • If you are using & and | in the same filter be very careful and use parenthesis.
  • Missing data is entered as NA in a data frame cell and can produce an error when using summarize unless you set na.rm = TRUE (ie: remove the NA values).