Data Wrangling
Chapter 3.0 - 3.3

Today’s goals

Understand the pipe operator
Subset rows of a dataset based on conditions
Calculate summary statistics of a variable

Artwork by @allison_horst

Data Wrangling

In real data applications it is rare that data will be in the “perfect” form for us to use. We will typically need to manipulate or wrangle data into a form that is useful using the dplyr library.

The ‘d’ stands for data frames, and the ‘plyr’ is the name of another package that the R developers called pliers. Think of it as a set of metaphorical pliers used for working with data sets and data frames.

The pipe operator %>% : takes the output of one function and passes it into another function as an argument.

data %>% 
  step 1 %>% 
  step 2

filter()

filter() is used to subset a data frame, retaining all rows that satisfy your conditions.

Operators for the filter Function

Operator	Definition
`==`	equal
`>`	greater than
`>=`	greater than or equal to

Operator	Definition
`!=`	not equal
`<`	less than
`<=`	less than or equal to

Example 1

Consider the penguins dataset.

Subset the dataset to only include ‘male’ penguins.

Code
Data

penguins %>% 
  filter(            )

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Example 1

Consider the penguins dataset.

Subset the dataset to only include ‘male’ penguins.

Code
Data

penguins %>% 
  filter(sex == "male")

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Remember to store your data with a new meaningful name.

Example 1

Consider the penguins dataset.

Subset the dataset to only include ‘male’ penguins.

Code
New Data

penguins_male <- penguins %>% 
  filter(sex == "male")

glimpse(penguins_male)

Rows: 168
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.3, 39.2, 38.6, 34.6, 42.5, 46.0, 37.7, 38.2…
$ bill_depth_mm     <dbl> 18.7, 20.6, 19.6, 21.2, 21.1, 20.7, 21.5, 18.7, 18.1…
$ flipper_length_mm <int> 181, 190, 195, 191, 198, 197, 194, 180, 185, 180, 18…
$ body_mass_g       <int> 3750, 3650, 4675, 3800, 4400, 4500, 4200, 3600, 3950…
$ sex               <fct> male, male, male, male, male, male, male, male, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Remember to store your data with a new meaningful name.

Example 2

Subset the dataset to only include ‘male’ penguins that weigh more than 4000 grams.

Code
Data

penguins_male <- penguins %>% 
  filter(sex == "male")

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Example 2

Subset the dataset to only include ‘male’ penguins that weigh more than 4000 grams.

Code
Alternate Code
New Data

penguins_heavy_m <- penguins %>% 
  filter(sex == "male", body_mass_g > 4000)

and can be represented by either a comma , or the & symbol

penguins_heavy_m <- penguins %>% 
  filter(sex == "male" & body_mass_g > 4000)

Note: Since we already created an only males dataset you could have started with that and proceeded to filter based on weight.

penguins_heavy_m <- penguins_male %>% 
  filter(body_mass_g > 4000)

glimpse(penguins_heavy_m)

Rows: 109
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Dream, D…
$ bill_length_mm    <dbl> 39.2, 34.6, 42.5, 46.0, 39.2, 39.8, 44.1, 39.6, 42.3…
$ bill_depth_mm     <dbl> 19.6, 21.1, 20.7, 21.5, 21.1, 19.1, 19.7, 18.8, 21.2…
$ flipper_length_mm <int> 195, 198, 197, 194, 196, 184, 196, 190, 191, 188, 20…
$ body_mass_g       <int> 4675, 4400, 4500, 4200, 4150, 4650, 4400, 4600, 4150…
$ sex               <fct> male, male, male, male, male, male, male, male, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

ALWAYS inspect your new dataset to make sure everything is as expected.

summarize()

summarize() function reduces a data frame to a summary of just the specified variable.

Useful summary statistics:

data_sumamry <- data %>% 
  summarize(mean = mean(variable),
            median = median(variable),
            sd = sd(variable),
            iqr = IQR(variable),
            count = n())

Now we can add calculations to describe the center and spread of our distribution plots!

Understanding summary statistics

Which of the following summary statistics are invariant to outliers?

a) mean     
b) median   
c) sd    
d) IQR

How would the standard deviation of a distribution change if the maximum value was doubled?

a) increase
b) decrease
c) stay the same
d) not enough information

Example 3

Consider the penguins dataset.

Summarize the average flipper_length_mm and count for penguins that are either “Adelie” species or “Gentoo” species. Which of the following is correct?

# a)
penguins_summary_a <- penguins %>% 
  filter(species == "Adelie" | "Gentoo") %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() )
# b)
penguins_summary_b <- penguins %>% 
  filter(species == "Adelie" | species == "Gentoo") %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() )
#c)
penguins_summary_c <- penguins %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() ) %>% 
  filter(species == "Adelie" & species == "Gentoo")
#d)
penguins_summary_d <- penguins %>% 
  filter(species == "Adelie" & species == "Gentoo") %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n(flipper_length_mm) )

Example 4

What happens if a variable has missing data?

Missingness
Summarize

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
flipper_length_mm	2	0.99	200.92	14.06	172	190	197	213	231	▂▇▃▅▂

penguins_summary <- penguins %>% 
  summarize(mean = mean(flipper_length_mm),
            count = n() )
penguins_summary

# A tibble: 1 × 2
   mean count
  <dbl> <int>
1    NA   344

Example 4 (continued)

Either set na.rm = TRUE OR filter out the missing data before summarizing.

na.rm
filter

penguins_summary <- penguins %>% 
  summarize(mean = mean(flipper_length_mm, na.rm = TRUE),
            count = n() )
penguins_summary

# A tibble: 1 × 2
   mean count
  <dbl> <int>
1  201.   344

penguins_summary <- penguins %>% 
  filter(!is.na(flipper_length_mm)) %>%
  summarize(mean = mean(flipper_length_mm),
            count = n() )
penguins_summary

# A tibble: 1 × 2
   mean count
  <dbl> <int>
1  201.   342

Important information

Order of pipe operations matters!
If you are using & and | in the same filter be very careful and use parenthesis.
Missing data is entered as NA in a data frame cell and can produce an error when using summarize unless you set na.rm = TRUE (ie: remove the NA values).

Data Wrangling Chapter 3.0 - 3.3

Today’s goals

Data Wrangling

filter()

Example 1

Example 1

Example 1

Example 2

Example 2

summarize()

Understanding summary statistics

Example 3

Example 4

Example 4 (continued)

Important information

Data Wrangling
Chapter 3.0 - 3.3