data %>%
step 1 %>%
step 2Artwork by @allison_horst
In real data applications it is rare that data will be in the “perfect” form for us to use. We will typically need to manipulate or wrangle data into a form that is useful using the dplyr library.
The ‘d’ stands for data frames, and the ‘plyr’ is the name of another package that the R developers called pliers. Think of it as a set of metaphorical pliers used for working with data sets and data frames.
%>% : takes the output of one function and passes it into another function as an argument.filter() is used to subset a data frame, retaining all rows that satisfy your conditions.Operators for the filter Function
| Operator | Definition |
|---|---|
== |
equal |
> |
greater than |
>= |
greater than or equal to |
| Operator | Definition |
|---|---|
!= |
not equal |
< |
less than |
<= |
less than or equal to |
Consider the penguins dataset.
Subset the dataset to only include ‘male’ penguins.
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Consider the penguins dataset.
Subset the dataset to only include ‘male’ penguins.
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Remember to store your data with a new meaningful name.
Consider the penguins dataset.
Subset the dataset to only include ‘male’ penguins.
Rows: 168
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.3, 39.2, 38.6, 34.6, 42.5, 46.0, 37.7, 38.2…
$ bill_depth_mm <dbl> 18.7, 20.6, 19.6, 21.2, 21.1, 20.7, 21.5, 18.7, 18.1…
$ flipper_length_mm <int> 181, 190, 195, 191, 198, 197, 194, 180, 185, 180, 18…
$ body_mass_g <int> 3750, 3650, 4675, 3800, 4400, 4500, 4200, 3600, 3950…
$ sex <fct> male, male, male, male, male, male, male, male, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Remember to store your data with a new meaningful name.
Subset the dataset to only include ‘male’ penguins that weigh more than 4000 grams.
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Subset the dataset to only include ‘male’ penguins that weigh more than 4000 grams.
and can be represented by either a comma , or the & symbol
Note: Since we already created an only males dataset you could have started with that and proceeded to filter based on weight.
Rows: 109
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Dream, D…
$ bill_length_mm <dbl> 39.2, 34.6, 42.5, 46.0, 39.2, 39.8, 44.1, 39.6, 42.3…
$ bill_depth_mm <dbl> 19.6, 21.1, 20.7, 21.5, 21.1, 19.1, 19.7, 18.8, 21.2…
$ flipper_length_mm <int> 195, 198, 197, 194, 196, 184, 196, 190, 191, 188, 20…
$ body_mass_g <int> 4675, 4400, 4500, 4200, 4150, 4650, 4400, 4600, 4150…
$ sex <fct> male, male, male, male, male, male, male, male, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
ALWAYS inspect your new dataset to make sure everything is as expected.
summarize() function reduces a data frame to a summary of just the specified variable.
Useful summary statistics:
Now we can add calculations to describe the center and spread of our distribution plots!
![]()
Which of the following summary statistics are invariant to outliers?
a) mean
b) median
c) sd
d) IQR
How would the standard deviation of a distribution change if the maximum value was doubled?
a) increase
b) decrease
c) stay the same
d) not enough information
![]()
Consider the penguins dataset.
Summarize the average flipper_length_mm and count for penguins that are either “Adelie” species or “Gentoo” species. Which of the following is correct?
# a)
penguins_summary_a <- penguins %>%
filter(species == "Adelie" | "Gentoo") %>%
summarize(mean = mean(flipper_length_mm),
count = n() )
# b)
penguins_summary_b <- penguins %>%
filter(species == "Adelie" | species == "Gentoo") %>%
summarize(mean = mean(flipper_length_mm),
count = n() )
#c)
penguins_summary_c <- penguins %>%
summarize(mean = mean(flipper_length_mm),
count = n() ) %>%
filter(species == "Adelie" & species == "Gentoo")
#d)
penguins_summary_d <- penguins %>%
filter(species == "Adelie" & species == "Gentoo") %>%
summarize(mean = mean(flipper_length_mm),
count = n(flipper_length_mm) )What happens if a variable has missing data?
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172 | 190 | 197 | 213 | 231 | ▂▇▃▅▂ |
Either set na.rm = TRUE OR filter out the missing data before summarizing.
& and | in the same filter be very careful and use parenthesis.NA in a data frame cell and can produce an error when using summarize unless you set na.rm = TRUE (ie: remove the NA values).