9 Data Filtering
9.1 Introduction & packages
This tutorial explains how to filter, select, and wrangle data in R using both Tidyverse and Base R approaches.
Both approaches are valid, but they differ in syntax and philosophy. Tidyverse functions (from dplyr) use a pipeline-based approach, whereas Base R relies on indexing and built-in functions.
# Load necessary packages
library(tidyverse) # Core data wrangling
library(palmerpenguins) # Example dataset
library(skimr) # Quick summary stats
library(GGally) # Pair plots
library(ggplot2) # VisualizationWe will use the penguins dataset from the palmerpenguins package to demonstrate data wrangling techniques.
data("penguins") # Load dataset9.2 Selecting COLUMNS
9.2.1 Selecting a specific column using $
Sometimes we want to deal with only one specific column in our dataset, for example applying the mean command to say just one column.
To do this, we use the $ symbol. For example, here I’m simply selecting the data in the elevation column only and saving it to a new variable called elevationdata.
mean_mass <- penguins$body_mass_g
mean_massTry it yourself. You should have seen that as you typed the $, it gave you all the available column names to choose from. This means we can now easily summarise specific columns. For example:
-
summary(penguins)will create a summary of the whole spreadsheet, -
summary(penguins$Price)will only summarise the Price column. -
mean(penguins$Price)will take the mean of the Price column in the HousesNY dataframe.
9.2.2 Selecting multiple columns
Remember you have the names() command to help find these and that the names are CASE SENSITIVE. If we want to work with several specific columns (e.g., species, bill_length_mm, and flipper_length_mm), we can extract them like this:
Tidyverse Approach
selected_data <- penguins %>% dplyr::select(species, bill_length_mm, flipper_length_mm)
head(selected_data)## # A tibble: 6 × 3
## species bill_length_mm flipper_length_mm
## <fct> <dbl> <int>
## 1 Adelie 39.1 181
## 2 Adelie 39.5 186
## 3 Adelie 40.3 195
## 4 Adelie NA NA
## 5 Adelie 36.7 193
## 6 Adelie 39.3 190
Base R Approach
selected_data <- penguins[, c("species", "bill_length_mm", "flipper_length_mm")]
head(selected_data)## # A tibble: 6 × 3
## species bill_length_mm flipper_length_mm
## <fct> <dbl> <int>
## 1 Adelie 39.1 181
## 2 Adelie 39.5 186
## 3 Adelie 40.3 195
## 4 Adelie NA NA
## 5 Adelie 36.7 193
## 6 Adelie 39.3 190
9.3 SELECTING SPECIFIC CELLS
Sometimes, we do not want to analyse at the entire data.frame. Instead, we would like to only look at one (or more) columns or rows.
There are several ways we can select data.
To choose a specific column, we can use the
$symbol to select its name (as described above)If you know which number rows or columns you want, you can use square brackets to numerically select data. Essentially our data follows the matrix format format:
\[
tablename [ ROWS , COLUMNS ]
\]
Some examples:
# This will select the data in the 5th row and 7th column
penguins[5,7]
# This will select the 2nd row and ALL the columns
penguins[2,]
# This will select the 3rd column and ALL the rows
penguins[,3]
# similar to using its name
penguins$island
# We can combine our commands, this will print the 13th row of the body mass column
# (no comma as we're only looking at one column)
penguins$body_mass_g[13]
# The : symbol lets you choose a sequence of numbers e.g. 1:5 is 1 2 3 4 5
# So this prints out rows 11 to 15 and all the columns
penguins[11:15,]
# The "c" command allows you to enter whatever numbers you like.
# So this will print out rows 4,3,7 and the "Elevation" and "Dist_to_Coast" columns
penguins[c(4,3,7), c("island","body_mass_g")]9.3.1 Deleting data
Or if you know the row or column number you can use the minus - sign to remove. Or use filter..
# remove row 6 and and overwrite
penguins <- penguins[-6 ,]
# remove columns 4 and 2 and save result to newdata
newdata <- penguins[, - c(2,4) ]9.4 Selecting ROWS
Filtering means selecting only the rows that meet certain conditions.
-
which()helps find row numbers that match a condition. -
filter()(fromdplyrtidyverse) allows for clear, intuitive filtering of data. - Logical operators (
&,|,!=,%in%) provide additional flexibility in specifying conditions.
9.4.1 Random sampling
Dplyr’s sample_slice command will do this for us. Here I randomly selected 4 rows.
penguinsample <- dplyr::slice_sample(penguins, n=4)
#check it worked
glimpse(penguinsample)9.4.2 Tidyverse/dplyr filter command
We can also use the tidyverse approach, the dplyr::filter() function. All the condition symbols e.g. !=NOT etc also work for the which command,
Selecting all Adelie penguins:
dplyr::filter(penguins, species == "Adelie")Selecting multiple species using %in%:
Excluding a species using !=:
dplyr::filter(penguins, species != "Gentoo")Filtering penguins with a flipper length greater than 200 mm:
dplyr::filter(penguins, flipper_length_mm > 200)Filtering penguins with a body mass less than or equal to 3000 g:
Filtering with multiple conditions:
dplyr::filter(penguins, flipper_length_mm > 180 & species == "Adelie")Using | (OR condition) to select penguins with either a bill_depth_mm greater than 18 or flipper_length_mm longer than 210 mm:
9.4.3 BaseR which() command
The which() command helps us identify row indices that meet a condition.
For example, identifying numbers greater than 107 in a sequence:
a <- 100:110
which(a > 107)I like it because you can read the command as a sentence e.g. which numbers in “a” are greater than 107, Or which penguins have a body_mass_g less than or equal to 3000g:
outlier_rows <- which(penguins$body_mass_g <= 3000)To find the row corresponding to the penguin with the smallest bill length:
row_number <- which(penguins$bill_length_mm == min(penguins$bill_length_mm, na.rm = TRUE))
smallest_bill <- penguins[row_number, ]
smallest_bill9.5 Saving data to new columns
Lets say you wanted to find the ratio between two columns of data, then save your answer as a new column
9.6 Sorting Data
Sorting the data allows us to find the highest and lowest values in a dataset.
Sort by body mass (ascending):
9.7 Combining Everything
We can combine multiple operations into a single pipeline. The tidyverse is designed for this.