9 Data Filtering

9.1 Introduction & packages

This tutorial explains how to filter, select, and wrangle data in R using both Tidyverse and Base R approaches.

Both approaches are valid, but they differ in syntax and philosophy. Tidyverse functions (from dplyr) use a pipeline-based approach, whereas Base R relies on indexing and built-in functions.

# Load necessary packages
library(tidyverse)  # Core data wrangling
library(palmerpenguins) # Example dataset
library(skimr)  # Quick summary stats
library(GGally)  # Pair plots
library(ggplot2) # Visualization

We will use the penguins dataset from the palmerpenguins package to demonstrate data wrangling techniques.

data("penguins")  # Load dataset

9.2 Selecting COLUMNS

9.2.1 Selecting a specific column using $

Sometimes we want to deal with only one specific column in our dataset, for example applying the mean command to say just one column.

To do this, we use the $ symbol. For example, here I’m simply selecting the data in the elevation column only and saving it to a new variable called elevationdata.

mean_mass <- penguins$body_mass_g 

mean_mass

Try it yourself. You should have seen that as you typed the $, it gave you all the available column names to choose from. This means we can now easily summarise specific columns. For example:

summary(penguins) will create a summary of the whole spreadsheet,
summary(penguins$Price) will only summarise the Price column.
mean(penguins$Price) will take the mean of the Price column in the HousesNY dataframe.

9.2.2 Selecting multiple columns

Remember you have the names() command to help find these and that the names are CASE SENSITIVE. If we want to work with several specific columns (e.g., species, bill_length_mm, and flipper_length_mm), we can extract them like this:

Tidyverse Approach

selected_data <- penguins %>% dplyr::select(species, bill_length_mm, flipper_length_mm)
head(selected_data)

## # A tibble: 6 × 3
##   species bill_length_mm flipper_length_mm
##   <fct>            <dbl>             <int>
## 1 Adelie            39.1               181
## 2 Adelie            39.5               186
## 3 Adelie            40.3               195
## 4 Adelie            NA                  NA
## 5 Adelie            36.7               193
## 6 Adelie            39.3               190

Base R Approach

selected_data <- penguins[, c("species", "bill_length_mm", "flipper_length_mm")]
head(selected_data)

## # A tibble: 6 × 3
##   species bill_length_mm flipper_length_mm
##   <fct>            <dbl>             <int>
## 1 Adelie            39.1               181
## 2 Adelie            39.5               186
## 3 Adelie            40.3               195
## 4 Adelie            NA                  NA
## 5 Adelie            36.7               193
## 6 Adelie            39.3               190

9.3 SELECTING SPECIFIC CELLS

Sometimes, we do not want to analyse at the entire data.frame. Instead, we would like to only look at one (or more) columns or rows.

There are several ways we can select data.

To choose a specific column, we can use the $ symbol to select its name (as described above)
If you know which number rows or columns you want, you can use square brackets to numerically select data. Essentially our data follows the matrix format format:

\[ tablename [ ROWS , COLUMNS ] \]

Some examples:

# This will select the data in the 5th row and 7th column
penguins[5,7]

# This will select the 2nd row and ALL the columns 
penguins[2,]

# This will select the 3rd column and ALL the rows
penguins[,3]
# similar to using its name
penguins$island

# We can combine our commands, this will print the 13th row of the body mass column 
# (no comma as we're only looking at one column)
penguins$body_mass_g[13] 

# The : symbol lets you choose a sequence of numbers e.g. 1:5 is 1 2 3 4 5
# So this prints out rows 11 to 15 and all the columns
penguins[11:15,]

# The "c" command allows you to enter whatever numbers you like.  
# So this will print out rows 4,3,7 and the "Elevation" and "Dist_to_Coast" columns
penguins[c(4,3,7), c("island","body_mass_g")]

9.3.1 Deleting data

Or if you know the row or column number you can use the minus - sign to remove. Or use filter..

# remove row 6 and and overwrite
penguins <- penguins[-6 ,]

# remove columns 4 and 2 and save result to newdata 
newdata <- penguins[, - c(2,4) ]

9.4 Selecting ROWS

Filtering means selecting only the rows that meet certain conditions.

which() helps find row numbers that match a condition.
filter() (from dplyr tidyverse) allows for clear, intuitive filtering of data.
Logical operators (&, |, !=, %in%) provide additional flexibility in specifying conditions.

9.4.1 Random sampling

Dplyr’s sample_slice command will do this for us. Here I randomly selected 4 rows.

penguinsample <-  dplyr::slice_sample(penguins, n=4)

#check it worked
glimpse(penguinsample)

9.4.2 Tidyverse/dplyr filter command

We can also use the tidyverse approach, the dplyr::filter() function. All the condition symbols e.g. !=NOT etc also work for the which command,

Selecting all Adelie penguins:

dplyr::filter(penguins, species == "Adelie")

Selecting multiple species using %in%:

dplyr::filter(penguins, species %in% c("Adelie", "Chinstrap"))

Excluding a species using !=:

dplyr::filter(penguins, species != "Gentoo")

Filtering penguins with a flipper length greater than 200 mm:

dplyr::filter(penguins, flipper_length_mm > 200)

Filtering penguins with a body mass less than or equal to 3000 g:

light_penguins <- dplyr::filter(penguins, body_mass_g <= 3000)
summary(light_penguins)

Filtering with multiple conditions:

dplyr::filter(penguins, flipper_length_mm > 180 & species == "Adelie")

Using | (OR condition) to select penguins with either a bill_depth_mm greater than 18 or flipper_length_mm longer than 210 mm:

9.4.3 BaseR `which()` command

The which() command helps us identify row indices that meet a condition.

For example, identifying numbers greater than 107 in a sequence:

a <- 100:110
which(a > 107)

I like it because you can read the command as a sentence e.g. which numbers in “a” are greater than 107, Or which penguins have a body_mass_g less than or equal to 3000g:

outlier_rows <- which(penguins$body_mass_g <= 3000)

To find the row corresponding to the penguin with the smallest bill length:

row_number <- which(penguins$bill_length_mm == min(penguins$bill_length_mm, na.rm = TRUE))
smallest_bill <- penguins[row_number, ]
smallest_bill

9.5 Saving data to new columns

Lets say you wanted to find the ratio between two columns of data, then save your answer as a new column

Base R Approach

penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm

Tidyverse Approach

penguins <- penguins %>% mutate(bill_ratio = bill_length_mm / bill_depth_mm)

9.6 Sorting Data

Sorting the data allows us to find the highest and lowest values in a dataset.

Sort by body mass (ascending):

Tidyverse Approach

penguins <- arrange(penguins, body_mass_g)

Sort by descending order:

penguins <- arrange(penguins, desc(flipper_length_mm))

Base R Approach

penguins <- penguins[order(penguins$body_mass_g), ]

penguins <- penguins[order(-penguins$flipper_length_mm), ]

9.7 Combining Everything

We can combine multiple operations into a single pipeline. The tidyverse is designed for this.

Tidyverse Approach

final_result <- penguins %>%
  filter(flipper_length_mm > 180) %>%
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) %>%
  select(species, bill_length_mm, flipper_length_mm, bill_ratio) %>%
  arrange(desc(bill_ratio))

Base R Approach

filtered <- penguins[penguins$flipper_length_mm > 180, ]
filtered$bill_ratio <- filtered$bill_length_mm / filtered$bill_depth_mm

filtered <- filtered[order(-filtered$bill_ratio),
                     c("species", "bill_length_mm", 
                       "flipper_length_mm", "bill_ratio")]
head(filtered)

8 Plots

10 Probability distributions

9 Data Filtering

9.1 Introduction & packages

9.2 Selecting COLUMNS

9.2.1 Selecting a specific column using $

9.2.2 Selecting multiple columns

Tidyverse Approach

Base R Approach

9.3 SELECTING SPECIFIC CELLS

9.3.1 Deleting data

9.4 Selecting ROWS

9.4.1 Random sampling

9.4.2 Tidyverse/dplyr filter command

9.4.3 BaseR which() command

9.5 Saving data to new columns

Base R Approach

Tidyverse Approach

9.6 Sorting Data

Tidyverse Approach

Base R Approach

9.7 Combining Everything

Tidyverse Approach

Base R Approach

9.4.3 BaseR `which()` command