This tutorial covers subsetting, filtering, removing missing values..

6.3 Removing missing values

6.3.1 na.omit

The na.omit command will remove any row with ANY missing value. Note I’m overwriting my variable with the smaller one. It’s up to you if you want to do this or make a new variable

frost <- na.omit(frost)


6.3.2 Missing values in single columns

If you want to only remove rows with missing values in a single column, we can use the complete cases command

frost <- frost[complete.cases(frost$Latitude), ]
frost <- frost[complete.cases(frost$Longitude), ]

E.g. take the frost table and ONLY include values where the longitude is not missing. Note, this can be any column name at all.



Turning values into NA

Sometimes, you have -999 or something as NA. You can apply a RULE using the filter command below or like this to make them actually NAs

frost$Dist_to_Coast[frost$Dist_to_Coast < -0]   <- NA

E.g. here I select all the distance to coasts that are less than 0 (e.g impossible) and set them to NA.


6.2 Choosing columns

Sometimes you don’t want all the columns. You can subset in a few ways.

For example, here are the frost data columns

names(frost)
## [1] "Station"             "State"               "Type_Fake"          
## [4] "Avg_DOY_SpringFrost" "Latitude"            "Longitude"          
## [7] "Elevation"           "Dist_to_Coast"

I can choose which column NUMBERS I want like this

# choose columns 1 6 and 2 in that order
frostnew <- frost[,c(1,6,2)]

#print the first 2 rows.
frostnew[1:2, ]

or the names like this

# choose columns 1 6 and 2 in that order
frostnew <- frost[,c("Dist_to_Coast","State","Latitude")]

#print the first 2 rows.
frostnew[1:2, ]



6.3 Renaming columns

Good column names don’t have spaces or special characters. You can either rename in excel or using the names command e.g.

names(frostnew)
## [1] "Dist_to_Coast" "State"         "Latitude"
names(frostnew)[3]
## [1] "Latitude"
names(frostnew)[3] <- "hellloooooo"
summary(frostnew)
##  Dist_to_Coast        State            hellloooooo   
##  Min.   :  1.148   Length:76          Min.   :28.80  
##  1st Qu.: 59.186   Class :character   1st Qu.:32.76  
##  Median :128.784   Mode  :character   Median :34.23  
##  Mean   :129.437                      Mean   :33.91  
##  3rd Qu.:192.639                      3rd Qu.:35.23  
##  Max.   :298.914                      Max.   :37.33



6.4 Choosing rows/filtering

6.4A The dplyr filter command (tidyverse)

Filtering means selecting rows/observations based on their values. To filter in R, use the command filter() from the dplyr package. I tend to write it as dplyr:filter() to force it to be correct. Visually, filtering rows looks like this.

First, make sure that this library is in your library code chunk and that you have run it.

library(tidyverse)

Let’s filter our frost data.

frost <- read_excel("./Data/DataG364_frostday.xlsx")
names(frost)
## [1] "Station"             "State"               "Type_Fake"          
## [4] "Avg_DOY_SpringFrost" "Latitude"            "Longitude"          
## [7] "Elevation"           "Dist_to_Coast"
head(frost)

Here we can apply the filter command to choose specific rows that meet certain criteria. There are a few filter commands out there, so I’m specifying I want the dplyr one.

dplyr::filter(frost, State == "FL")
  • The double equal operator == means equal to. The command is telling R to keep the rows in frost where the State column equals “FL”.

  • Note, there’s no <- so I’m printing the results out on the screen. If you want to save your result, put answer <- dplyr::filter(frost, State == "FL")



Several categories

If you want a few categories, choose the %in% operator, using the c() command to stick together the categories you want. For example, here are states in Florida and Virginia.

filter(frost, State %in% c("FL","VA"))

We can also explicitly exclude cases and keep everything else by using the not equal operator !=. The following code excludes airport stations.

filter(frost, Type_Fake != "Airport")

Greater or lesser

What about filtering if a row has a value greater than a specified value? For example, Stations with an elevation greater than 500 feet?

filter(frost, Elevation > 500)

Or less-than-or-equal-to 200 feet.

# or save the result to a new variable
lowland_stations <- filter(frost, Elevation < 200)
summary(lowland_stations)


Complex commands

In addition to comparison operators, filtering may also utilize logical operators that make multiple selections. There are three basic logical operators: & (and), | (or), and ! (not). We can keep Stations with an Elevation greater than 300 and State in Alabama &.

filter(frost, Elevation > 300 & State == "AL")

Use | to keep Stations with a Type_Fake of “Airport” or a last spring frost date after April (~ day 90 of the year).

airportwarm <- filter(frost, 
                      Type_Fake == "Airport" | Avg_DOY_SpringFrost > 90 )

airportwarm

6.5 Sorting data

6.5A The dplyr arrange command (tidyverse)


We use the arrange() function to sort a data frame by one or more variables. You might want to do this to get a sense of which cases have the highest or lowest values in your data set or sort counties by their name. For example, let’s sort in ascending order by elevation.

arrange(frost, Latitude)

By default, arrange() sorts in ascending order. We can sort by a variable in descending order by using the desc() function on the variable we want to sort by. For example, to sort the dataframe by Avg_DOY_SpringFrost in descending order we use

arrange(frost, desc(Avg_DOY_SpringFrost))