This tutorial covers subsetting, filtering, removing missing values..
The na.omit command will remove any row with ANY missing value. Note I’m overwriting my variable with the smaller one. It’s up to you if you want to do this or make a new variable
If you want to only remove rows with missing values in a single column, we can use the complete cases command
E.g. take the frost table and ONLY include values where the longitude is not missing. Note, this can be any column name at all.
Sometimes, you have -999 or something as NA. You can apply a RULE using the filter command below or like this to make them actually NAs
E.g. here I select all the distance to coasts that are less than 0 (e.g impossible) and set them to NA.
Sometimes you don’t want all the columns. You can subset in a few ways.
For example, here are the frost data columns
## [1] "Station" "State" "Type_Fake"
## [4] "Avg_DOY_SpringFrost" "Latitude" "Longitude"
## [7] "Elevation" "Dist_to_Coast"
I can choose which column NUMBERS I want like this
# choose columns 1 6 and 2 in that order
frostnew <- frost[,c(1,6,2)]
#print the first 2 rows.
frostnew[1:2, ]
or the names like this
# choose columns 1 6 and 2 in that order
frostnew <- frost[,c("Dist_to_Coast","State","Latitude")]
#print the first 2 rows.
frostnew[1:2, ]
Good column names don’t have spaces or special characters. You can either rename in excel or using the names command e.g.
## [1] "Dist_to_Coast" "State" "Latitude"
## [1] "Latitude"
## Dist_to_Coast State hellloooooo
## Min. : 1.148 Length:76 Min. :28.80
## 1st Qu.: 59.186 Class :character 1st Qu.:32.76
## Median :128.784 Mode :character Median :34.23
## Mean :129.437 Mean :33.91
## 3rd Qu.:192.639 3rd Qu.:35.23
## Max. :298.914 Max. :37.33
Filtering means selecting rows/observations based on their values. To
filter in R, use the command filter()
from the dplyr
package. I tend to write it as dplyr:filter()
to force it
to be correct. Visually, filtering rows looks like this.
First, make sure that this library is in your library code chunk and that you have run it.
Let’s filter our frost data.
## [1] "Station" "State" "Type_Fake"
## [4] "Avg_DOY_SpringFrost" "Latitude" "Longitude"
## [7] "Elevation" "Dist_to_Coast"
Here we can apply the filter command to choose specific rows that meet certain criteria. There are a few filter commands out there, so I’m specifying I want the dplyr one.
The double equal operator ==
means equal to. The
command is telling R to keep the rows in frost where the
State column equals “FL”.
Note, there’s no <- so I’m printing the results out on the
screen. If you want to save your result, put
answer <- dplyr::filter(frost, State == "FL")
If you want a few categories, choose the %in% operator, using the
c()
command to stick together the categories you want. For
example, here are states in Florida and Virginia.
We can also explicitly exclude cases and keep everything else by
using the not equal operator !=
. The following code
excludes airport stations.
What about filtering if a row has a value greater than a specified value? For example, Stations with an elevation greater than 500 feet?
Or less-than-or-equal-to 200 feet.
# or save the result to a new variable
lowland_stations <- filter(frost, Elevation < 200)
summary(lowland_stations)
In addition to comparison operators, filtering may also utilize
logical operators that make multiple selections. There are three basic
logical operators: &
(and), |
(or), and
!
(not). We can keep Stations with an Elevation
greater than 300 and State in Alabama
&
.
Use |
to keep Stations with a Type_Fake of
“Airport” or a last spring frost date after April (~
day 90 of the year).
We use the arrange()
function to sort a data frame by
one or more variables. You might want to do this to get a sense of which
cases have the highest or lowest values in your data set or sort
counties by their name. For example, let’s sort in ascending order by
elevation.
By default, arrange()
sorts in ascending order. We can
sort by a variable in descending order by using the desc()
function on the variable we want to sort by. For example, to sort the
dataframe by Avg_DOY_SpringFrost in descending order we use