To see how many rows and columns there are, you can use the nrow() and ncol() commands:
nrow(HousesNY)
[1] 53
ncol(HousesNY)
[1] 5
Summary statistics
To look at the summaries there are a load of options. Choose your favourites:
summary(dataset)
skim(dataset) in the skimr package
summarize(dataset) in the papeR package. This looks pretty powerful, I’m just learning it
None are better or worse than others - simply choose what works for you in the moment.
summary(HousesNY)
Price           Beds            Baths           Size            Lot
Min.   : 38.5   Min.   :2.000   Min.   :1.000   Min.   :0.712   Min.   :0.0000
1st Qu.: 82.7   1st Qu.:3.000   1st Qu.:1.500   1st Qu.:1.296   1st Qu.:0.2700
Median :107.0   Median :3.000   Median :2.000   Median :1.528   Median :0.4200
Mean   :113.6   Mean   :3.396   Mean   :1.858   Mean   :1.678   Mean   :0.7985
3rd Qu.:141.0   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:2.060   3rd Qu.:1.1000
Max.   :197.5   Max.   :6.000   Max.   :3.500   Max.   :3.100   Max.   :3.5000
library(skimr) # you would need to install this
skim(HousesNY)
Data summary

Name: HousesNY
Number of rows: 53
Number of columns: 5
Column type frequency: numeric (5)
Group variables: None

Variable type: numeric

skim_variable n_missing complete_rate   mean    sd    p0   p25    p50    p75  p100 hist
Price                 0             1 113.63 41.43 38.50 82.70 107.00 141.00 197.5 ▃▇▅▃▃
Beds                  0             1   3.40  0.79  2.00  3.00   3.00   4.00   6.0 ▂▇▆▁▁
Baths                 0             1   1.86  0.65  1.00  1.50   2.00   2.00   3.5 ▅▇▁▁▁
Size                  0             1   1.68  0.60  0.71  1.30   1.53   2.06   3.1 ▃▇▅▂▂
Lot                   0             1   0.80  0.76  0.00  0.27   0.42   1.10   3.5 ▇▃▂▁▁
library(pillar) # you would need to install this
Attaching package: 'pillar'
The following object is masked from 'package:dplyr':
dim_desc
Or you can do things manually, using the $ symbol to choose a column. All of the examples below are for the Price column:
mean(HousesNY$Price)
[1] 113.6321
median(HousesNY$Price)
[1] 107
mode(HousesNY$Price) # careful - R's mode() returns the storage type, not the statistical mode
[1] "numeric"
sd(HousesNY$Price)
[1] 41.43006
var(HousesNY$Price)
[1] 1716.45
IQR(HousesNY$Price)
[1] 58.3
range(HousesNY$Price)
[1] 38.5 197.5
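If you want several of these statistics at once, one option (a sketch using a short made-up price vector, not the full HousesNY data) is to loop over a named list of functions with sapply():

```r
# Hypothetical stand-in for HousesNY$Price
price <- c(38.5, 82.7, 107.0, 141.0, 197.5)

# Apply each summary function to the same column in turn
sapply(list(mean = mean, median = median, sd = sd, IQR = IQR),
       function(f) f(price))
```

This returns a named vector, which is handy for a quick overview of one column.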
Missing data and NAs
Some datasets contain missing values - and by default, R will return a missing value (NA) for any statistic calculated on them.
# Create some test data
test <- data.frame(A = c(1,3,4), B = c(NA,3,1))
print(test)
A B
1 1 NA
2 3 3
3 4 1
# Take the mean of column B
mean(test$B)
[1] NA
# Take the correlation between A and B
cor(test$A, test$B)
[1] NA
To ignore them in a given command, try adding ,na.rm=TRUE to the command e.g.
mean(test$B, na.rm=TRUE)
[1] 2
or sometimes I look in the help file (or on Google) for slightly different syntax.
# the cor command doesn't follow this pattern
cor(test$A, test$B, use = "complete.obs")
[1] -1
To simply remove all rows with missing data, try the na.omit() command e.g.
test2 <- na.omit(test)
test2
A B
2 3 3
3 4 1
mean(test2$B)
[1] 2
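An equivalent manual approach (a sketch, reusing the same made-up test data) is to keep only the rows where B is not missing, using is.na():

```r
# Recreate the test data from above
test <- data.frame(A = c(1, 3, 4), B = c(NA, 3, 1))

# Keep only rows where B is NOT missing (! means "not")
test2 <- test[!is.na(test$B), ]
mean(test2$B)  # [1] 2
```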
Making frequency tables
Sometimes we want to see how many rows there are in different categories. The easiest way to do this is using the table command. For example, in our New York data, we can see how many houses there are with each number of beds using
table(HousesNY$Beds)
2 3 4 5 6
5 26 19 2 1
So there are 19 rows in our dataset where the Beds column says 4 (AKA 19 houses in our sample with 4 beds). Or we can look at a two-dimensional table, e.g. table(HousesNY$Beds, HousesNY$Baths) (more on this below).
To make these look more professional there are a number of packages you can install and use. For example, ztable will take the output of table and format it in a pretty way. This will look TERRIBLE when you run R as it’s making html code. But when you press knit it will look beautiful
# don't include the install line in your code, run it in the console
# install.packages("ztable")
library(ztable)
Welcome to package ztable ver 0.2.3
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
Here I am using the HousesNY data as an example. Sometimes we want to deal with only one specific column in our spreadsheet/dataframe, for example applying the mean/standard deviation/inter-quartile range commands to just the Price column.
To do this, we use the $ symbol. For example, here I’m simply selecting the data in the Price column only and saving it to a new variable called price.
data("HousesNY")
price <- HousesNY$Price
price
Try it yourself. You should have seen that as you typed the $, it gave you all the available column names to choose from.
This means we can now easily summarise specific columns. For example:
summary(HousesNY) will create a summary of the whole spreadsheet,
summary(HousesNY$Price) will only summarise the Price column.
mean(HousesNY$Price) will take the mean of the Price column in the HousesNY dataframe.
Table command: counts per group
Sometimes we want to count the occurrences of some category in our dataset. For example, if you look at the HousesNY data, it might be interesting to know how many houses had each number of bedrooms.
To do this, we use the table command:
table(HousesNY$Beds)
or to see the number with each combination of bedrooms and bathrooms:
table(HousesNY$Beds, HousesNY$Baths)
For more, this tutorial is excellent: https://www.cyclismo.org/tutorial/R/tables.html.
“Group_by” command: statistics per group
What if we want to do more than just count the number of rows?
Well, we can use the group_by() and summarise() commands and save our answers to a new variable.
Here we are making use of the pipe symbol, %>%, which takes the answer from group_by and sends it directly to the summarise command.
Here is some data on frost dates at weather stations.
# A tibble: 6 × 8
Station State Type_Fake Avg_DOY_SpringFrost Latitude Longitude Elevation
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Valley AL City 110. 34.6 -85.6 1020
2 Union AL City 82.3 32.0 -85.8 440
3 Saint AL Airport 99.8 34.2 -86.8 800
4 Fernandina FL City 46.9 30.7 -81.5 13
5 Lake FL City 60.6 30.2 -82.6 195
6 West GA City 85.6 32.9 -85.2 575
# ℹ 1 more variable: Dist_to_Coast <dbl>
To summarise results by the type of weather station:
# A tibble: 3 × 4
by `mean(Latitude)` `max(Latitude)` `min(Dist_to_Coast)`
<chr> <dbl> <dbl> <dbl>
1 Agricultural_Research_S… 33.7 36.3 4.95
2 Airport 34.4 37.3 45.4
3 City 33.7 36.5 1.15
Here, my code is:
Splitting up the frost data by the Type_Fake column (e.g. one group for City, one for Airport and one for Agricultural Research)
For the data rows in each group, calculating the mean latitude, the maximum latitude and the minimum distance to the coast
Saving the result to a new variable called frost.summary.type.
Printing the results on the screen e.g. the furthest North/maximum latitude of rows tagged Agricultural_Research_Station is 36.32 degrees.
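The steps above can be sketched as follows, using dplyr and a small made-up stand-in (frost_toy) for the frost data - the output shown earlier comes from the full dataset, so the numbers here will differ:

```r
library(dplyr)

# A small made-up stand-in for the frost data
frost_toy <- data.frame(
  Station       = c("Valley", "Union", "Saint", "Winthrop"),
  Type_Fake     = c("City", "City", "Airport", "Agricultural_Research_Station"),
  Latitude      = c(34.6, 32.0, 34.2, 34.9),
  Dist_to_Coast = c(200, 150, 45.4, 4.95))

# Split the rows up by Type_Fake, then summarise each group
frost.summary.type <- frost_toy %>%
  group_by(Type_Fake) %>%
  summarise(mean(Latitude), max(Latitude), min(Dist_to_Coast))

# Print the result
frost.summary.type
```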
Filtering rows and columns
Sometimes, we do not want to analyse the entire data.frame. Instead, we would like to only look at one (or more) columns or rows.
There are several ways we can select data.
To choose a specific column, we can use the $ symbol to select its name (as described above)
If you know which number rows or columns you want, you can use square brackets to numerically select data.
Essentially our data follows the format: tablename[ROWS,COLUMNS]
# This will select the data in the 5th row and 7th column
frost[5,7]
# A tibble: 1 × 1
Elevation
<dbl>
1 195
# This will select the 2nd row and ALL the columns
frost[2,]
# A tibble: 1 × 8
Station State Type_Fake Avg_DOY_SpringFrost Latitude Longitude Elevation
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Union AL City 82.3 32.0 -85.8 440
# ℹ 1 more variable: Dist_to_Coast <dbl>
# This will select the 3rd column and ALL the rows
frost[,3]
# A tibble: 76 × 1
Type_Fake
<chr>
1 City
2 City
3 Airport
4 City
5 City
6 City
7 City
8 City
9 Agricultural_Research_Station
10 Agricultural_Research_Station
# ℹ 66 more rows
# We can combine our commands, this will print the 13th row of the Longitude column
# (no comma as we're only looking at one column)
frost$Longitude[13]
[1] -82.58
# The : symbol lets you choose a sequence of numbers e.g. 1:5 is 1 2 3 4 5
# So this prints out rows 11 to 15 and all the columns
frost[11:15,]
# A tibble: 5 × 8
Station State Type_Fake Avg_DOY_SpringFrost Latitude Longitude Elevation
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Winthrop SC Agricultural_… 87.2 34.9 -81.0 690
2 Little SC Airport 87.7 34.2 -81.4 711
3 Calhoun SC Airport 91.5 34.1 -82.6 530
4 Clemson SC City 93.6 34.7 -82.8 824
5 De FL City 71.3 30.7 -86.1 245
# ℹ 1 more variable: Dist_to_Coast <dbl>
# The "c" command allows you to enter whatever numbers you like.
# So this will print out rows 4,3,7 and the "Elevation" and "Dist_to_Coast" columns
frost[c(4,3,7), c("Elevation","Dist_to_Coast")]
You can also put questions and commands inside the square brackets. For example, here is the weather station with the lowest elevation. You can see my command chose BOTH rows where elevation = 10.
# which row has the lowest elevation
# note the double == (more below)
row_number <- which(frost$Elevation == min(frost$Elevation))
# choose that row
lowest_station <- frost[row_number, ]
lowest_station
# A tibble: 2 × 8
Station State Type_Fake Avg_DOY_SpringFrost Latitude Longitude Elevation
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Charlestown SC Agricultur… 84.6 32.8 -79.9 10
2 Edenton NC City 85.3 36.0 -76.6 10
# ℹ 1 more variable: Dist_to_Coast <dbl>
which command
The which command essentially says "which positions" meet a certain condition, e.g.
a <- 100:110
which(a > 107)
[1] 9 10 11
Or which rows:
outlier_rows <-which(frost$Dist_to_Coast <1.5)
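Those row numbers can then be used inside square brackets to pull out the matching rows. A sketch with made-up data (frost_toy is a hypothetical stand-in for frost):

```r
# Made-up stand-in for the frost data
frost_toy <- data.frame(
  Station       = c("A", "B", "C", "D"),
  Dist_to_Coast = c(0.5, 10, 1.2, 45))

# Which rows are within 1.5 miles of the coast?
outlier_rows <- which(frost_toy$Dist_to_Coast < 1.5)
outlier_rows  # the positions of the matching rows

# Use those positions to select the rows themselves
frost_toy[outlier_rows, ]
```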
Deleting rows
If you know the row or column number, you can use the minus sign - to remove it. Or just use filter (below).
# remove row 6 and overwrite frost
frost <- frost[-6, ]
# remove columns 2 and 4 and save the result to a new variable called newdata
newdata <- frost[, -c(2,4)]
The dplyr filter command (tidyverse)
Filtering means selecting rows/observations based on their values. To filter in R, use the command filter() from the dplyr package. I tend to write it as dplyr::filter() to make sure R uses the dplyr version (other packages also have a filter command).
Here we can apply the filter command to choose specific rows that meet certain criteria
filter(frost, State == "FL")
The double equal operator == means equal to. The command is telling R to keep the rows in frost where the State column equals “FL”.
If you want a few categories, choose the %in% operator, using the c() command to stick together the categories you want. For example, here are stations in Florida and Virginia.
filter(frost, State %in% c("FL","VA"))
We can also explicitly exclude cases and keep everything else by using the not equal operator !=. The following code excludes airport stations.
filter(frost, Type_Fake != "Airport")
What about filtering if a row has a value greater than a specified value? For example, Stations with an elevation greater than 500 feet?
filter(frost, Elevation > 500)
Or less-than-or-equal-to 200 feet.
# or save the result to a new variable
lowland_stations <- filter(frost, Elevation < 200)
summary(lowland_stations)
In addition to comparison operators, filtering may also use logical operators that make multiple selections. There are three basic logical operators: & (and), | (or), and ! (not). For example, we can keep Stations with an Elevation greater than 300 and a State of Alabama using &.
filter(frost, Elevation > 300 & State == "AL")
Use | to keep Stations with a Type_Fake of “Airport” or a last spring frost date after April (~ day 90 of the year).
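A sketch of that | filter, using dplyr and a made-up stand-in (frost_toy) for the frost data - the day-90 cut-off follows the text above:

```r
library(dplyr)

# Made-up stand-in for the frost data
frost_toy <- data.frame(
  Station             = c("A", "B", "C"),
  Type_Fake           = c("Airport", "City", "City"),
  Avg_DOY_SpringFrost = c(85, 110, 60))

# Keep rows that are airports OR have a last spring frost later than day 90
filter(frost_toy, Type_Fake == "Airport" | Avg_DOY_SpringFrost > 90)
```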
We use the arrange() function to sort a data frame by one or more variables. You might want to do this to get a sense of which cases have the highest or lowest values in your dataset, or to sort stations by their name. For example, let’s sort in ascending order by latitude.
arrange(frost, Latitude)
# A tibble: 75 × 8
Station State Type_Fake Avg_DOY_SpringFrost Latitude Longitude Elevation
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Inverness FL City 50.6 28.8 -82.3 40
2 Ocala FL City 52.7 29.2 -82.1 75
3 Lake FL City 60.6 30.2 -82.6 195
4 Tallahassee FL Agricultu… 75.8 30.4 -84.4 55
5 Fernandina FL City 46.9 30.7 -81.5 13
6 De FL City 71.3 30.7 -86.1 245
7 Quitman GA City 65.5 30.8 -83.6 185
8 Brunswick GA Agricultu… 48.4 31.2 -81.5 13
9 Waycross GA Agricultu… 75.9 31.2 -82.3 145
10 Tifton GA City 87.3 31.4 -83.5 380
# ℹ 65 more rows
# ℹ 1 more variable: Dist_to_Coast <dbl>
By default, arrange() sorts in ascending order. We can sort by a variable in descending order by using the desc() function on the variable we want to sort by. For example, to sort the dataframe by Avg_DOY_SpringFrost in descending order we use
arrange(frost, desc(Avg_DOY_SpringFrost))
# A tibble: 75 × 8
Station State Type_Fake Avg_DOY_SpringFrost Latitude Longitude Elevation
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Marshall NC Airport 118. 35.8 -82.7 2000
2 Highlands NC Agricult… 118. 35.0 -83.2 3333
3 Mt NC City 113. 36.5 -80.6 1041
4 Louisburg NC City 113. 36.1 -78.3 260
5 Rocky VA Airport 111. 37.0 -79.9 1315
6 Henderson NC Agricult… 111. 36.3 -78.4 512
7 Farmville VA Airport 111. 37.3 -78.4 450
8 Statesville NC City 110. 35.8 -80.9 951
9 Valley AL City 110. 34.6 -85.6 1020
10 Hendersonvi… NC Agricult… 110. 35.3 -82.4 2160
# ℹ 65 more rows
# ℹ 1 more variable: Dist_to_Coast <dbl>