This tutorial covers loads of summary statistics that might be useful.

General pointers

Here’s the general structure of almost every command in R.

output_variable  <- COMMAND( input_variable , other_options)

"COMMAND" is the function you want to run
- e.g. read.csv reads in CSV files
"input_variable" should be replaced with the case-sensitive name of the R variable you want to apply the command to. Look at your environment tab to see what you have saved.
"other_options" covers any (optional) additional options for your command
- e.g. nrows sets the maximum number of rows to read within the read.csv() command
"output_variable" is the name of the variable you choose to assign your result to.
- e.g. x <- 5 assigns the number 5 to a variable called x
  Then x <- 5+x will first calculate the answer (5+5), then OVERWRITE x with the answer, so x is now 10
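
The pattern above can be tried out in a few lines. This is only a minimal sketch: the temporary file and the names example_file and first_rows are made up for illustration, and nrows is the read.csv() option that limits how many rows are read.

```r
# Assign the number 5 to a variable called x
x <- 5

# Calculate 5 + x first, then OVERWRITE x with the answer
x <- 5 + x
print(x)   # x is now 10

# output_variable <- COMMAND(input_variable, other_options)
# Write a tiny, made-up CSV file to a temporary location so there is something to read
example_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = c(2, 4, 6, 8, 10)),
          example_file, row.names = FALSE)

# Read only the first 2 rows of data from that file
first_rows <- read.csv(example_file, nrows = 2)
print(first_rows)
```

Here first_rows is the output variable, read.csv is the command, example_file is the input and nrows = 2 is the extra option.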

7. Summarising data

Here I will show a few examples for a non-spatial dataset on house prices.

First, I load the data:

data("HousesNY", package = "Stat2Data")

7.1. Looking at the data itself

To have a look at the data there are many options. You can:

- Click on its name in the environment tab.
- Type its name into the console or into a code chunk (e.g. for our table, type HousesNY into the console or a code chunk).
- Run the command View(variable_name) (View is built into R; in RStudio it opens a spreadsheet-style viewer). This will open the data in a new tab. DON’T PUT THIS IN A CODE CHUNK.
- Run the command head(variable_name) to see the first 6 lines (good for quick checks).
- Run the command glimpse(variable_name) to get a nice summary.
- Run the command names(variable_name) to get the column names.

DO NOT PUT View(dataname) into a code chunk (or remove it before you knit). It sometimes breaks RStudio.

For example

# Note, there are sometimes more columns to the right, use the arrow to see
head(HousesNY)

To see what the column names are, you can use the names(dataset) command. I use this in the console A LOT for copy/pasting names into my code/report.

names(HousesNY)

## [1] "Price" "Beds"  "Baths" "Size"  "Lot"

Or the glimpse command (it comes from the dplyr package, which you would need to install):

library(dplyr)
glimpse(HousesNY)

7.2. Number of rows and columns

To find the number of rows and columns, use the commands below. You can also look at the environment tab, and some of the summaries later on include the dimensions too.

nrow(HousesNY)
ncol(HousesNY)

# or both dimensions (rows first, then columns)
dim(HousesNY)

7.3. Finding the type of data

To see what type of data R thinks you have, try the class command

class(HousesNY)

or for a column

class(HousesNY$Price)
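
To check every column in one go rather than one at a time, something like the sketch below should work. The table example_table and its values are made up for illustration; sapply() simply applies class() to each column in turn.

```r
# A small, made-up table with three different kinds of column
example_table <- data.frame(
  Price = c(57.6, 120.0, 150.0),       # decimal numbers
  Beds  = c(3L, 4L, 4L),               # whole numbers (the L marks an integer)
  Town  = c("TownA", "TownB", "TownC"),  # text
  stringsAsFactors = FALSE
)

class(example_table)        # the whole table
class(example_table$Price)  # one column

# The class of every column at once
column_types <- sapply(example_table, class)
print(column_types)
```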

7.4. Single column statistics

You can also calculate statistics one at a time, using the $ symbol to choose a column. All of these are for the Price column.

mean(HousesNY$Price)
median(HousesNY$Price)
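# mode() reports how the values are stored ("numeric"), NOT the most common value.
# For the most common value, count each value and pick the one with the biggest count:
names(which.max(table(HousesNY$Price)))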
sd(HousesNY$Price)
var(HousesNY$Price)
IQR(HousesNY$Price)
range(HousesNY$Price)
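
To see what each of these commands returns, it can help to run them on a handful of numbers you can check by hand. The vector prices below is made up for illustration. Two things to watch for: range() gives the minimum and maximum rather than a single number, and R's built-in mode() describes how a value is stored, so the most common value is found here by counting with table() instead.

```r
# Five made-up prices, chosen so the answers are easy to check by hand
prices <- c(100, 120, 120, 150, 200)

mean(prices)         # 138
median(prices)       # 120
range(prices)        # 100 200 (the minimum and the maximum)
diff(range(prices))  # 100 (maximum minus minimum)
IQR(prices)          # 30

# Most common value: count how often each value appears,
# then take the value with the biggest count
most_common <- as.numeric(names(which.max(table(prices))))
most_common          # 120
```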

7.5. Useful summaries

To look at the summaries there are a load of options. Choose your favourites:

- summary(dataset)
- skim(dataset) in the skimr package
- summarize(dataset) in the papeR package. This looks pretty powerful, I’m just learning it

None are better or worse than others - simply choose what works for you in the moment.

summary(HousesNY)

library(skimr) # you would need to install this
skim(HousesNY)

library(pillar) # you would need to install this
glimpse(HousesNY)

str(HousesNY)

7.6. “Group_by”: statistics per group

What if you want to find more sophisticated statistics, e.g. the average price per size of house?

Here we use the group_by() and summarise() commands and save our answers to a new variable.

We are making use of the pipe symbol, %>%, which takes the answer from group_by and sends it directly to the summarise command
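
As a quick illustration of what the pipe does, the sketch below calculates the same answer with and without it. It assumes the dplyr package is installed (loading dplyr makes %>% available); the numbers and the names nested_answer and piped_answer are made up.

```r
library(dplyr)  # assumed to be installed; it provides the %>% pipe

# Without the pipe: read from the inside out
nested_answer <- sum(sqrt(c(1, 4, 9)))

# With the pipe: read from left to right,
# each answer is sent on as the input of the next command
piped_answer <- c(1, 4, 9) %>% sqrt() %>% sum()

nested_answer  # 6
piped_answer   # 6
```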

Here is some data on frost dates at weather stations (I’ll update this to use the house data later):

frost    <- readxl::read_excel("./Data/DataG364_frostday.xlsx")
head(frost)

To summarise results by the type of weather station:

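library(dplyr)  # provides group_by(), summarise() and the %>% pipe
# Name the grouping column directly; writing by=Type_Fake would create a new column called "by"
frost.summary.type <- group_by(frost, Type_Fake) %>%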
                          summarise(mean(Latitude),
                                    max(Latitude),
                                    min(Dist_to_Coast))
frost.summary.type

Here, my code is:

- Splitting up the frost data by the Type_Fake column (e.g. one group for City, one for Airport and one for Agricultural Research)
- For the data rows in each group, calculating the mean latitude, the maximum latitude and the minimum distance to the coast
- Saving the result to a new variable called frost.summary.type
- Printing the results on the screen, e.g. the furthest north (maximum latitude) of the rows tagged Agricultural_Research_Station is 36.32 degrees
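
If you don't have the frost spreadsheet to hand, the same steps can be tried on a tiny stand-in table. This is only a sketch: it assumes dplyr is installed, the table example_frost borrows the column names from the frost data but its values are made up, and the names given to the summary columns (mean_latitude and so on) are my own choice.

```r
library(dplyr)  # assumed to be installed

# Made-up stations: the column names match the frost data, the values do not
example_frost <- data.frame(
  Type_Fake     = c("City", "City", "Airport", "Airport"),
  Latitude      = c(34.0, 36.0, 33.0, 35.0),
  Dist_to_Coast = c(10, 50, 5, 80)
)

# Split into groups by Type_Fake, then calculate one row of statistics per group
example_summary <- example_frost %>%
  group_by(Type_Fake) %>%
  summarise(mean_latitude = mean(Latitude),
            max_latitude  = max(Latitude),
            min_dist      = min(Dist_to_Coast))

example_summary
```

Naming each statistic inside summarise() gives tidier column names in the result, which makes them easier to refer to later.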
