This tutorial covers loads of summary statistics that might be useful.


General pointers

Here’s the structure of EVERY command in R.

output_variable  <- COMMAND( input_variable , other_options)
  • COMMAND is the function used to read the data
    • e.g.,read.csv for CSV files

  • input_variable" should be replaced with the case sensitive R variable you want to apply it to. Look at your environment tab to see what you have saved.

  • other_options includes (optional) additional options for your command
    • e.g. nrow specifies the number of rows within the read.csv() command

  • "output_variable" is the name of the variable you choose to assign your data to.
    • e.g x <- 5 assigns the number 5 to a variable called x
      Then x <- 5+x, will first calculate the answer (5+5), then OVERWRITE the x with the answer e.g x = 10



7. Summarising data

Here I will show a few examples for the a non spatial dataset on houseprices.

First I load the data here

data("HousesNY", package = "Stat2Data")



7.1. Looking at the data itself

To have a look at the data there are many options. You can:

  • click on its name in the environment tab
  • Type its name into the console or into a code chunk (e.g. for our table, type piratedataset into the console or a code chunk)
  • Run the command View(variable_name) (View is a command from the tidyverse package).
    This will open the data in a new tab. DON’T PUT THIS IN A CODE CHUNK.
  • Run the command head(variable_name) to see the first 6 lines or so (good for quick checks)
  • Run the command glimpse(variable_name) to get a nice summary.
  • Run the command names(variable_name) to get the column names.

DO NOT PUT View(dataname) into a code chunk (or remove it before you knit). It breaks R-studio sometimes


For example

# Note, there are sometimes more columns to the right, use the arrow to see
head(HousesNY)

To see what the column names are, you can use the names(dataset) command. I use this in the console A LOT for copy/pasting names into my code/report.

names(HousesNY)
## [1] "Price" "Beds"  "Baths" "Size"  "Lot"

Or the glimpse command:

glimpse(HousesNY)



7.2. Number of rows and columns

To find the number of rows and columns, these are useful. Or look at the environment tab, or some summaries include it

nrow(HousesNY)
ncol(HousesNY)
#or both dimensions
dim(HousesNY)

7.3. Finding the type of data

To see what type of data R thinks you have, try the class command

class(HousesNY)

or for a column

class(HousesNY$Price)



7.4. Single column statistics

Or you can do things manually, using the $ symbol to choose a column. All of this is for the price column

mean(HousesNY$Price)
median(HousesNY$Price)
mode(HousesNY$Price)
sd(HousesNY$Price)
var(HousesNY$Price)
IQR(HousesNY$Price)
range(HousesNY$Price)



7.5. Useful summaries

To look at the summaries there are a load of options. Choose your favourites:

  • summary(dataset)
  • skim(dataset) in the skimr package
  • summarize(dataset) in the papeR package. This looks pretty powerful, I’m just learning it

None are better or worse than others - simply choose what works for you in the moment.

summary(HousesNY)
library(skimr) # you would need to install this
skim(HousesNY)
library(pillar) # you would need to install this
glimpse(HousesNY)

or

str(HousesNY)

7.6 “Group_by” Statistics per group

What if you want to find more sophisticated statistics e.g. the avergae price per size of house.

Here we use the group_by() and summarise() commands and save our answers to a new variable.

We are making use of the pipe symbol, %>%, which takes the answer from group_by and sends it directly to the summarise command

Here is some data on frost dates at weather stations (i’ll update on house data later)

frost    <- readxl::read_excel("./Data/DataG364_frostday.xlsx")
head(frost)

To summarise results by the type of weather station:

frost.summary.type <- group_by(frost, by=Type_Fake) %>%
                          summarise(mean(Latitude),
                                    max(Latitude),
                                    min(Dist_to_Coast))
frost.summary.type

Here, my code is:

  • Splitting up the frost data by the Type_Fake column
    (e.g. one group for City, one for Airport and one for Agricultural Research)
  • For the data rows in each group, calculating the mean latitude, the maximum latitude and the minimum distance to the coast
  • Saving the result to a new variable called frost.summary.type.
  • Printing the results on the screen e.g. the furthest North/maximum latitude of rows tagged Agricultural_Research_Station is 36.32 degrees.