Chapter 1 Tutorials

1.1 Tutorial 1A: Installing R and R-studio

In this class, we would like you to download and install BOTH R and R-Studio onto your own computers. You should have already completed this step for Homework 1. If you have not, see instructions here: https://psu.instructure.com/courses/2115020/assignments/12682456

FOR WINDOWS USERS. Sometimes R asks you to download something called RTools. You can do so here: https://cran.r-project.org/bin/windows/Rtools/ Follow the instructions closely and ask if you need support.

1.1.1 Open R-studio

Figure 1.1: Icons of the R program and R-studio

Open R-studio (NOT R!). You should be greeted by three panels:

  • The interactive R console (entire left)
  • Environment/History (tabbed in upper right)
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)

If you click on the View/Panes/Pane Layout menu item, you can move these quadrants around. I tend to like the console to be top left and scripts to be top right, with the plots and environment on the bottom - but this is totally personal choice.

Figure 1.2: The basic R-studio screen when you log on

If you wish to learn more about what these windows do, have a look at this resource, from the Pirates Guide to R: https://bookdown.org/ndphillips/YaRrr/the-four-rstudio-windows.html

If you have used R before, you might see that there are variables and plots etc already loaded. It is always good to clear these before you start a new analysis. To do this, click the little broom symbol in your environment tab.
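
If you prefer typing, the console equivalent (standard base R, nothing course-specific) is:

# remove every variable currently in your environment (same as clicking the broom)
rm(list = ls())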

1.1.2 Change a few settings

R-studio wants to be helpful and will try to re-load exactly where you were in a project when you log back in. This can get really confusing, so we are going to turn this off.

ON A MAC: Click on the R-studio menu button on the top left of the screen, then click Preferences.

ON A PC: Click on Tools -> Global Options

Now:

  • UNCLICK “Restore most recently opened project at startup”
  • Set “Save workspace to .RData on exit” to Never

1.2 Tutorial 1B: Creating and loading R projects

1.2.1 What are projects?

One of the most useful features in RStudio is R-Projects. We are going to use R-Projects to store our lab data. An R project is a place to keep all your commands, data and output together in one place.

Before R projects, we had to worry about exactly where on your computer your data was saved. Now, we can force R-Studio to immediately look inside your lab folder - making life much easier.

Equally, projects allow you to have multiple copies of R-Studio open at once. This is useful because if, say, you have Lab 2 open but want to re-open Lab 1 to check some code, clicking the "Lab 1" project file will load a whole new R session with Lab 1 ready to edit.

It is really important that you stay organized in this course, for example making sure you know where you put your data and your labs.

To encourage this, choose a place that makes sense on your computer and create a folder called STAT-462. You can either do this outside R-Studio in your file explorer, or inside the "Files" tab in R-Studio. You should store all your 462 labs inside this folder.
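
If you prefer to stay inside R, a one-line sketch that does the same thing is below - note that dir.create() makes the folder inside whatever directory getwd() currently points to, so check that location first.

# create a STAT-462 folder inside the current working directory
dir.create("STAT-462")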

Now we will create our first project (also see the pic):

  • open R-Studio and click New Project,
  • then click “new directory”
  • then “new project”.
  • Name your project Lab 1, then browse to your newly created STAT-462 folder and press open.
  • Press OK and your project is created.
    • You will also see that a new sub-folder has been created in your STAT-462 folder called Lab 1
    • Inside that is your Project file. (with a .Rproj extension)

Figure 1.3: Instructions to create an R project

Your output should look something like this:

Figure 1.4: What you should see

Equally, R should now be “looking” inside your Lab 1 folder, making it easier to find your data and output your results. Try typing this into the console (INCLUDING THE EMPTY PARENTHESES/BRACKETS) and see if it prints out the location of Lab 1 on your computer. If not, talk to an instructor.

getwd()

In the future, every time you want to work on Lab 1, rather than open R-studio directly, double click the R project file inside Lab 1 and you will get back to your work.

If you’re having issues at this point or haven’t managed to get to this step, STOP! Ask an instructor for help.

Now you have opened your project, take a screenshot of your R-studio page. It should look like Figure 1.4, e.g. with at least R version 4.0.3, with the Lab 1 project and the project stored in your STAT-462 folder.

  • To take a screenshot on a Mac, press Command + Shift + 3. The screenshot will appear on your desktop
  • To take a screenshot on a PC, press Alt + PrtScn (this copies the active window to the clipboard; paste it into an image editor such as Paint and save it)

Rename the screenshot to "username_Lab1_Fig1" (for example, for me it would be hlg5155_Lab1_Fig1), then place it in your Lab 1 sub-folder inside STAT-462. This folder was created when you made the project.

You will need this later, so don’t skip this step.

1.3 Tutorial 1C: R coding basics

So now we FINALLY get to do some R-coding. First things first, watch the 5-minute video above for some pointers. We will also go through this below:

1.3.1 Using R as a calculator

Remember that the aim of R is to type in commands to get your computer to analyse data. The console (see Figure 1.2) is a space where you can type in those commands and it will directly print out the answer. You’re essentially talking to the computer.

The little “>” symbol in the console means that the computer is waiting for your command.

Let’s start by the simplest command possible. Try typing each of the following commands into your R console and pressing Enter as you work through this.

1+1
## [1] 2

Note that spacing does not matter: 1+1 will generate the same answer as 1 + 1.

When using R as a calculator, the order of operations is the same as you would have learned back in school, so use brackets to force a different order. For example,

3 + 5 * 2
## [1] 13

will give a different result to

(3 + 5) * 2
## [1] 16

1.3.2 Adding text

Now for text. Can you say hello world?

Figure 1.5: Your screen after running the project

Nope, there is an error! To make R understand text, it is important to use quote marks.

print("Hello World")
## [1] "Hello World"

1.3.3 Comparisons

We can also do comparisons in R - the answer comes back as the special values TRUE or FALSE (no quote marks; they are special values, not text).

Here we are asking R whether 1 is equal to 1.

# note two equals signs is read as "is equal to"
1 == 1  
## [1] TRUE

We could also have used

  • != “Not equal to”
  • < “Less than”
  • <= “Less than or equal to”
  • > “Greater than”
  • >= “Greater than or equal to”
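
For example, try a few of these in the console - any numbers will do, and R simply replies TRUE or FALSE:

1 != 2    # "not equal to", so this returns TRUE
5 < 3     # "less than", so this returns FALSE
10 >= 10  # "greater than or equal to", so this returns TRUE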

1.3.4 What if I press Enter too soon?

If you type in an incomplete command, R will wait for you to complete it. For example, if you type 1 + and press enter, R will know that you need to complete the command. So it will move onto the next line, but the > will have changed into a +, which means it's waiting for you to complete the command.

If you want to cancel a command you can simply hit the “Esc” key or press the little stop symbol and R studio will reset.

Pressing escape isn’t only useful for killing incomplete commands: you can also use it to tell R to stop running code (for example if it’s taking much longer than you expect), or to get rid of the code you’re currently writing.

1.3.5 Variables and assignment

It’s great to be able to do maths easily on the screen, but really we want to be able to save our results, or load in data so we can run more complex commands. In R, we can give our data a name, i.e. we save our data as a variable. So then, instead of typing the whole command, we can simply type the variable’s name and R will recall the answer.

To store data in a variable we use the assignment arrow <-, which is made up of a less-than sign and a dash. You can also use the equals sign, but it can cause complications later on. Try typing this command into the console:

x <- 1/50

Notice that pressing enter did not print a value onto your screen as it did earlier. Instead, we stored it for later in something called a variable, with the name ‘x’.

So our variable x is now associated with the value 0.02, or 1/50. You can print a variable on screen by typing its name, no quotes, or by using the print command. Try printing out your variable.

x
## [1] 0.02
# or

print(x)
## [1] 0.02

Look for the Environment tab in one of the panes of RStudio, and you will see that ‘x’ and its value have appeared. This ‘x’ variable can be used in place of a number in any calculation that expects a number. Try typing

log(x)
## [1] -3.912023

Notice also that variables can be reassigned:

x <- 100
print(x)
## [1] 100

x used to contain the value 0.02 and now it has the value 100.

Note, the letter x isn’t special in any way, it’s just a variable name. You can replace it with any word you like, as long as it contains no spaces and doesn’t begin with a number. Different people use different conventions for long variable names; these include:

  • periods.between.words.1
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.

Finally, R IS CASE SENSITIVE. X and x are different variables!

h <- 1
H <- 2

ans <- h+H
print(ans)
## [1] 3

1.3.5.1 Combining variables

As I showed above, you can now use multiple variables together in more complex commands. For example, try these commands:

#Take the variable x, add 1 then save it to a new variable called y
y <- x + 1 
# print the contents of y onto the screen
y
## [1] 101

Now you can see that there are two variables in your environment tab, x and y. Where y is the sum of the contents of x plus 1.

The way R works is that first it looks for the commands on the right of the arrow. It runs all of them, calculates the result, then saves that result with the name on the left of the arrow. It does not save the command itself, just the answer. For example, in this case, R has no idea that y was created using a sum, it just knows that it is equal to the number 101.

You can even use this to change your original variable. Try typing the code below into the console a few times and see what happens.

A short cut to do this is to type the commands the first time, then use the up-arrow on your keyboard to cycle back through previous commands you have typed

x <- x + 1 # notice how RStudio updates its description of x in the environment tab
x          # print the contents of "x" onto the screen
## [1] 101

Our variables don’t have to be numbers. They could refer to tables of data, or a spatial map, or any other complex thing. We will cover this more in future labs.
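
As a small taster (a minimal sketch - the variable names are made up), a variable can hold text or a whole set of numbers:

my.name <- "STAT-462"        # a piece of text (note the quote marks)
my.numbers <- c(2, 4, 6, 8)  # c() sticks several numbers together into one variable
my.numbers * 10              # calculations then apply to every value at once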

1.3.6 Functions

A command is simply an action you can take - like pressing the square root button on a calculator, followed by the number you wish to take the square root of. A command is always followed by parentheses ( ), inside which you put your arguments.

The power of R lies in its many thousands of these built in commands, or functions. In fact, we have already come across one - the print command. Some more examples include:

  • plot(x=1:10,y=1:10)
    • This will plot the numbers 1 to 10 against 1 to 10
  • x <- nchar("hello")
    • This will count the number of letters in the word “hello” (e.g. 5), then save it as a variable called x

Watch this short video to learn three important facts about functions:

Function basics

One example of a function is file.choose() (note how I put the parentheses in this lab book so you can see it is a command). This command will let you interactively select a file and print the address out onto the screen.

Try each of these out in your console for the file.choose() command, leaving the parentheses blank.

# Typing this into the console will print out the underlying code
file.choose 

# Typing it WITH parentheses will run the command.  
file.choose()

# Typing a ? in front will open the help file for that command
?file.choose

Sometimes we need to give the command some additional information. Anything we wish to tell the command should be included inside the parentheses (separated by commas). The command will literally only know about the stuff inside the parentheses.

sin(1) # trigonometry functions.  Apply the sine function to the number 1. 
## [1] 0.841471
log(10) # natural logarithm.  Take the natural logarithm of the number 10. 
## [1] 2.302585

The following command will plot the numbers 1 to 10 against the numbers 11 to 20, along with some axis labels. When you run this, the plot will show up in the Plots tab.

# plot the numbers 1 to 10 against the numbers 11 to 20
plot(1:10,11:20,col="dark blue", xlab="x values",ylab="STAT-462 is the best") 

If you are feeling lost, https://swcarpentry.github.io/r-novice-gapminder/01-rstudio-intro/ is a really good website which goes over a lot of this in more detail. A lot of this is based on their work.

1.4 Tutorial 1D: Packages

There are now several million commands/functions available for you to use in R. To make sure your computer doesn’t get overwhelmed, it doesn’t load all of these at once. In fact many need to be downloaded from the internet.

So we have

  • R: The programming language itself
  • Functions: Specific commands or actions written in the R language
  • Packages: Commands are grouped into bundles/apps called packages, which we download from the internet and load every time we need them.

A close analogy is your phone. There are millions of apps available from banking, to social media to camera filters. You don’t have every app in the world installed on your phone - and you don’t have every app you do download running at the same time.

This is the same for R:

  • You download and install packages from the internet that you might need. This can be done by clicking the install button in the Packages tab. Or you can use the install.packages() command. You only ever need to do this once.
  • To actually run the commands in the package you need to load/run them - just in the way you tap an app to start it. This can be done using the library() command.

Now we are going to download some packages from the internet and install them. You must be connected to the internet to make this happen! In the console, type the following commands, or click the “Install” button in the packages tab (next to plots) and find the package name. If it asks if you want to install dependencies, say yes.

If R gives you warnings about rtools, ignore them or follow the instructions at the top to install R-Tools.

# COPY/PASTE THESE INTO YOUR CONSOLE AND LET THEM RUN. 
#(These are hopefully all the packages for the course, so it's a one off download)

install.packages("tidyverse") # Lots of data processing commands
install.packages("knitr")     # Helps make good output files
install.packages("rmarkdown") # Helps make good output files
install.packages("lattice")   # Makes nice plots
install.packages("RColorBrewer") # Makes nice color-scales
install.packages("ISLR")      # contains a credit dataset
install.packages("yarrr")     # contains a toy dataset about pirates
install.packages("skimr")     # Summary statistics
install.packages("Stat2Data") # Regression specific commands
install.packages("olsrr")     # Regression specific commands
install.packages("nortest")   # Regression specific commands
install.packages("lmtest")    # Regression specific commands
install.packages("IMTest")    # Regression specific commands 
install.packages("MASS")      # Regression specific commands
install.packages("moderndive")# Regression specific commands
install.packages("corrplot")  # correlation plots
install.packages("ggpubr")    # Nice regression plots
install.packages("car")       # this one sometimes has problems, don't panic if you get errors

You will see a load of red text appear in the console as it tells you its very detailed status of how it’s downloading and installing. Don’t panic! It might also take several minutes to do this, longer on a bad internet connection. We are doing this as a one off at the start of the course.

When you have run all the commands and waited until they have finished running (remember, when it is done, you should see the little “>” symbol in the console waiting for your next command), we want to check if they have installed successfully onto your computer.

To do this we are going to load them using the library command:

library("tidyverse") # Lots of data processing commands
library("knitr")     # Helps make good output files
library("rmarkdown") # Helps make good output files
library("lattice")   # Makes nice plots
library("RColorBrewer") # Makes nice color-scales
library("ISLR")      # contains a credit dataset
library("yarrr")     # contains a toy dataset about pirates
library("skimr")     # Summary statistics
library("Stat2Data") # Regression specific commands
library("olsrr")     # Regression specific commands
library("nortest")   # Regression specific commands
library("lmtest")    # Regression specific commands
library("IMTest")    # Regression specific commands 
library("MASS")      # Regression specific commands
library("moderndive")# Regression specific commands
library("corrplot")  # correlation plots
library("ggpubr")    # Nice regression plots
library("car")       # this one sometimes has problems, don't panic if you get errors

If you have managed to install them successfully, often nothing happens - this is great! It means it loaded the package without errors.

Sometimes, it will show you friendly messages. For example, this is what shows up when you load the tidyverse package. It is telling you the sub-packages that it loaded and also that some commands, like filter, now have a different meaning. E.g. originally the filter command did one thing, but now the tidyverse package has made filter do something else.

Figure 1.6: Tidyverse loading messages

To find out if what you are seeing is a friendly message or an error, run the command again. If you run it a second time and there is no error then nothing should happen.

! IMPORTANT If you see some red errors here after multiple attempts running the commands, we will have to fix what your computer is doing together. If you see errors, then take a screenshot of the full page and talk to a TA or Dr Greatrex, or post on the Lab 1 discussion.

Note - you don’t need to submit anything for lab challenge 2 - it’s here to make the rest of the semester run smoothly. Your only action is to reach out if there are errors.

1.5 Tutorial 1E: R Markdown basics

1.5.1 What is R-markdown?

So far, we’ve been typing commands into the console, but these will all be lost once you close R. Equally, it’s hard to remember or reproduce the analysis you have done. So we will now move onto writing code in documents that you can save and submit.

There are several types of document that you can create and save in R-Studio.

  • A basic script (the filetype is .r). This is simply just a blank document where you can save code commands. When you “run” the commands in the script, R simply copy/pastes the commands over to the console.

  • An R-Notebook or R-Markdown document (the filetype is .Rmd). These are documents you can use to write a report with normal text/pictures, and also include both your R code and its output. You can turn these into reports, websites, blogs, presentations or applications. For example, these instructions were created using a markdown document.

In this course we are going to focus on the R-Markdown format and you are going to submit your labs as html files.

1.5.2 Creating a markdown document

Time to make your own. Go to the File menu on the top left, then click New File - R-Markdown. It will ask you to name and save your file. Call it STAT-462 Lab 1.

Figure 1.7: You should see TWO new files appear in your lab 1 folder

A file should appear on your screen - your first markdown script. Essentially, we have some space for text, some space for code, and a space at the top of the file where we can add information about themes/styles etc. Each file contains some friendly text to explain what is going on, which I have annotated here.

Figure 1.8: Your first markdown file, annotated

Code chunks are the grey areas and are essentially “mini consoles”. If you click on the little green arrow (a right-pointing triangle) at the top-right of a code chunk, R runs the code inside it - here the plot command - and the output appears directly underneath.

Note, the output no longer appears in the console. You can still run individual lines in the console by clicking on the line you want to run and pressing Ctrl-Enter (Windows) / Command-Enter (Mac).

Let’s try this. On line 11, delete plot(cars) and type 1+1. Now press the green arrow and the answer should appear directly under your code chunk.

Now click at the end of the script, around line 20 and press the green insert button (top right in the script menu icons). Insert a new R code chunk. Inside, type 1+2 and run it. So you can see that you can have text/pictures/videos etc, and code chunks interspersed throughout it.

Now press the “knit” button (right next to the save button).

It will first ask you to save. Save it as STAT462_Lab1_PSU.ID e.g. STAT462_Lab1_hlg5155. Then, this should make a new file pop-up, a html pretty version of your code and output. If you go to your lab 1 folder, you will see that this has appeared next to your .Rmd file.

1.5.3 “Friendly text”

Much of what you see on your screen when you open a notebook document is simply a friendly introduction to RStudio.

So your notebook file is essentially a Word document with the ability to add in “mini R-consoles”, AKA your code chunks. Imagine every time you opened a new word document, it included some “friendly text” (“hi, this is a word document, this is how you do bold text, this is how you save”). This is great the first time you ever use Word, but kind of a pain after that.

RStudio actually does this. Every time you open a notebook file, you actually do get this friendly text, explaining how to use a notebook file. Read it, make sure you understand what it is telling you, then delete all of it. So delete from line 6 to the end. The stuff left over is your YAML code which tells R how to create the final html file. DON’T TOUCH THAT.

Figure 1.9: The YAML code left at the top of your markdown file

1.6 Tutorial 2A: Getting help in R

There are going to be many occasions when either you are not sure what to do in R or something is not working. Here are some of the places I get help:

1.6.1 Help on a specific command

You can type ?function_name into the console, where function_name is the CASE SENSITIVE name of the command. For example, you can find help on the boxplot command by typing ?boxplot or help(boxplot).

  • The top part of each help file will show you a list of the parameters that you can use to modify the command. For example in the boxplot help file you can see that the col parameter can be used to change the boxplot colour.

  • The middle part of the help file contains details about the command for example the background or further reading

  • The last part of the help file contains worked examples that you can literally copy/paste into the console to see how they work.

If you’re sure you typed the name correctly but R can’t find the help-file, then you probably first need to load the package containing the command. For example ?ggqqplot will not work before you use library("ggpubr") to load the ggpubr package.

1.6.2 Finding a command name

Let’s say that you want to run a Shapiro-Wilk test but you cannot remember what the command is called. In the console try typing ??"search_text" or help.search("search_text") e.g. try ??"shapiro" or help.search("shapiro").

Note, there are now quote marks around the text you are searching for. To seek help on special operators, use quotes or back-ticks e.g. ?"<-".

help.search("shapiro") will search through every R help file on your computer that includes the word “shapiro”.

You can see that there are three commands I can choose from (I might have more than you here). For each command you can see the package containing the command, followed by :: then the command name. Clicking on it brings up the help file.

Remember, to actually use any command in R, you first need to load the relevant package. For example, if I wanted to use the nortest version of sf.test, then I would first need to include the command library(nortest) in order for it to be available.
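
As a minimal sketch (assuming you installed nortest in Tutorial 1D; sf.test is the Shapiro-Francia normality test from that package):

library(nortest)     # load the package so its commands become available
sf.test(rnorm(100))  # run sf.test on 100 random normally-distributed values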

1.6.3 Asking the internet:

Figure 1.10: This may be a cartoon but it’s genuinely how I debug my code!

If your code isn’t working, Google is a great tool and you have full permission to use it to debug your code - I guarantee someone will have seen your error message before. I often literally copy/paste an error message into Google and see what comes up, or search for things like “how to add an image R markdown”.

One good place to search is Stack Overflow. To go directly to R-related topics, visit http://stackoverflow.com/questions/tagged/r. Here’s how to ask a good question: https://stackoverflow.com/help/how-to-ask

When asking for help, typing the command sessionInfo() into the console will print out the version of R and the packages you have loaded. This can be useful for debugging any issues.

You also always have lab sessions to ask questions or the discussion boards at any time.

1.6.4 General help & cheatsheets

Within R, try running help.start() in the console (remember to include the empty parentheses or R won’t realise it’s a command). This brings up all the manuals R has available on your computer. An introduction to R is particularly useful. Click on it and have a look around.

There are also some really good cheat-sheets out there:

1.7 Tutorial 2B: Markdown formats

The aim of this tutorial is to make your markdown reports more professional and easy to read. In future weeks we will continue to expand this work so that by the end of the course, you can make highly professional interactive reports. For example, click here for all the types of document you could easily turn your report into.

1.7.1 Editing YAML code

This is the code at the top of your file (Figure 1.9). It controls how your final output looks and which type of output it becomes. When you create a markdown file, your code will look like one of these two options (one created using a notebook and one using markdown - either is fine).

Figure 1.11: One created using a notebook and one using a markdown

No matter what you have, it is useful to change the YAML code so that it looks like this, but with your name & Lab 2.

Note, if you copy/paste this and it doesn’t work, sometimes the quote marks copy weirdly from the internet - try deleting and retyping the quotes.

---
title: "STAT-462: Lab 2"
author: "Helen Greatrex"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: yes
    number_sections: yes
    theme: lumen
    df_print: paged
  html_notebook:
    toc: true
    toc_float: yes
    number_sections: yes
    theme: lumen    
---

Now save your file and click “Knit” (or “Preview”) at the top of the script. See if it works.

There is a reasonable chance this won’t work first time around, as editing the YAML code can be a pain. It is very case and space sensitive. For example, the spaces at the start of some lines are important and are created using the TAB KEY, not the space bar. There is one TAB key before html_notebook (which is now on a new line). There are two TAB KEYS before toc, toc_float, number_sections and theme.

Don’t continue until you can make and view your html or nb.html file. If it doesn’t work, ask for help before moving on

The elements we just added are:

  • The title
  • The author
  • Today’s date
  • A floating table of contents and numbered sections (this won’t appear until you start typing section headings)
  • The document is now in the lumen theme. You can choose other themes for markdown documents here. https://www.datadreaming.org/post/r-markdown-theme-gallery/

There are many more details about different options you can add here: https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf

If you want to get really fancy, there is an interesting package to help you design YAML code here: https://education.rstudio.com/blog/2019/10/tools-for-teaching-yaml-with-ymlthis/

1.7.2 Text formats

R-Markdown uses text formatting a bit like LaTeX, rather than Word where you format the text as you write. First off - there are many cheat-sheets out there, but I particularly like this one https://www.markdownguide.org/basic-syntax/.

One important way to make your R Markdown documents look more professional is to add in headings, sub-headings and text formats. Today we will look at headings and bold/italic text.

Figure 1.12: A typical Rmd file (right) and its output (left) - note you need to click on the contents bar to expand it

First, go to the help menu at the very top of your screen (might be hidden) and click Markdown quick reference. This will bring up a cheat-sheet.

Figure 1.13: The top menu might be hidden

Paragraphs and white space - READ THIS:

R-Markdown is very sensitive to blank lines. Put them everywhere. Put a blank line between paragraphs, before/after headings, before lists, before/after code chunks…. If your formatting isn’t working well, chances are there isn’t a blank line before or after it.

Headings:

Inside a code chunk, the # symbol allows you to write mini comments that help explain the command e.g. 

# calculate 1+1 and 2+2
1+1
## [1] 2
2+2
## [1] 4

Outside a code chunk, the # symbol makes a heading, AS LONG AS YOU HAVE A BLANK LINE ABOVE AND BELOW. Have a look at Figure 1.12 for an example. Including headings this way allows automatic section numbering and will allow the table of contents to be created automatically. In the script itself the headings won’t be styled - you will only see them properly when you press knit.

Bold and italic:

Putting stars or _ around your text will make it bold or italic (or both). Again have a look at Figure 1.12 and the cheat-sheet. Interestingly (and annoyingly), there is no simple way to underline text. In the script itself the text formatting won’t show up; you will only see it when you press knit.
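
For example, typing something like this in the text part of your .Rmd file (NOT inside a code chunk), with blank lines around each piece:

# My main heading

Some normal text with **bold words**, *italic words* and ***bold italic words***.

## A smaller sub-heading

When you press knit, the # lines become (numbered) headings in your table of contents and the starred text becomes bold/italic.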

1.8 Tutorial 2C: Selecting & summarising data

1.8.1 Data.frame introduction

Most of the data we will look at is in “data.frame” format. This is a table, just like an excel spreadsheet, with one row for each observation and one column for each variable. Each column has a column name.

Next week we will work on loading our own data into R. This week we will focus on in-built R datasets.

Let’s choose one now. I’m going to work with the pirates dataset from the yarrr package. We can load the data like this:

library(yarrr)
library(tidyverse)
?pirates
dataset <- pirates

To have a look at the data there are many options. You can either:

  • click on its name in the environment tab
  • Type its name into the console or into a code chunk (e.g. for our table, type dataset into the console or a code chunk)
  • Run the command View(variable_name) (note the capital V; this opens the data in a spreadsheet-style viewer)
  • Run the command print(variable_name)
  • Run the command head(variable_name) to see the first 6 lines or so (good for quick checks)

For example

# Note, there are more columns to the right, use the arrow to see
head(dataset)
##   id    sex age height weight headband college tattoos tchests
## 1  1   male  28 173.11   70.5      yes   JSSFP       9       0
## 2  2   male  31 209.25  105.6      yes   JSSFP       9      11
## 3  3   male  26 169.95   77.1      yes    CCCC      10      10
## 4  4 female  31 144.29   58.5       no   JSSFP       2       0
## 5  5 female  41 157.85   58.4      yes   JSSFP       9       6
## 6  6   male  26 190.20   85.4      yes    CCCC       7      19
##   parrots favorite.pirate sword.type eyepatch sword.time
## 1       0    Jack Sparrow    cutlass        1       0.58
## 2       0    Jack Sparrow    cutlass        0       1.11
## 3       1    Jack Sparrow    cutlass        1       1.44
## 4       2    Jack Sparrow   scimitar        1      36.11
## 5       4            Hook    cutlass        1       0.11
## 6       0    Jack Sparrow    cutlass        1       0.59
##   beard.length           fav.pixar grogg
## 1           16      Monsters, Inc.    11
## 2           21              WALL-E     9
## 3           19          Inside Out     7
## 4            2          Inside Out     9
## 5            0          Inside Out    14
## 6           17 Monsters University     7

To see what the column names are, you can use the names(dataset) command

names(dataset)
##  [1] "id"              "sex"             "age"            
##  [4] "height"          "weight"          "headband"       
##  [7] "college"         "tattoos"         "tchests"        
## [10] "parrots"         "favorite.pirate" "sword.type"     
## [13] "eyepatch"        "sword.time"      "beard.length"   
## [16] "fav.pixar"       "grogg"

Or look at the summary

summary(dataset) 
##        id             sex                 age       
##  Min.   :   1.0   Length:1000        Min.   :11.00  
##  1st Qu.: 250.8   Class :character   1st Qu.:24.00  
##  Median : 500.5   Mode  :character   Median :27.00  
##  Mean   : 500.5                      Mean   :27.36  
##  3rd Qu.: 750.2                      3rd Qu.:31.00  
##  Max.   :1000.0                      Max.   :46.00  
##      height          weight         headband        
##  Min.   :129.8   Min.   : 33.00   Length:1000       
##  1st Qu.:161.4   1st Qu.: 62.08   Class :character  
##  Median :169.9   Median : 69.55   Mode  :character  
##  Mean   :170.2   Mean   : 69.74                     
##  3rd Qu.:178.5   3rd Qu.: 76.90                     
##  Max.   :209.2   Max.   :105.60                     
##    college             tattoos          tchests      
##  Length:1000        Min.   : 0.000   Min.   :  0.00  
##  Class :character   1st Qu.: 7.000   1st Qu.:  6.00  
##  Mode  :character   Median :10.000   Median : 15.00  
##                     Mean   : 9.429   Mean   : 22.69  
##                     3rd Qu.:12.000   3rd Qu.: 30.00  
##                     Max.   :19.000   Max.   :147.00  
##     parrots       favorite.pirate     sword.type       
##  Min.   : 0.000   Length:1000        Length:1000       
##  1st Qu.: 1.000   Class :character   Class :character  
##  Median : 2.000   Mode  :character   Mode  :character  
##  Mean   : 2.819                                        
##  3rd Qu.: 4.000                                        
##  Max.   :27.000                                        
##     eyepatch       sword.time        beard.length  
##  Min.   :0.000   Min.   :  0.0000   Min.   : 0.00  
##  1st Qu.:0.000   1st Qu.:  0.2175   1st Qu.: 0.00  
##  Median :1.000   Median :  0.5850   Median : 9.00  
##  Mean   :0.658   Mean   :  2.5427   Mean   :10.38  
##  3rd Qu.:1.000   3rd Qu.:  1.3300   3rd Qu.:20.00  
##  Max.   :1.000   Max.   :169.8800   Max.   :40.00  
##   fav.pixar             grogg      
##  Length:1000        Min.   : 0.00  
##  Class :character   1st Qu.: 8.00  
##  Mode  :character   Median :10.00  
##                     Mean   :10.14  
##                     3rd Qu.:12.00  
##                     Max.   :21.00
# You can also use the skimr package and skim command here
# skim(pirates)

To see how many columns and rows there are, you can use the nrow() and ncol() commands

# there are 1000
nrow(dataset)
## [1] 1000
ncol(dataset)
## [1] 17

1.8.2 Selecting individual values

You can select rows and columns using SQUARE brackets and [rows,columns]. For example

#select row 2 and columns 4,5 & 6 (that's what the : does, counts from 4 to 6)
dataset[2 , 4:6]
##   height weight headband
## 2 209.25  105.6      yes
#select rows 1,3 and 5 - plus the height and eyepatch columns
# the c() command lets you choose whatever values you want and sticks (concatenates) them together
dataset[c(1,3,5),c("height","eyepatch")]
##   height eyepatch
## 1 173.11        1
## 3 169.95        1
## 5 157.85        1
# Leave the right side empty for all the columns
# e.g this prints all columns of rows 2,3 and 4
dataset[2:4, ]
##   id    sex age height weight headband college tattoos tchests
## 2  2   male  31 209.25  105.6      yes   JSSFP       9      11
## 3  3   male  26 169.95   77.1      yes    CCCC      10      10
## 4  4 female  31 144.29   58.5       no   JSSFP       2       0
##   parrots favorite.pirate sword.type eyepatch sword.time
## 2       0    Jack Sparrow    cutlass        0       1.11
## 3       1    Jack Sparrow    cutlass        1       1.44
## 4       2    Jack Sparrow   scimitar        1      36.11
##   beard.length  fav.pixar grogg
## 2           21     WALL-E     9
## 3           19 Inside Out     7
## 4            2 Inside Out     9
# or all the values in a column - I leave the left side blank to select all rows
# Here I assigned the answer to a new variable called b, That's why it didn't print out.
b <- dataset[ ,c("college","tattoos")]

1.8.3 Selecting a single column

The easiest way to do this is with the $ symbol. LEARN THIS! It is incredibly useful

For example, to view the age column of the pirates table, I would type pirates$age - so to make a histogram of the pirates ages, I would type

hist(dataset$age,xlab="Pirates ages")

1.8.4 Selecting multiple columns

Sometimes your table is very large and you might only care about a few of the variables. So we could make a new table where we only keep the columns we care about. For example, as described in the R cheatsheet

Figure 1.14: picture from the tidyverse cheatsheet

Inside the tidyverse packages, there are some easy functions for selecting columns or rows. To choose specific columns, we can use the dplyr::select() function from the tidyverse package or square brackets. Whatever makes you more comfortable.

Let’s select only the “age”,“height”,“sex”,“weight”,“favorite.pirate” and “beard.length” columns:

# print the column names, so its easier to write the command
names(dataset)
##  [1] "id"              "sex"             "age"            
##  [4] "height"          "weight"          "headband"       
##  [7] "college"         "tattoos"         "tchests"        
## [10] "parrots"         "favorite.pirate" "sword.type"     
## [13] "eyepatch"        "sword.time"      "beard.length"   
## [16] "fav.pixar"       "grogg"
# The "traditional R way"
newdata <- dataset[,c("age","height","sex","weight","favorite.pirate",
                      "beard.length")]

# The "tidyverse way"
# Note, I'm telling it that I want select from the dplyr package because 
#        R wanted to choose a different "select" function
identicalnewdata <- dplyr::select(as_tibble(dataset),c(age,height,sex,weight,
                                                       favorite.pirate,beard.length))

# have a look at what we did
head(newdata)
##   age height    sex weight favorite.pirate beard.length
## 1  28 173.11   male   70.5    Jack Sparrow           16
## 2  31 209.25   male  105.6    Jack Sparrow           21
## 3  26 169.95   male   77.1    Jack Sparrow           19
## 4  31 144.29 female   58.5    Jack Sparrow            2
## 5  41 157.85 female   58.4            Hook            0
## 6  26 190.20   male   85.4    Jack Sparrow           17

1.8.5 Selecting subsets of rows/observations

Very commonly, we want to select only some rows in our table. For example, let’s say I want to look at the distribution of men’s beard lengths. This means that I only want to select the men in the pirates dataset.

Again there are two ways of doing this:

# e.g. select rows have the column sex set equal to male

# The "traditional R way". You can use either == or %in% for your selection
malepirates <- newdata[which(newdata$sex %in% "male"),]
malepirates <- newdata[which(newdata$sex  ==  "male"),]

# The "tidyverse way". 
malepirates <- dplyr::filter(newdata, sex == "male")

# have a look at what we did
head(malepirates)
##   age height  sex weight favorite.pirate beard.length
## 1  28 173.11 male   70.5    Jack Sparrow           16
## 2  31 209.25 male  105.6    Jack Sparrow           21
## 3  26 169.95 male   77.1    Jack Sparrow           19
## 4  26 190.20 male   85.4    Jack Sparrow           17
## 5  30 183.52 male   84.7    Jack Sparrow           25
## 6  20 163.65 male   70.0    Jack Sparrow           27

You can also use other symbols to select data

  • >= greater than or equal to
  • > greater than
  • == equal to
  • != NOT equal to
  • < less than
  • <= less than or equal to
# What is the average weight of pirates who are over 180cm tall?
tallpirates <- dplyr::filter(dataset, height > 180)
mean(tallpirates$weight)
## [1] 83.9277

Or you could use multiple conditions using the & symbol or | symbol

# What is the average weight of MALE pirates who are over 180cm tall?
tall.male.pirates <- dplyr::filter(dataset, height > 180 & sex == "male")
mean(tall.male.pirates$weight)
## [1] 84.18984
# What is the maximum number of tattoos on tall pirates who have at least 1 parrot OR an eyepatch
tall.parrot.eye <- dplyr::filter(dataset, height > 180 & (parrots >= 1 |eyepatch == 1 ))
max(tall.parrot.eye$tattoos)
## [1] 18

1.8.6 Looking at summary tables

We often want to see summary statistics of our data. We covered the summary command last week and above. But what about categorical data? Just how many pirates liked Monsters Inc.? We can look at this using the table() command.

This counts the number of rows containing each category. For example, we can see here that 50 pirates out of the 1000 chose Monsters Inc as their favourite film

table(dataset$fav.pixar)
## 
##        A Bug's Life               Brave                Cars 
##                  28                  21                  21 
##              Cars 2        Finding Nemo          Inside Out 
##                  14                 162                 304 
## Monsters University      Monsters, Inc.         Ratatouille 
##                  94                  50                   3 
##     The Incredibles           Toy Story         Toy Story 2 
##                  66                  16                  25 
##         Toy Story 3                  Up              WALL-E 
##                  13                 118                  65

We can also go beyond this. For example, let’s break it up by gender - where we can see that 2 non-binary pirates liked the film.

table(dataset$fav.pixar,dataset$sex )
##                      
##                       female male other
##   A Bug's Life            21    6     1
##   Brave                    7   11     3
##   Cars                    10    9     2
##   Cars 2                   9    5     0
##   Finding Nemo            67   85    10
##   Inside Out             139  154    11
##   Monsters University     43   48     3
##   Monsters, Inc.          20   28     2
##   Ratatouille              1    1     1
##   The Incredibles         34   28     4
##   Toy Story                6    9     1
##   Toy Story 2             10   14     1
##   Toy Story 3              7    6     0
##   Up                      55   59     4
##   WALL-E                  35   27     3

Going one step further, we can see that both of the non-binary pirates who chose Monsters, Inc. use cutlasses.

table(dataset$fav.pixar,dataset$sex,dataset$sword.type)
## , ,  = banana
## 
##                      
##                       female male other
##   A Bug's Life             0    0     0
##   Brave                    1    1     0
##   Cars                     0    0     0
##   Cars 2                   1    0     0
##   Finding Nemo             3    2     1
##   Inside Out               3   10     0
##   Monsters University      4    4     1
##   Monsters, Inc.           0    3     0
##   Ratatouille              0    0     0
##   The Incredibles          0    0     0
##   Toy Story                1    1     0
##   Toy Story 2              0    0     0
##   Toy Story 3              0    0     0
##   Up                       7    1     0
##   WALL-E                   1    0     1
## 
## , ,  = cutlass
## 
##                      
##                       female male other
##   A Bug's Life            20    5     0
##   Brave                    6    8     3
##   Cars                     9    6     2
##   Cars 2                   8    4     0
##   Finding Nemo            53   72     9
##   Inside Out             116  131     9
##   Monsters University     31   41     2
##   Monsters, Inc.          17   23     2
##   Ratatouille              1    1     1
##   The Incredibles         28   23     3
##   Toy Story                5    6     1
##   Toy Story 2              9   14     1
##   Toy Story 3              6    5     0
##   Up                      40   54     3
##   WALL-E                  30   21     1
## 
## , ,  = sabre
## 
##                      
##                       female male other
##   A Bug's Life             0    1     0
##   Brave                    0    0     0
##   Cars                     0    1     0
##   Cars 2                   0    1     0
##   Finding Nemo             9    7     0
##   Inside Out               9   10     2
##   Monsters University      5    2     0
##   Monsters, Inc.           1    1     0
##   Ratatouille              0    0     0
##   The Incredibles          1    2     0
##   Toy Story                0    2     0
##   Toy Story 2              1    0     0
##   Toy Story 3              0    1     0
##   Up                       5    1     1
##   WALL-E                   0    3     1
## 
## , ,  = scimitar
## 
##                      
##                       female male other
##   A Bug's Life             1    0     1
##   Brave                    0    2     0
##   Cars                     1    2     0
##   Cars 2                   0    0     0
##   Finding Nemo             2    4     0
##   Inside Out              11    3     0
##   Monsters University      3    1     0
##   Monsters, Inc.           2    1     0
##   Ratatouille              0    0     0
##   The Incredibles          5    3     1
##   Toy Story                0    0     0
##   Toy Story 2              0    0     0
##   Toy Story 3              1    0     0
##   Up                       3    3     0
##   WALL-E                   4    3     0

1.8.7 Basic plots (one variable)

In general, there are two very good websites for making professional graphics in R. I regularly browse both of them along with stack overflow to generate plots to be proud of, or to ask questions like “how do I change the axis size?”.

Histograms

We discussed the basic histogram command last week: hist(variable). For example hist(malepirates$age) makes a histogram of the age column in the data.frame we created earlier on male pirates.

Boxplots

We discussed the basic boxplot command last week: boxplot(variable). For example boxplot(malepirates$age) makes a boxplot of the age column in the data.frame we created earlier on male pirates. To be more fancy:

boxplot(age~sex,         
        data=dataset,                       
        xlab="Gender",ylab="Age") 
        #names=c("Male","Female","Non-binary"))

QQ-Norm plots

We discussed the basic qqnorm command last week: qqnorm(variable). For example qqnorm(malepirates$age) makes a qq-norm plot of the age column in the data.frame we created earlier on male pirates. There is a nicer version inside the ggpubr package.

library(ggpubr)
ggqqplot(malepirates$age,col="blue")

1.9 Tutorial 2D: Distributions and tests

We have talked about several distributions and tests so far in the lab. To see the help files for most of them, see ?Distributions

1.9.1 Normal distribution

To see the help file for all these:

?Normal

To generate a random sample from a normal distribution:

sample.normal <- rnorm(n=100,mean=4,sd=2)

To calculate a z score from your sample/population, you can use R as a calculator.
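
For example, a minimal sketch with made-up numbers - an observation of 6.5 from a N(4,2^2) population:

# z = (observation - mean) / standard deviation
z.score <- (6.5 - 4) / 2
z.score
## [1] 1.25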

To calculate the probability of being greater/less than a value in a given normal distribution (i.e. you can use this like an interactive statistical table):

# probability of less than 1.7 in a normal distribution of N(4,2^2)
pnorm(1.7,mean=4,sd=2,lower.tail = TRUE)
## [1] 0.1250719
# probability of greater than 1 in a normal distribution of N(4,2^2)
1 - pnorm(1,mean=4,sd=2,lower.tail = TRUE)
## [1] 0.9331928
# or
pnorm(1,mean=4,sd=2,lower.tail = FALSE)
## [1] 0.9331928

To calculate the value for a given probability

# what value has 60% of the data below it?
qnorm(0.6,mean=4,sd=2,lower.tail = TRUE)
## [1] 4.506694
# what value has 80% of the data above it?
qnorm(0.8,mean=4,sd=2,lower.tail = FALSE)
## [1] 2.316758

1.9.2 Student’s t-distribution

What even is this? See this nice resource: https://365datascience.com/tutorials/statistics-tutorials/students-t-distribution/

To see the help file for all these:

?TDist

To calculate a t-statistic from your sample/population, you can use R as a calculator. To calculate the probability of being greater/less than a value in a given t-distribution (i.e. you can use this like an interactive t-table):

# probability of seeing less than 1.55 in a t-distribution
# with 20 degrees of freedom
pt(1.55,df=20,lower.tail = TRUE)
## [1] 0.9315892

To calculate the value for a given probability

# what value is greater than 90% of the data in a t-distribution with df=25
qt(0.9,df=25,lower.tail = TRUE)
## [1] 1.316345
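
You can also combine the two ideas above: use R as a calculator to get a t-statistic by hand, then look up its p-value with pt(). A minimal sketch with made-up numbers (sample mean 4.5, hypothesised mean 4, sample standard deviation 2, n = 25):

# t = (sample mean - hypothesised mean) / (s / sqrt(n))
t.stat <- (4.5 - 4) / (2 / sqrt(25))
t.stat
## [1] 1.25
# two-sided p-value, using n - 1 = 24 degrees of freedom
2 * pt(abs(t.stat), df = 24, lower.tail = FALSE)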

To conduct a full t-test on some data:

# Conduct a two-sided t-test where we think that H0: mu(age)=30 on our pirates data (i.e. H1: mu(age) != 30)
t.test(dataset$age,mu=30,alternative="two.sided")
## 
## 	One Sample t-test
## 
## data:  dataset$age
## t = -14.427, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  27.00092 27.71908
## sample estimates:
## mean of x 
##     27.36

or see the detailed tutorial here: http://www.sthda.com/english/wiki/one-sample-t-test-in-r for one-sample

and here for comparing two samples: http://www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r

1.9.3 Shapiro-Wilk test for normality

To test for normality:

# Conduct a Shapiro-Wilk test for normality on the ages of pirates
shapiro.test(dataset$age)
## 
## 	Shapiro-Wilk normality test
## 
## data:  dataset$age
## W = 0.99617, p-value = 0.01456

1.10 Tutorial 3A - Editing code chunks

R code chunks are really useful as you can completely control the output seen in your final report. To edit these, we edit the bit INSIDE the curly brackets at the top of the code chunk.

For example,

Figure 1.15: Replace option1, option2 etc with the things you want the chunk to do

There are a huge number of options available as described here: https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html

For now, we will focus on some basics:

Q: How can I stop all the friendly “loading library” text coming up when I load a library?

Ans: we include message=FALSE and warning=FALSE inside the curly brackets at the top of the code chunk, e.g.

library("yarrr")
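
In the raw .Rmd file, that whole chunk would look something like this (shown here as plain text so you can see the curly-bracket options):

```{r, message=FALSE, warning=FALSE}
library("yarrr")
```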

Q: How can I add a figure caption?

Ans. Using the fig.cap argument e.g. ```{r, fig.cap="my caption"}

Q: How can I make the code invisible?

Ans: Unless specifically asked, I don’t recommend doing this in STAT-462 because we are grading you on your ability to code! But it is a useful skill. Here is how to do it for a figure insert (the most common request). I also gave my code chunk a name (chipmunk.pic), so that I can refer to it later in the text.

For example, this code:
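
(a sketch of such a chunk - the chunk name and caption match the output below, but the image filename chipmunk.jpg is made up)

```{r chipmunk.pic, echo=FALSE, fig.cap="Summer!"}
knitr::include_graphics("chipmunk.jpg")
```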

Will lead to this output (the caption only shows up when you knit)

Figure (chipmunk.pic): Summer!

1.11 Tutorial 3B - Regression Models

Now we will fit our first regression model.

1.11.1 “Standard” regression output

The command to do this is lm(), i.e. linear model.

output <- lm(y_column ~ x_column,data=tablename)
output

NOTE, THE WEIRD ~ SYMBOL. This means “depends on” and it’s how we tell R what the response variable is. E.g. y depends on x, or y=mx+c. 

For example for the starwars data, it would be

# response = mass, predictor = height
star.lm <- lm(mass ~ height,data=starwars)
star.lm
## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Coefficients:
## (Intercept)       height  
##    -13.8103       0.6386

You can also look at more detail using the summary() command:

summary(star.lm)
## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -61.43  -30.03  -21.13  -17.73 1260.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.8103   111.1545  -0.124    0.902
## height        0.6386     0.6261   1.020    0.312
## 
## Residual standard error: 169.4 on 57 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.01792,	Adjusted R-squared:  0.0006956 
## F-statistic:  1.04 on 1 and 57 DF,  p-value: 0.312

In both cases, we have an estimate of the intercept (-13.8103) and of the gradient (0.6386). We will discuss the other values in later labs/lectures.

Now let’s see how to add the regression line to our scatterplot. We can do this using abline(REGRESSION_VARIABLE), where regression_variable is the name of the variable you saved the output of lm to. For example.

plot(starwars$height,starwars$mass)
abline(star.lm)

1.11.2 “Fancy” OLSRR regression output

If you want a different way of seeing the same output, you can use the ols_regress() command inside the olsrr package.

library(olsrr)
star.ols.lm <- ols_regress(mass ~ height,data=starwars)
star.ols.lm
##                           Model Summary                            
## ------------------------------------------------------------------
## R                        0.134       RMSE                 166.502 
## R-Squared                0.018       Coef. Var            174.078 
## Adj. R-Squared           0.001       MSE                28695.756 
## Pred R-Squared          -0.017       MAE                   46.669 
## ------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                 ANOVA                                  
## ----------------------------------------------------------------------
##                    Sum of                                             
##                   Squares        DF    Mean Square      F        Sig. 
## ----------------------------------------------------------------------
## Regression      29854.272         1      29854.272     1.04    0.3120 
## Residual      1635658.069        57      28695.756                    
## Total         1665512.342        58                                   
## ----------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta      t        Sig        lower      upper 
## --------------------------------------------------------------------------------------------
## (Intercept)    -13.810       111.155                 -0.124    0.902    -236.393    208.773 
##      height      0.639         0.626        0.134     1.020    0.312      -0.615      1.892 
## --------------------------------------------------------------------------------------------

The ols_regress command produces beautiful output, but sometimes it doesn’t work well with other commands. So I tend to run a lm command at the same time to have both available.

Sometimes, this command can produce a weird error:

Figure (olsrr.error): This is probably because you loaded the moderndive package

This is probably because you loaded the moderndive package - olsrr and moderndive do not play nicely together. Save your work, restart R, and do not run any line that says library(moderndive)!

1.12 Tutorial 4A: Plotting in ggplot2

There are many tutorials out there, so I don’t want to spam you with my additional instructions.

Here are my top resources for learning how to plot in ggplot2.

  1. There is a data camp course on it! https://learn.datacamp.com/courses/introduction-to-data-visualization-with-ggplot2
  2. Modifiable examples: https://www.r-graph-gallery.com
  3. For basic plotting, I like this: http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization
  4. Another nice tutorial to work through: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html
  5. More details: https://r4ds.had.co.nz/data-visualisation.html
  6. Every detail you could ever want. I use this as a reference. https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/
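If you would just like to see the basic ggplot2 pattern before working through those links, here is a minimal sketch using the starwars height/mass data from the regression tutorial above (the axis labels/units are my own additions):

library(ggplot2)
library(dplyr)   # the starwars data ships with dplyr
data(starwars)

# A ggplot is built in layers: the data, the aesthetic mappings, then a geom
ggplot(starwars, aes(x = height, y = mass)) +
  geom_point() +
  labs(x = "Height (cm)", y = "Mass (kg)")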

1.13 Tutorial 4B: Identifying/removing outliers

Note this tutorial is not about whether you SHOULD remove an outlier or any other value, it’s simply about the mechanics of how to do so if you choose.

OK, let’s make a test dataset. I can do this using the data.frame() command.

test <- data.frame(VarA=c(2,5,3:7,2,6,2,10),VarB=c(1,4,3:7,5,2,5,-30))
test
##    VarA VarB
## 1     2    1
## 2     5    4
## 3     3    3
## 4     4    4
## 5     5    5
## 6     6    6
## 7     7    7
## 8     2    5
## 9     6    2
## 10    2    5
## 11   10  -30
plot(test$VarA,test$VarB,pch=16)

There is a point that looks very suspicious and we want to identify it.

Method 1: Filter

We can look at the plot, then use the filter command to identify it and print it out onto the screen.

filter(test,VarB < -20 & VarA >9)
##   VarA VarB
## 1   10  -30

or similarly:

# This version shows the row number
test[(test$VarB < -20)&(test$VarA >9),]
##    VarA VarB
## 11   10  -30

We can check the outlier by recoloring it on our plot using the lines() command: (NOTE YOU NEED BOTH PLOT AND LINES IN THE SAME CODE CHUNK FOR THIS TO WORK)

outlier <- filter(test,VarB < -20 & VarA >9)

# Make the initial plot
# Add the outlier as type="p" for points.
plot(test$VarA,test$VarB, pch=16)
lines(outlier$VarA,outlier$VarB,col="red",type="p",pch=16)

To remove it, we simply use filter the other way around, e.g. use it to select everything else:

newtest <- filter(test,VarB > -20)
plot(newtest$VarA,newtest$VarB,pch=16)

Method 2: Identify Command

Although R is mostly not interactive, there is one interactive tool you can use to click on a point and identify it, the identify() command.

THIS ONLY WORKS IN THE CONSOLE, DON’T RUN IN A CODE CHUNK

  1. First, make sure that you can clearly see the plot quadrant of your screen (between Files & Packages).

  2. In the console, plot the data e.g. run plot(test$VarA,test$VarB). You should see the plot clearly

  3. Now, in the console, type identify(test$VarA,test$VarB). When you press enter, your mouse will turn into a little cross. Click on all the points you want to identify, THEN PRESS ESCAPE. The row numbers of the points you clicked on will appear on the plot (see the sketch just below this list).
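Putting those steps together, here is a minimal sketch of what you would type (again, in the console, not in a knitted code chunk):

# Run these two lines one at a time IN THE CONSOLE
plot(test$VarA, test$VarB)
identify(test$VarA, test$VarB)  # click the suspicious point(s), then press Escape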

So now we know it is row 11 that is causing the problem. We can take a look at it by typing

# E.g. JUST row 11 and all the columns
test[11, ]
##    VarA VarB
## 11   10  -30

and we can remove it with the minus sign

newtest <- test[-11, ]
plot(newtest$VarA,newtest$VarB,pch=16)

Remember that if you do this more than once, all the row numbers will move around, so I am a fan of using identify to take a look at the data, but using filter to remove it.

1.14 Tutorial 4C Writing equations in R-Studio

1.14.1 In Markdown

Double dollar signs

It is relatively easy to write equations in R markdown. They use the “LaTeX” format and you put them between double dollar signs.

For example, try typing $$x=2$$ on a new line of the white text area in your script ( NOT INTO A CODE CHUNK), then pressing knit. You should see:

\[x=2\]

But how do you write all the fancy equation stuff? We cheat.

Create the equation you want in an online LaTeX equation generator, then copy the code into your script and put it between double dollar signs:

For example

$$\widehat{y} = b_{0}+b_{1}x$$

Shows up as \[\widehat{y} = b_{0}+b_{1}x\]

Single dollar signs

Essentially this is the same, but the equation is part of the text and you only see the output when you press knit. For example including $x=2$ in this sentence shows \(x=2\) as an output.

1.14.2 Using Equatiomatic

There is a really neat package called equatiomatic that we can use to directly make equations from our linear models.

Tutorial here: https://cran.r-project.org/web/packages/equatiomatic/vignettes/intro-equatiomatic.html

Install this package (e.g. using install.packages("equatiomatic")) and load it by putting library(equatiomatic) inside your “libraries” code chunk and re-running.

We can now extract an equation from any linear model variable. For example,

data(starwars)
mymodel <- lm(mass~height,data=starwars)
extract_eq(mymodel)

\[ \operatorname{mass} = \alpha + \beta_{1}(\operatorname{height}) + \epsilon \]

If you run it in the console, the command returns the LaTeX code you need to add it into the text yourself.

Let’s say our data is a sample, not the population; then we want to use lower-case names for our sample statistics (b0 & b1). There is no easy way to do this in the command, so I literally run the command in the console, then copy the LaTeX output over to the TEXT part of the report and change the betas to ’b’s.
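For example, after that find-and-replace, the Star Wars equation above might end up in the text part of your report looking something like this (my manual edit, not equatiomatic output):

\[ \operatorname{\widehat{mass}} = b_{0} + b_{1}(\operatorname{height}) \]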

If you want to use the actual numbers, simply set use_coefs = TRUE.

extract_eq(mymodel, use_coefs = TRUE)

\[ \operatorname{\widehat{mass}} = -13.81 + 0.64(\operatorname{height}) \]

Finally, you probably don’t want the code chunk showing up, so you can use the “echo=FALSE” argument inside the {r} part to stop the code chunk from being visible (see Tutorial 3A). For example, I ran exactly the same code chunk with echo=FALSE activated and got this:

\[ \operatorname{\widehat{mass}} = -13.81 + 0.64(\operatorname{height}) \]

1.15 Tutorial 5A ANOVA

There are two quick and easy ways to make an ANOVA table. Lecture 13 describes it in a lot more detail.

1.15.1 Base R

Simply use the anova() command. For example:

data(starwars)
mymodel <- lm(mass~height,data=starwars)
summary(mymodel)
## 
## Call:
## lm(formula = mass ~ height, data = starwars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -61.43  -30.03  -21.13  -17.73 1260.06 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.8103   111.1545  -0.124    0.902
## height        0.6386     0.6261   1.020    0.312
## 
## Residual standard error: 169.4 on 57 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.01792,	Adjusted R-squared:  0.0006956 
## F-statistic:  1.04 on 1 and 57 DF,  p-value: 0.312
anova(mymodel)
## Analysis of Variance Table
## 
## Response: mass
##           Df  Sum Sq Mean Sq F value Pr(>F)
## height     1   29854   29854  1.0404  0.312
## Residuals 57 1635658   28696

1.15.2 Using OLSRR

In OLSRR, it’s even easier. Our standard ols_regress() command includes ANOVA.

data(starwars)
mymodel <- lm(mass~height,data=starwars)
ols_regress(mymodel)
##                           Model Summary                            
## ------------------------------------------------------------------
## R                        0.134       RMSE                 166.502 
## R-Squared                0.018       Coef. Var            174.078 
## Adj. R-Squared           0.001       MSE                28695.756 
## Pred R-Squared          -0.017       MAE                   46.669 
## ------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                 ANOVA                                  
## ----------------------------------------------------------------------
##                    Sum of                                             
##                   Squares        DF    Mean Square      F        Sig. 
## ----------------------------------------------------------------------
## Regression      29854.272         1      29854.272     1.04    0.3120 
## Residual      1635658.069        57      28695.756                    
## Total         1665512.342        58                                   
## ----------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##       model       Beta    Std. Error    Std. Beta      t        Sig        lower      upper 
## --------------------------------------------------------------------------------------------
## (Intercept)    -13.810       111.155                 -0.124    0.902    -236.393    208.773 
##      height      0.639         0.626        0.134     1.020    0.312      -0.615      1.892 
## --------------------------------------------------------------------------------------------

1.16 Tutorial 5B Diagnostics part A

We know from the lectures that there are multiple assumptions underpinning whether a regression is a good model or not. These assumptions are described in detail in the online STAT 462/501 notes linked throughout this tutorial.

Regression diagnostics are a set of tools we can use to assess these assumptions. Many regression diagnostics include plots of standardized residuals, which we will focus on today.

Let’s use the example from the online textbook. Some researchers (Urbano-Marquez, et al., 1989) were interested in determining whether or not alcohol consumption was linearly related to muscle strength. The researchers measured the total lifetime consumption of alcohol (x) on a random sample of n = 50 alcoholic men. They also measured the strength (y) of the deltoid muscle in each person’s nondominant arm.

testdata <- read.csv("alcoholstrength.csv")
mymodel <- lm(strength~alcohol,data=testdata)
plot(testdata$alcohol,testdata$strength,xlab="Alcohol",ylab="Strength",pch=16)
abline(mymodel)

1.16.1 Residual plots

It can be difficult to assess whether many of the assumptions hold just by looking at the raw data, so in this case we look at the residuals.

We can access the residuals and many other model elements directly from the model:

names(mymodel)
##  [1] "coefficients"  "residuals"     "effects"      
##  [4] "rank"          "fitted.values" "assign"       
##  [7] "qr"            "df.residual"   "xlevels"      
## [10] "call"          "terms"         "model"
# see ?lm for more explanation
testdata$strength_predicted <- mymodel$fitted.values
testdata$residuals          <- mymodel$residuals
head(testdata)
##   alcohol strength strength_predicted residuals
## 1    36.2     1000           1565.911 -565.9109
## 2    39.7     1000           1462.357 -462.3570
## 3    39.5     1080           1468.274 -388.2744
## 4    18.2     1220           2098.474 -878.4739
## 5    29.2     1310           1773.019 -463.0187
## 6    32.5     1400           1675.382 -275.3822

Residuals vs fitted values

The most common residual plot you will see is residuals vs predicted values for each point. You can see how to interpret it here: https://online.stat.psu.edu/stat462/node/117/

Manually this is:

plot(testdata$strength_predicted,testdata$residuals)
abline(h=0) #plot a horizontal line

However, R has many built-in functions for this one:

# using the olsrr package
ols_plot_resid_fit(mymodel)
A residual plot for data that likely passes all the tests for linear regression

Figure 1.16: A residual plot for data that likely passes all the tests for linear regression

We want our residuals to be normally distributed, random (e.g. independent) and showing no pattern or unequal variance.

Residuals vs predictor

Alternatively, we can use this to manually make a plot of the residuals against the x-values (the predictor) - see here for how to interpret this and compare it to the fitted value plot https://online.stat.psu.edu/stat462/node/118/

plot(testdata$alcohol,testdata$residuals)
abline(h=0) #plot a horizontal line

Let’s use these to see how we can test the four assumptions underpinning linear regression (LINE).

Further residual interpretation can be seen here: https://online.stat.psu.edu/stat462/node/124/

1.16.2 Checking Linearity

If you are applying a linear model to data which is non-linear, you are in all sorts of trouble. Your model is simply a bad model and you will need some sort of transformation, or simply a better model altogether.

Don't be the person who makes this linear model!

Figure 1.17: Don’t be the person who makes this linear model!

ALWAYS PLOT YOUR DATA AND LOOK AT IT (even if I don’t tell you to explicitly). For more examples of why this is useful to do, see here: https://online.stat.psu.edu/stat462/node/123/

However, in real life, it can often be difficult to assess whether the fit is linear by looking at the scatterplot alone. So instead, we look at the residuals, which often show a pattern much more clearly. For example:

treadwear <- read.csv("treadwear.csv")
tread_model <- lm(mileage~groove,data=treadwear)
plot(treadwear$groove,treadwear$mileage,xlab="groove",ylab="mileage",pch=16)
abline(tread_model)
This looks relatively linear, but....

Figure 1.18: This looks relatively linear, but….

# using the olsrr package
ols_plot_resid_fit(tread_model)
A parabola would clearly fit the data better

Figure 1.19: A parabola would clearly fit the data better

Compare this residual plot to the strength/alcohol example above in Figure 1.16. You can see in the treadwear data, the residuals depart from 0 in a very systematic manner. They are clearly positive for small x values, negative for medium x values, and positive again for large x values. Clearly, a non-linear model would better describe the relationship between the two variables. In future classes we will touch upon polynomial models that you might use to deal with this.

1.16.3 Checking Independence

This is a tough one. Essentially we are asking whether each row in our table is independent and representative of the population. So essentially we are looking for confounding variables. Or to put it another way: is there another variable that we need to include in our model?

Let’s say our data is not independent, but we don’t notice. What happens? First off, this isn’t as terrible as when linearity is broken. The linear relationship likely still stands between our predictor and our response variable. So for example, there really is a relationship between height and weight that’s relatively linear. Or house prices in our earlier State College example might really be taken from a normal distribution.

BUT, if we had taken this new variable into account ALSO, we might have done an even better job. So for example, the relationship between height and weight is definitely affected by someone’s gender. Or in the example from Lab 3, knowing how close a house is to “downtown” is a better predictor of price than knowing how far East/West it is.

Equally, the non-independence of our data makes it hard for us to be sure how good our model is or for us to make predictions. For example, if we were looking at height and weight and our sample was made up only of people who identified as women, then there might be clear limitations in predicting the weight of the men’s basketball team.

Because every case is different, there is no “set method” for checking independence, but there are a few common ways to check things that you are suspicious of. So the first step is to think about the problem: what might impact your response variable that you have not taken into account?

Covariance matrix

Looking at correlations is a quick (but often misleading) way to assess what is happening. Essentially we can look at the correlation between each column of data.

The most basic way to do this is very quick and easy. For example, let’s look at our Star Wars height/mass data.
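As a minimal sketch of that basic approach, the base-R cor() command gives the correlation between two numeric columns (use = "complete.obs" simply drops the rows with missing values):

library(dplyr)   # for the starwars data
data(starwars)

# Correlation between height and mass, ignoring rows with missing values
cor(starwars$height, starwars$mass, use = "complete.obs")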

The ggpairs() function from the GGally package allows us to see the histogram of each column we choose to plot - along with the correlation of any numeric pairs. Let’s look at this for our house dataset from Lab 3.

# Quick display of two capabilities of GGally, to assess the distribution and correlation of variables 
house <- read_excel("Lab03_house.xlsx")

# Choose column names - let's say I don't care about location
colnames(house)
## [1] "House.Price"      "House.Age"        "Distance.Station"
## [4] "Number.Shops"     "Latitude"         "Longitude"
# Create plot - note I have message=TRUE and warning=TRUE turned on at the top of my code chunk
ggpairs(house[,c("House.Price" ,"House.Age","Distance.Station","Number.Shops" )])

You can simply look at the correlations of any NUMERIC columns using the corrplot code, as we did in Lab 3.

library(corrplot)
house.subset <- house[ , sapply(house,is.numeric)]
corrplot(cor(house.subset),method="ellipse",type="lower")

There are LOADS of other ways to run correlation plots here: https://www.r-graph-gallery.com/correlogram.html. Feel free to choose a favourite.

Importantly, remember back to this website - https://www.tylervigen.com/spurious-correlations. Just because another variable is correlated with our response does not mean it HAS to be in the model. It simply means that you might want to consider whether there is a reason for that correlation.

Scatterplots and residual plots with a suspected confounding variable

A second way we looked at in a previous lab was to add colour to our scatterplot and see if we can see any influence from a suspected confounding variable.

# Create a plot
p <- house %>%                  
  ggplot( aes(Latitude,House.Price, col= Distance.Station)) +
  geom_point() +
  theme_classic()+
  scale_color_gradient(low="blue", high="red")

# and actually plot it
ggplotly(p)

However, this can be hard to see - so instead we can plot the residuals, but have the x-axis as the variable we think is affecting independence. For example:

housemodel            <- lm(House.Price~Latitude,data=house)
house$price_predicted <- housemodel$fitted.values
house$residuals       <- housemodel$residuals

plot(house$Distance.Station, house$residuals)
These residuals do not look randomly distributed on the page.

Figure 1.20: These residuals do not look randomly distributed on the page.

This definitely doesn’t look like a random cloud! It looks like distance to the station is really impacting the linear model between house price and latitude.

In this case we will use MULTIPLE REGRESSION (later in the course).

If your data includes a date or time-stamp

Then you have time-series data and there is an ENTIRE COURSE on modelling it (https://online.stat.psu.edu/stat510/). For now, if your data contains any time stamp or date, be very cautious. Chances are that your data contains auto-correlation, e.g. the response tomorrow is likely to be affected by the response today.

In the short term, if you know the order in which your data was collected, you could plot a residuals/order plot as described here: https://online.stat.psu.edu/stat462/node/121/.
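For example, here is a minimal sketch using the strength/alcohol model from above - and assuming (which I do not actually know) that the rows of testdata are in the order the measurements were collected:

# Residuals plotted in (assumed) collection order - look for drifts or waves
plot(mymodel$residuals, type = "b",
     xlab = "Observation order", ylab = "Residual")
abline(h = 0)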

If your data includes a location

Spatial data is often not independent due to Tobler’s Law (close-together things are more similar than those further apart). For example, imagine you were looking at the correlation between money spent on alcohol and undergraduate degree score. Your relationship might be different in areas with lower drinking age-limits, or in countries where drinking is outlawed.

So knowing the location of your sample is important. The easiest thing we can do here to check for independence is to make a literal map of the residuals. For example, remember in lab 3 - let’s do something similar

# Command from the sf library
# Make a spatial version of the data using the Longitude and Latitude columns
house.spatial <- st_as_sf(house,coords=c("Longitude","Latitude"),crs = 4326)

# make interactive, for static set as "plot", for interactive set as "view"
tmap_mode("plot")

# Command from the tmap library
# and plot
qtm("Esri.WorldTopoMap") + 
     qtm(house.spatial, # data
         symbols.col="residuals", # which column for the symbols
         symbols.size=.4, # how big
         symbols.palette="RdBu",
         midpoint=0) # color breaks

It definitely looks like the residuals are not randomly distributed in space, which makes me suspicious that there is spatial autocorrelation (e.g. correlation with itself - the location of a point impacts its residual).

We can test this through a statistic called Moran’s I - which compares the residual at each point against the residuals of surrounding points. In my spatial statistics course, we spend about 3 weeks on Moran’s I, so there is a lot of nuance we are skipping over here. For more, see https://mgimond.github.io/Spatial/spatial-autocorrelation.html

library(moranfast)
moranfast(house$residuals, 
          house$Longitude, 
          house$Latitude, 
          alternative = "two.sided")
## $observed
## [1] 0.1732675
## 
## $expected
## [1] -0.002421308
## 
## $sd
## [1] 0.02267439
## 
## $p.value
## [1] 9.325873e-15

But essentially here we have a hypothesis test:

  • H0: There is no spatial autocorrelation. The “colour”/value of each residual has complete spatial randomness. I = 0
  • H1: Spatial autocorrelation is present. There is a significant pattern in the “colour”/values of the points. I != 0

Test - The Global Moran’s I statistic is the correlation coefficient for the relationship between a variable (like income) and itself at surrounding values.

P-value - the probability of seeing this pattern if H0 really was true.

So here in this case, the p-value is very low, confirming what we visually see in the plot above. There is evidence to reject the null hypothesis and suggest that spatial autocorrelation is present.

If this is true, we can use spatial GLMs to take this into account (again, later in the course).

1.16.4 Checking Normality

There are three things we might want to check around normality and regression - but essentially we are testing whether the RESIDUALS are normal around the regression line. To get the best estimates of parameters such as B0 and B1, the residuals must be normally distributed around the regression line, but this typically has a much smaller effect than non-linearity or breaking the assumption of independence.

Breaking normality is only important for the calculation of p-values for significance testing and confidence intervals, and even then it is only a problem when the sample size is small. When the sample size is larger (>200), the Central Limit Theorem ensures that the distribution of residuals will approximate normality when calculating parameters.

So never throw away your data if it breaks normality. But it’s good to take a look.

Let’s go back to our original test data:

mymodel <- lm(strength~alcohol,data=testdata)
plot(testdata$alcohol,testdata$strength,xlab="Alcohol",ylab="Strength",pch=16)
abline(mymodel)

To test normality, we can use OLSRR to:

#Create a QQ plot of the residuals
ols_plot_resid_qq(mymodel)

#Create a histogram of the residuals
ols_plot_resid_hist(mymodel)

#Run Shapiro-Wilk (and other) normality tests
ols_test_normality(mymodel)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9877         0.8767 
## Kolmogorov-Smirnov        0.0502         0.9990 
## Cramer-von Mises          4.1667         0.0000 
## Anderson-Darling          0.1699         0.9289 
## -----------------------------------------------

In this case, our residuals look very normal.

Compare that to our house dataset, which suggests that the data is mildly non-normal.

housemodel <- lm(House.Price~Latitude,data=house)
plot(house$House.Price~house$Latitude,xlab="Latitude",ylab="House Price",pch=16)
abline(housemodel)

#Create a QQ plot of the residuals
ols_plot_resid_qq(housemodel)

#Create a histogram of the residuals
ols_plot_resid_hist(housemodel)

#Run Shapiro-Wilk (and other) normality tests
ols_test_normality(housemodel)
## Warning in ks.test(y, "pnorm", mean(y), sd(y)): ties should not be
## present for the Kolmogorov-Smirnov test
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9411         0.0000 
## Kolmogorov-Smirnov        0.0776         0.0138 
## Cramer-von Mises          34.21          0.0000 
## Anderson-Darling          4.3556         0.0000 
## -----------------------------------------------

The answer to non-normal data is to use a GLM (Generalised Linear Model), which again we will get to later in the semester.

Alternatively you can use your linear model, but make the significance level more rigorous (say 1% rather than 5%) to take into account the fact that the assumptions underpinning the calculations might be flawed.

In terms of detailed interpretation, read this page: https://online.stat.psu.edu/stat462/node/122/

1.16.5 Checking Equal Variance/Homoscedasticity

The easiest thing to do here is to plot the residuals, then see if you can spot any variation in the variance. For example, in our house regression the residuals look like this:

ols_plot_resid_fit(housemodel)

You can clearly see here that for low fitted values there is not much variance - but there is a lot of variance for fitted values between 30 and 50.

You can also run statistical tests, explained in more detail here: https://rpubs.com/tskam/Session06

# F test - assumes residuals are  independent and identically distributed (i.i.d.)
ols_test_f(housemodel)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of House.Price 
## 
##       Test Summary        
##  -------------------------
##  Num DF     =    1 
##  Den DF     =    412 
##  F          =    2.483727 
##  Prob > F   =    0.1157969

In this case, we can see that perhaps we need to worry less about unequal variance.

There is also another test available, the Breusch Pagan Test - but this relies on the residuals being normally distributed.
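If your residuals do look roughly normal, the olsrr package provides ols_test_breusch_pagan() for this; here it is on the house model purely as an illustration (we saw above that these particular residuals are probably not normal enough for the test to be fully trustworthy):

# Breusch Pagan test (olsrr) - Ho: constant variance, Ha: non-constant variance
ols_test_breusch_pagan(housemodel)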

1.17 Tutorial 5C Confidence and Prediction Intervals

Essentially these are two different things:

Confidence interval: “Error bar” on the location of the regression line at any specific value of x, or “what is the uncertainty on the population mean of y, for a specific value of x?”

Prediction interval: Predicting the range of likely y-values for a new data-point, or “if you had to put a new dot on the scatterplot, what range of y values would it fall under?”

These are described in detail here:

In R:

#For our pirate weight/height dataset
lm.pirate <- lm(weight~height,data=pirates)
summary(lm.pirate)
## 
## Call:
## lm(formula = weight ~ height, data = pirates)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3592  -2.6564  -0.0708   2.7275  11.1451 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -68.87722    1.71250  -40.22   <2e-16 ***
## height        0.81434    0.01003   81.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.928 on 998 degrees of freedom
## Multiple R-squared:  0.8684,	Adjusted R-squared:  0.8683 
## F-statistic:  6587 on 1 and 998 DF,  p-value: < 2.2e-16
# What are the AVERAGE WEIGHTS (and uncertainty on our estimate) 
# of pirates whose heights are 150cm and 170cm?
new.pirates <- data.frame(height=c(150,170))
predict(lm.pirate,newdata=new.pirates,interval="confidence",level=0.95)
##        fit      lwr      upr
## 1 53.27345 52.80652 53.74037
## 2 69.56020 69.31640 69.80400

So we are 95% confident that the AVERAGE weight of pirates who are 150cm tall falls between 52.8Kg and 53.74Kg.

# A new pirate joins and her height is 160cm.  What range of weights is she likely to have?
new.pirates <- data.frame(height=c(160))
predict(lm.pirate,newdata=new.pirates,interval="prediction",level=0.95)
##        fit      lwr      upr
## 1 61.41682 53.70196 69.13169

Her weight is likely to be somewhere between 53Kg and 69Kg with 95% certainty given our sample.

1.18 Tutorial 6A Diagnostics part B: Outliers

There are three key things to know when it comes to outliers:

  1. We use the word outlier to describe an observation (point on the scatterplot) that has a very different response from the predicted response from the model e.g. it does not follow the trend determined by the rest of the dataset. We consider it an outlier only if it is an extreme “y” value e.g. an extreme response.

  2. If a data point has an x-value that is extremely different (either too high or too low) from the rest of the data points, we call this a high leverage point. It might, or might not be an outlier.

  3. We call a data point an influential point if that data point has a considerable impact on the regression model. For instance, if the model fit changes considerably when we remove a point, that data point is called an influential point. Influential points tend to be further from the mean: we know the regression line goes through the mean of x and the mean of y, so points far from the mean tend to tilt the line like a see-saw.

Both outliers and high leverage points can be influential, depending on where they are located in the scatterplot. We can easily see them using a fits vs residuals plot (we’ll get to this below).

For more details here, see https://online.stat.psu.edu/stat501/lesson/11/11.1

1.18.1 Visual inspection

The first way we can check for simple linear regression is to plot the data and take a look. Here are some examples that we can assess by eye which show the different effects.

No outliers, influential or high leverage points
No outliers, influential values or high leverage points

Figure 1.21: No outliers, influential values or high leverage points

In the figure above, all of the data points follow the general trend of the rest of the data, so there are no outliers (in the y direction). And, none of the data points are extreme with respect to x, so there are no high leverage points. Overall, none of the data points would appear to be influential with respect to the location of the best fitting line. e.g. if we removed any one point, the line would probably be the same.

An outlier with no leverage
One outlier. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

Figure 1.22: One outlier. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

In the figure above, most of the data points follow the general trend of the rest of the data, but there is one clear outlier (one point that is unusual in the y direction). However, no point has an extreme x value, so there are no high leverage points. Overall, none of the data points would appear to be influential with respect to the location of the best fitting line, e.g. when we removed the red point, the line of best fit remained relatively stable.

A high leverage point that isn’t an outlier
No outliers, but one high leverage point. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

Figure 1.23: No outliers, but one high leverage point. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

In the figure above, most of the data points follow the general trend of the rest of the data, so there are no outliers (in the y direction). But one data point is extreme with respect to x, making it a high leverage point. Overall, none of the data points would appear to be influential with respect to the location of the best fitting line, e.g. when we removed the red point, the line of best fit remained relatively stable.

An influential high leverage outlier
One influential high leverage outlier. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

Figure 1.24: One influential high leverage outlier. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

In the figure above, most of the data points follow the general trend of the rest of the data, with one clear outlier. This point also has high leverage and appears to be very influential, e.g. when we removed the red point, the line of best fit changed hugely.

Here with a simple regression, we can easily see outliers. This is much harder when we have many predictors. So as well as examining the data by eye, we can use diagnostic plots.

1.18.2 Detecting leverage

Leverage (\(h_{ii}\)) is used for measuring unusual observations in the x-direction (high leverage points). Essentially, the leverage quantifies the influence that the observed response has on its predicted value. That is, if the leverage is small, then the observed response plays only a small role in the value of the predicted/modelled response. On the other hand, if the leverage is large, then the observed response plays a large role in the value of the predicted response.

This quantity is based on

  • The distance of \(x_{i}\) from the bulk of the x’s.
  • The extent that the fitted regression line is attracted by the given point.

The Leverage, \(h_{ii}\) associated with the ith datapoint is:

\[h_{ii}=\frac{1}{n}+\frac{\left(x_i-\bar{x}\right)^2}{\sum_{j=1}^n\left(x_j-\bar{x}\right)^2}\]

and is described further in https://online.stat.psu.edu/stat501/lesson/11/11.2

You don’t need to remember the equation as R does the calculation for you, but here are some important properties of the leverages:

  1. The leverage is a measure of the distance between the x value for the data point and the mean of the x values for all n data points.
  2. The leverage is a number between 0 and 1, inclusive.
  3. The sum of the leverages \(h_{ii}\) equals p, the number of parameters (regression coefficients including the intercept).

A rule of thumb would be to classify \(x_{i}\) as a high leverage point if

\[h_{ii}>\frac{2p}{n}\] where p = number of regression coefficients (2 for SLR) and n = number of observations

We can calculate and plot the leverage easily for each point in either the base package (hatvalues), or using OLSRR (ols_leverage).
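As a quick sketch of that equivalence (re-using the strength/alcohol model, mymodel, from earlier in this tutorial), the two should give identical numbers:

# Leverage for each observation: base R vs olsrr - these should match
head(hatvalues(mymodel))
head(ols_leverage(mymodel))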

Let’s see how this works for our scenarios above. If you are coding this, you don’t necessarily need the left hand plot - but it’s helpful for me showing you what is going on.

# read the data
data <- read.csv("neither.csv")

#calculate the model
model <- lm(y~x,data=data)
leverage <- ols_leverage(model) 

# Set up 2 sub-plots one next to each other
layout(matrix(c(1,2), 1, 2, byrow = TRUE))

# plot 1 (left hand side)
plot(data$x,data$y,pch=16,xlab="x",ylab="y",main="Dataset A"); 
abline(model)

# plot 2 (right hand side). Remember to choose your own ylim
plot(data$x,leverage,pch=16,ylim=c(0,.6),
     xlab="x",ylab="leverage",type="h",
     main="Dataset A leverage"); 

# Calculate the critical value of leverage and add to plot
n=nrow(data)
p=length(coefficients(model))
cutLev=2*p/n
abline(h=cutLev,col="blue",lty=2)
No outliers, influential values or high leverage points. The highest x point is around the threshold for high leverage, but nothing special here

Figure 1.25: No outliers, influential values or high leverage points. The highest x point is around the threshold for high leverage, but nothing special here

No outliers, but one high leverage point. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

Figure 1.26: No outliers, but one high leverage point. The red-dashed line is the model including all the points. The black-solid line is the model with the red point removed

An important distinction!

There is such an important distinction between a data point that has high leverage and one that has high influence that it is worth saying it one more time:

  • The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.
  • The leverage depends only on the predictor values.
  • Whether the data point is influential or not also depends on the observed value of the response.

1.18.3 Detecting outliers

In lab 5, we mentioned two measures that we use to help identify outliers. They are:

  • Residuals
  • Studentized residuals (or internally studentized residuals) (often called standardized residuals)

First, briefly review these measures using this page: https://online.stat.psu.edu/stat501/lesson/11/11.3

OLSRR also offers several more plots and tests including :

  • Cook’s D Bar Plot
  • Cook’s D Chart
  • DFBETAs Panel
  • DFFITs Plot
  • Studentized Residual Plot
  • Standardized Residual Chart
  • Studentized Residuals vs Leverage Plot
  • Deleted Studentized Residual vs Fitted Values Plot
  • Hadi Plot
  • Potential Residual Plot

For now, we will focus on one of the most effective ways to assess residuals, the studentized residual/fits plot.

For example for our test data:

# read the data
data <- read.csv("neither.csv")

#calculate the model
model <- lm(y~x,data=data)
leverage <- ols_leverage(model) 

# plot 1 (left hand side)
plot(data$x,data$y,pch=16,xlab="x",ylab="y",main="Dataset A"); 
abline(model)

# plot 2 (right hand side). Remember to choose your own ylim
ols_plot_resid_stud(model)

There is no absolute cut-off here (a studentized residual of around 2-3 is a warning sign). Instead, take these as an opportunity to explore those points further. For example, here is the same plot for the dataset containing the outlier:

# read the data
data1 <- read.csv("outlier.csv")

#calculate the model
model1 <- lm(y~x,data=data1)

# plot 1 (left hand side)
plot(data1$x,data1$y,pch=16,xlab="x",ylab="y",main="Dataset B"); 
abline(model1)

# plot 2 (right hand side). Remember to choose your own ylim
ols_plot_resid_stand(model1)

Here the plot is telling me that it thinks row 21 of the dataset might be an outlier:

data1[21,]
##    x  y
## 21 4 40

1.18.4 Detecting influential points

If a point is both an outlier AND has leverage, chances are it will be influential over the fit.

OLSRR has a nice way of summarising both statistics. For example here are our four plots together:

# read the data
data <- read.csv("neither.csv")
data2 <- read.csv("outlier.csv")
data2b <- data2[-21,]
data3 <- read.csv("leverage.csv")
data3b <- data3[-25,]
data4 <- read.csv("influential.csv")
data4b <- data4[-25,]

#calculate the model
model <- lm(y~x,data=data)
model2 <- lm(y~x,data=data2)
model3 <- lm(y~x,data=data3)
model4 <- lm(y~x,data=data4)

# Set up 4 sub-plots one next to each other
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))

plot(data$x,data$y,pch=16,xlab="x",ylab="y",main="Dataset A",col="black") 
abline(model)

plot(data2$x,data2$y,pch=16,xlab="x",ylab="y",main="Dataset B",col="red") 
lines(data2b$x,data2b$y,pch=16,type="p")
abline(model2)

plot(data3$x,data3$y,pch=16,xlab="x",ylab="y",main="Dataset C",col="red") 
lines(data3b$x,data3b$y,pch=16,type="p")
abline(model3)

plot(data4$x,data4$y,pch=16,xlab="x",ylab="y",main="Dataset D",col="red",ylim=c(0,100)) 
lines(data4b$x,data4b$y,pch=16,type="p")
abline(model4)
Our four examples

Figure 1.27: Our four examples

# model A
ols_plot_resid_lev(model)

# model B
ols_plot_resid_lev(model2)

# model C
ols_plot_resid_lev(model3)

# model D
ols_plot_resid_lev(model4)

Cook’s Distance is another method for identifying influential points

Cook’s distance, \(D_i\), is measured by deleting one observation at a time and each time re-fitting the regression line on the remaining observations. By comparing the results from the n observations to the results with ith observation deleted, we can get an idea about the influence from the ith observation on the fitted model.

While there is no set cut-off for flagging a datapoint as an influential point, an observation whose \(D_i\) stands well above the rest would most likely be an influential point.

Again for our four examples:

# model A
ols_plot_cooksd_bar(model,type=3)

# model B
ols_plot_cooksd_bar(model2,type=3)

# model C
ols_plot_cooksd_bar(model3,type=3)

# model D
ols_plot_cooksd_bar(model4,type=3)

1.18.5 Making a table of your residual diagnostics

If your data is not huge, making a table of your residual diagnostics often makes it much easier to work out which points you are looking at.

In fact we can do this very easily simply by making new columns in our table of data. Let’s try it with our influential dataset.

testdata <- read.csv("influential.csv")
mymodel <- lm(y~x,data=testdata)

# Make a new column with the predicted y value
testdata$y_predicted <- mymodel$fitted.values

# Make a new column with the raw residuals
testdata$residuals_raw <- mymodel$residuals

# Make a new column with the standardised/studentised residuals
library(MASS)
testdata$residuals_stud <- studres(mymodel)

# Make a new column with the leverage
testdata$x_leverage <- ols_leverage(mymodel)

# Make a new column with the Cook's distance (cooks.distance comes from base R's stats package)
testdata$cooks.distance <- cooks.distance(mymodel)

# Print out the table
head(testdata)
##         x       y y_predicted residuals_raw residuals_stud
## 1  7.4359 19.5433    18.86669     0.6766147     0.06528349
## 2 20.2564 30.3125    27.04441     3.2680919     0.31064806
## 3 14.6154 18.3894    23.44622    -5.0568228    -0.48479619
## 4  9.2308  7.2356    20.01159   -12.7759856    -1.27279635
## 5 34.3590 35.3125    36.03994    -0.7274351    -0.06901578
## 6 40.7692 60.3125    40.12876    20.1837353     2.11410204
##   x_leverage cooks.distance
## 1 0.07647225   0.0001844385
## 2 0.04447583   0.0023377280
## 3 0.05476095   0.0070421925
## 4 0.07013913   0.0594946602
## 5 0.04484163   0.0001168648
## 6 0.05732357   0.1180798516

If you click on the NAME of the table (e.g. testdata in the environment tab) then it will open the table in a new tab which makes it easier to view things.

How to sort the table in R-Studio. click the tiny arrows

Figure 1.28: How to sort the table in R-Studio. click the tiny arrows

If you’re sick of looking at things in R, you can save the table to a csv file and open it in Excel. If you are running your project, this should save into the folder where your project file is (you can add a different directory if required)

write.csv(testdata,"Output.csv",quote=FALSE)

Finally, there might be too much data to look at things manually. You can now use commands like filter on your new table to find which rows are of interest

# Filter rows with high cooks distance
dplyr::filter(testdata,  cooks.distance>1)
##          x  y y_predicted residuals_raw residuals_stud x_leverage
## 1 101.5385 50    78.89125     -28.89125      -7.870108  0.5580226
##   cooks.distance
## 1       10.71394

If you decide to remove an outlier, you can do it in Excel! (save the file as something new first, then read that back in).

Or you can remove a specific row with the - symbol

row_highcooks <- which(testdata$cooks.distance > 1)
print(paste("Row number: ", row_highcooks))
## [1] "Row number:  25"
newdata <- testdata[-row_highcooks, ]

## BE CAREFUL.  AS YOU REMOVE A DATA POINT ALL THE ROW NUMBERS WILL SHIFT AROUND, SO RUN THE WHICH COMMAND EACH TIME

Or you can simply use filter the other way around, but again be careful that you only remove the data you mean to

# Filter rows with high cooks distance
newdata <- dplyr::filter(testdata,  cooks.distance <= 1)

1.19 Tutorial 6B Transformations

We will talk in the class (or read here! https://online.stat.psu.edu/stat501/lesson/9 ) about why to apply transformations. Here I will simply show you how.

This is very similar to the method in 6a. We simply make a new column and use maths to fill in the value.

For example:

# read the data
data <- read.csv("neither.csv")
head(data)
##         x       y
## 1 10.2564 16.4663
## 2  3.0769  4.9279
## 3 15.1282 26.4663
## 4 27.4359 30.3125
## 5 12.0513 13.7740
## 6 23.0769 13.7740

To make a column with \(\frac{1}{x}\) as the predictor, we do

# Note I am trying to avoid spaces  or special characters in my data
data$transform_1overx <- 1/data$x

To make a column with \(x^2\) as the predictor, we do

# Note I am trying to avoid spaces  or special characters in my data
data$transform_xsquare <- data$x^2

To make a column with \(\log(x)\) as the predictor, we do (this is the NATURAL LOG)

# Note I am trying to avoid spaces  or special characters in my data
data$transform_lnx <- log(data$x)

To make a column with \(\log_{10}(x)\) as the predictor, we do

# Note I am trying to avoid spaces  or special characters in my data
data$transform_logx <- log(data$x,base=10)

To make a column with \(\sqrt{x}\) as the predictor, we do

# Note I am trying to avoid spaces  or special characters in my data
data$transform_sqrtx <- sqrt(data$x)

And let’s see the output

head(data)
##         x       y transform_1overx transform_xsquare
## 1 10.2564 16.4663       0.09750010        105.193741
## 2  3.0769  4.9279       0.32500244          9.467314
## 3 15.1282 26.4663       0.06610172        228.862435
## 4 27.4359 30.3125       0.03644859        752.728609
## 5 12.0513 13.7740       0.08297860        145.233832
## 6 23.0769 13.7740       0.04333338        532.543314
##   transform_lnx transform_logx transform_sqrtx
## 1      2.327902      1.0109950        3.202561
## 2      1.123923      0.4881134        1.754109
## 3      2.716561      1.1797873        3.889499
## 4      3.311852      1.4383192        5.237929
## 5      2.489173      1.0810339        3.471498
## 6      3.138832      1.3631775        4.803842

We could now perform our linear regression with any of these as our predictor. For example

newmodel <- lm(y~transform_sqrtx,data=data)
plot(data$y~data$transform_sqrtx,xlab="Sqrt(x) UNITS",ylab="y UNITS")

You don’t have to make a new column in advance, you can do the maths within the regression itself:

newmodel2 <- lm(y~sqrt(x),data=data)

# For Polynomial regression you can use the poly command e.g. X+X^2 

newmodel3 <- lm(y ~ poly(x,2),data=data)
summary(newmodel3)
## 
## Call:
## lm(formula = y ~ poly(x, 2), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.945  -6.344   1.280   6.887  15.652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   56.823      1.936  29.352  < 2e-16 ***
## poly(x, 2)1  144.687     10.244  14.124 2.03e-13 ***
## poly(x, 2)2   -7.937     10.244  -0.775    0.446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.24 on 25 degrees of freedom
## Multiple R-squared:  0.8889,	Adjusted R-squared:  0.8801 
## F-statistic:   100 on 2 and 25 DF,  p-value: 1.174e-12
equatiomatic::extract_eq(newmodel3)

\[ \operatorname{y} = \alpha + \beta_{1}(\operatorname{poly(x,\ 2)}_{\operatorname{1}}) + \beta_{2}(\operatorname{poly(x,\ 2)}_{\operatorname{2}}) + \epsilon \]

1.20 Tutorial 6C Basic model comparisons

You are now being asked to assess two (or more) models and decide “which is best”.

We will talk about this in more detail later in the semester, but for now, there are two easy ways we can compare different models.

Coefficient of Determination \(R^2\)

We could look first at the coefficient of determination for each model in the model summary, e.g. which model explains more variation in your response variable.

AIC

There is another measure called AIC (read more here: https://online.stat.psu.edu/stat501/lesson/10/10.5). For now, know that the lower the AIC, the “better” the model. AIC is a model-comparison criterion that takes into account the number of predictors and the amount of data, so it is often more robust to bad linear fits than \(R^2\) (which needs LINE to be true).

Let’s compare two models now, using our transformed data:

model1 <- lm(y~x,data=data)
model2 <- lm(y~sqrt(x),data=data)

model1summary <- summary(model1)
model2summary <- summary(model2)

# Adjusted R2
paste("Model 1:",round(model1summary$adj.r.squared,2) )
## [1] "Model 1: 0.88"
paste("Model 2:",round(model2summary$adj.r.squared,2) )
## [1] "Model 2: 0.87"
# AIC
AIC(model1,model2)
##        df      AIC
## model1  3 213.2457
## model2  3 216.4706

We can see here that model 1 has both a (slightly) higher \(adj-R^2\) and a lower AIC, so both measures suggest that model 1 does a better job.

YOU SHOULD NEVER USE THESE NUMBERS IN ISOLATION. Also look at the data and use your common sense. Use them as guidelines.

1.21 Tutorial 7A Covariance matrices

THIS TUTORIAL SUMMARISES EARLIER MATERIAL FROM TUTORIAL 5B (included here for convenience)

Looking at correlations is a quick way to assess what is happening. Essentially we can look at the correlation between each column of data.

Importantly, remember back to this website - https://www.tylervigen.com/spurious-correlations. Just because another variable is correlated with our response does not mean it HAS to be in the model. It simply means that you might want to consider whether there is a reason for that correlation.

There are two easy packages to look at this data.

1.21.0.0.1 GGally package

The ggpairs() function from the GGally package allows us to see the histogram of each column we choose to plot - along with the correlation of any numeric pairs. Let’s look at this for our house dataset from a previous lab.

# Quick display of two capabilities of GGally, to assess the distribution and correlation of variables 
house <- read_excel("Lab03_house.xlsx")

# Choose column names - let's say I don't care about location
colnames(house)
## [1] "House.Price"      "House.Age"        "Distance.Station"
## [4] "Number.Shops"     "Latitude"         "Longitude"
# Create plot - note I have message=TRUE and warning=TRUE turned on at the top of my code chunk
ggpairs(house[,c("House.Price" ,"House.Age","Distance.Station","Number.Shops" )])

1.21.0.0.2 Corrplot package

You can simply look at the correlations of any NUMERIC columns using the corrplot code, as we did in Lab 3.

library(corrplot)
house.subset <- house[ , sapply(house,is.numeric)]
corrplot(cor(house.subset),method="ellipse",type="lower")

1.21.0.0.3 Other packages

There are LOADS of other ways to run correlation plots here: https://www.r-graph-gallery.com/correlogram.html. Feel free to choose a favourite.

1.22 Tutorial 7B Multiple Regression Models

Fitting a multiple regression model is just like fitting Simple Linear Regression. We simply add in the other variables you want to look at.

For example:

Simple linear regression between our house price response variable and house age (predictor) is

slr.model <- lm(House.Price ~ House.Age,data=house)
slr.model
## 
## Call:
## lm(formula = House.Price ~ House.Age, data = house)
## 
## Coefficients:
## (Intercept)    House.Age  
##     42.4347      -0.2515
equatiomatic::extract_eq(slr.model,use_coefs=TRUE)

\[ \operatorname{\widehat{House.Price}} = 42.43 - 0.25(\operatorname{House.Age}) \]

Multiple linear regression between our house price response variable and all our predictors is:

mlr.model <- lm(House.Price ~ House.Age + Distance.Station + Number.Shops ,data=house)
mlr.model
## 
## Call:
## lm(formula = House.Price ~ House.Age + Distance.Station + Number.Shops, 
##     data = house)
## 
## Coefficients:
##      (Intercept)         House.Age  Distance.Station  
##        42.977286         -0.252856         -0.005379  
##     Number.Shops  
##         1.297442
equatiomatic::extract_eq(mlr.model,use_coefs=TRUE)

\[ \operatorname{\widehat{House.Price}} = 42.98 - 0.25(\operatorname{House.Age}) - 0.01(\operatorname{Distance.Station}) + 1.3(\operatorname{Number.Shops}) \]

1.23 Tutorial 7C Residuals in MLR

Because all of our model assumptions centre around residuals, we can simply use the same code to assess LINE and outliers. E.g.

# Raw residuals
ols_plot_resid_fit(mlr.model)

# Studentized residuals
ols_plot_resid_stud(mlr.model)

# QQ plot, histogram and normality test of residuals
ols_plot_resid_qq(mlr.model)

ols_plot_resid_hist(mlr.model)

ols_test_normality(mlr.model)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.8786         0.0000 
## Kolmogorov-Smirnov        0.0858         0.0045 
## Cramer-von Mises         36.0211         0.0000 
## Anderson-Darling          7.4522         0.0000 
## -----------------------------------------------
# Cook's distance - you can change the threshold for "influential" using the type option
# See ?ols_plot_cooksd_bar for what type=3 corresponds to
ols_plot_cooksd_bar(mlr.model,type=3)

# Influential points (note you can't change the threshold here for influential)
ols_plot_resid_lev(mlr.model)

We can see that there are significant issues with this particular fit that we would need to explore!

As before, we can add all the model diagnostics to our original table, to allow us to query them more efficiently.

# Make a new column with the predicted y value
house$price_predicted <- mlr.model$fitted.values

# Make a new column with the raw residuals
house$residuals_raw <- mlr.model$residuals

# Make a new column with the standardised/studentised residuals
house$residuals_stud <- studres(mlr.model)

# Make a new column with the leverage
house$x_leverage <- ols_leverage(mlr.model)

# Make a new column with the Cook's distance (cooks.distance comes from base R's stats package)
house$cooks.distance <- cooks.distance(mlr.model)

To view the table, type View(house) (or whatever the name of your table is) into the console, or click on its name in the environment tab.

1.24 Tutorial 7D Prediction and Confidence Intervals in MLR

These are again very similar, but we need to add in our other predictors in the new data.

Importantly when you make your “newdata” data frame, it needs to have values for all the predictor columns in your model.

e.g. let’s look at the error bars on the average estimated price of a house with these parameters…

newdata <- data.frame(House.Age=20, Distance.Station=500, Number.Shops=3)
# See ?predict.lm for more options
predict(mlr.model,newdata=newdata,interval="confidence",level=0.95)
##        fit     lwr      upr
## 1 39.12293 37.8721 40.37377

So we are 95% sure that the AVERAGE price of a house with these features in the underlying population is between 37.87 and 40.37. Unlike me, at this point you would also remember to add the units.

To predict the range of prices of a house that was not in our sample, it’s again very similar but now we need a prediction interval.

newdata <- data.frame(House.Age=20, Distance.Station=500, Number.Shops=3)
# See ?predict.lm for more options
predict(mlr.model,newdata=newdata,interval="prediction",level=0.95)
##        fit      lwr      upr
## 1 39.12293 20.89395 57.35191

So we are 95% sure that the price of a house that is 20 years old, 500m from the station and near 3 shops is between 20.89 and 57.35.