The aim of this lab is to start looking at tables of data and to apply z-scores and t-tests to frost days in the USA.
We will also be creating a lab script template that makes life easier in the future. As before, much of this is based on tutorials that I will point you to as we go through. By the end of this lab you will be able to set up a reusable lab template, load and summarise tabular data, calculate z-scores and make a quick map.
Assignment 2 is due on Canvas by midnight the night before your next lab. Your job is to submit the requirements on this page; see the grading section at the bottom or go to Canvas for the assignment guidelines.
Need help? Add a screenshot/question to the discussion board here: LAB 2 DISCUSSION BOARD
Setting up your markdown files takes time and can get repetitive. Making a template skips a lot of future set-up.
To do this, we are going to follow these steps:
First, we want to set up R in the same way as Lab 1, creating a project file and a blank markdown document:

1. Create a new project called GEOG364_Lab2_PROJECT and check that R-Studio is now working inside the GEOG364_Lab2_PROJECT folder.
2. Install the readxl, usmap, viridis and ggstatsplot packages.
3. Create a blank markdown document. Give it the title GEOG-364 Lab 2 and set the author name to your USER-ID, e.g. hlg5155.
4. Save the file as GEOG364_Lab2_userID_CODE.Rmd, e.g. for me GEOG364_Lab2_hlg5155_CODE.Rmd.
Your R-Studio screen should look something like this. IF NOT, STOP AND TALK TO A TEACHER.
Our output still looks pretty boring when we knit. Work through Markdown Tutorial 4C, D and E to understand and edit your YAML code.
Use the tutorial to change your YAML code to match the text in the tutorial. At a minimum, your YAML header should now include the options covered there.
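For example, a minimal header might look something like this (the flatly theme and the table-of-contents options here are just one choice; pick your own from the tutorial):

```yaml
---
title: "GEOG-364 Lab 2"
author: "hlg5155"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: flatly
    toc: true
    toc_float: true
---
```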
We downloaded & installed many libraries last week. As discussed, even though they are downloaded, we need to load them EVERY TIME we want to use them (in the same way you click on an app on your phone before you can use it).
It's best to do this at the start of each script, so let's do this now.
Make a new level-1 heading called Load Libraries, e.g. # Load Libraries. Press enter a few times, then make a new code chunk containing this code:
library(tidyverse)
library(sp)
library(sf)
library(readxl)
library(skimr)
library(tmap)
library(USAboundaries)
library(viridis)
Press knit. You should see that all the “welcome text” has come back, making your report look unprofessional.
Let’s remove this by editing our code chunk options.
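A sketch of one common fix, assuming the welcome text comes from loading the libraries: add the message and warning options to the chunk header so that start-up messages are hidden when you knit.

````
```{r libraries, message = FALSE, warning = FALSE}
library(tidyverse)
```
````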
Your lab script should now look like this, but with your theme and YAML options of choice (you might have a few different libraries than in my screenshot). You should also be able to knit it successfully. If not, go back and do the previous sections!
You can use steps A1-A5 for every new lab to set up your lab script. But it is much easier to save the file we have just created as a template, then in future labs we can just make a copy. To do this quickly:
1. Save your file, GEOG364_Lab2_userID_CODE.Rmd (with your ID).
2. Make a copy of it and rename the copy GEOG364_TEMPLATE_userID_CODE.Rmd. In future labs you can copy this template and rename it again, rather than starting from scratch.

Now, the lab! This section is based on data from Chapter 6 of the textbook. Specifically, we are going to conduct some exploratory data analysis on average last spring frost dates across the South East USA.
To do this, we are going to follow these steps:
Before we touch any data, it’s important to start with words. We need to summarise what we already know about the dataset, the population under study and any important context.
The aim is to analyse the “average last frost dates” recorded by weather stations across the South Eastern USA, i.e. on average, what day of the year sees the final frost. See Chapter 6 of the textbook for a brief summary.
The textbook (and we) are using data obtained from this dissertation to assess the spatial distribution of average spring frost dates: Parnell, 2005, “A Climatology of Frost Extremes Across the Southeast United States, 1950–2009”:
https://www.proquest.com/openview/d5a7301f0cbe941ead48c96888f791b8/1?pq-origsite=gscholar&cbl=18750&diss=y
Make a new level-1 heading called Last Spring Frost and press enter a few times. (I'm going to stop saying "press enter a few times" or "add a blank line" now. Keep doing it!) Note: there is a spell checker next to the Knit button at the top of the script; use it, and press Knit regularly to check it all looks good.
Go to your Lab 2 folder. Double click frostdata.xlsx to open it in Excel and take a look at the data.
We now want to load this into R.
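Create a new code chunk to do this. A minimal sketch, assuming frostdata.xlsx is sitting in your Lab 2 project folder:

```r
# read the spreadsheet into R as a data.frame called frost
# (readxl is already loaded at the top of the script)
frost <- read_excel("frostdata.xlsx")
```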
You should see a spreadsheet/table/data.frame with columns including Longitude, Latitude, Elevation, Dist_to_Coast, State, Type_Fake and Avg_DOY_SpringFrost.
When I say Day-Of-Year, I mean a number from 1-365 representing the month/day of the year, e.g. Jan-1:1, Jan-2:2… Jan-31:31, Feb-1:32… Dec-31:365. We use this number instead of the month/day because it's easier to analyse.
Make sure the data is saved to a variable called frost; you can look at it any time by running View(frost). Then make a new level-2 heading called Summary Statistics, e.g. ## Summary Statistics.
- Create a new code chunk. Apply the glimpse() command to the frost variable.
- Create another code chunk. Apply the skim() command to the frost variable.
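Those two chunks would contain something like this:

```r
glimpse(frost)  # one row per column: its name, type and the first few values
skim(frost)     # summary statistics for every column, including mini histograms
```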
In the text below the code chunks, use your knowledge from your reading, this lab script and the skim command output to summarise the dataset for someone who has never seen it before (e.g. in FULL SENTENCES).
Sometimes we want to deal with only one specific column in our spreadsheet/dataframe, for example applying the mean, standard deviation or inter-quartile range commands to just the Dist_to_Coast column.
To do this, we use the $ symbol. For example, here I’m simply selecting the data in the elevation column only and saving it to a new variable called elevationdata.
elevationdata <- frost$Elevation
Try it yourself. You should have seen that as you typed the $, it gave you all the available column names to choose from.
This means we can now easily summarise specific columns. For example:
- summary(frost) will create a summary of the whole spreadsheet,
- summary(frost$Longitude) will only summarise the Longitude column,
- mean(frost$Dist_to_Coast) will take the mean of the Dist_to_Coast column in the frost dataframe.

In a new code chunk, apply the min command to the Dist_to_Coast column of the frost dataframe. In another code chunk, apply the median command to the Avg_DOY_SpringFrost column of the frost dataframe (bonus: explain what the median is). A sketch of these chunks is below.
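A minimal sketch of those two chunks, using the column names from the spreadsheet:

```r
min(frost$Dist_to_Coast)           # the smallest distance to the coast
median(frost$Avg_DOY_SpringFrost)  # the middle value of the sorted last-frost dates
```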
Make a new level-2 sub-heading called Group statistics.

Sometimes we want to count the occurrences of some category in our dataset. For example, looking at the frost dataset, it might be interesting to know how many stations are in each US state. To do this, we use the table command, which counts how many rows of our data.frame/spreadsheet fall into different groups. See Tutorial 7B (and some online tutorials) for how to use it.

In a new code chunk, use the table command to count the State column vs the Type_Fake column.
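For example, this counts how many stations fall into each State/Type_Fake combination:

```r
table(frost$State, frost$Type_Fake)
```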
What if we want to do more than just count the number of rows? Well, we can use the group_by() and summarise() commands and save the answer to a new variable.
See Tutorial 7C for a worked example.
In a new code chunk, group the frost data by the State column, summarise a column of your choice (e.g. the mean of Avg_DOY_SpringFrost), and save the answer to a new variable called frost.summary.state.
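A sketch of what that chunk might look like; the summary column name mean.frost.doy is my own choice:

```r
# (dplyr/tidyverse is already loaded at the top of the script)
frost.summary.state <- frost %>%
  group_by(State) %>%                                    # one group per US state
  summarise(mean.frost.doy = mean(Avg_DOY_SpringFrost))  # average last-frost day per state

frost.summary.state
```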
Sometimes it's just nice to visualise a distribution using a histogram. You can see tiny mini ones for each variable/column in the output of the skim command, but let's make something more professional.
For many applications, we also want to assess whether our data follows a specific distribution (for example, normal or exponential). We can do this using a plot called a QQ-plot (quantile-quantile plot) and statistical tests such as the Shapiro-Wilk test for normality.
This is covered in Tutorial 9A.
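A minimal sketch of the ideas, assuming we want to check the last-frost dates (the column choice is just an example):

```r
# histogram of the average last spring frost day-of-year
hist(frost$Avg_DOY_SpringFrost, breaks = 20,
     main = "Average last spring frost", xlab = "Day of year")

# QQ-plot: points falling near the line suggest an approximately normal distribution
qqnorm(frost$Avg_DOY_SpringFrost)
qqline(frost$Avg_DOY_SpringFrost)

# Shapiro-Wilk test: a small p-value is evidence AGAINST normality
shapiro.test(frost$Avg_DOY_SpringFrost)
```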
Make a new level-2 heading called Z-Score mapping.
In the lecture, we talked about finding z-scores (e.g. how unusual a value is AKA how many standard deviations away from the mean). Let’s calculate a new column in our dataset with the z-score of the last spring frost date.
Create a new code chunk and copy/run this code.
mean.frost <- mean(frost$Avg_DOY_SpringFrost)
sd.frost <- sd(frost$Avg_DOY_SpringFrost)
frost$ZScore <- (frost$Avg_DOY_SpringFrost - mean.frost) / sd.frost
Hopefully you can see that we’re just applying the Z-score equation to each row of the Avg_DOY_SpringFrost column of the frost dataset:
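$$ z_i = \frac{x_i - \bar{x}}{s} $$

where $x_i$ is each row's Avg_DOY_SpringFrost value, $\bar{x}$ is the mean (mean.frost) and $s$ is the standard deviation (sd.frost).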
Have a look at the frost table itself (e.g. click on its name in the Environment tab), or look at the summary; you should see the new column. Essentially, the ZScore column now contains a normalised value of how unusually early or late each location's last frost is.
Make a new level-2 sub-heading called Mapping.
So far, we have ignored the fact our data has a location! We will cover this in more detail in lab 3, but for now, let’s make a quick map. To do this, we need to make R realise that our data contains spatial coordinates.
Create a new code chunk and copy/run this code. We will discuss it in detail next week.
# crs = 4326 tells R the coordinates are longitude/latitude (WGS84)
frost.sf <- st_as_sf(frost, coords = c("Longitude", "Latitude"), crs = 4326)
The qtm() command (“quick thematic map”) allows us to make quick interactive plots.
Create a code chunk and copy/run this code. It should make an interactive map of the Elevation column. Tweak the code to map your z-scores AKA recreate the map in Chapter 6 of the textbook; see the sketch below the code.
tmap_mode("view")
qtm(frost.sf, dots.col = "Elevation")
Here, dots.col = "Elevation" picks out the elevation column to plot.
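To map the z-scores instead, the only tweak needed is the column name (assuming you kept the name ZScore from the earlier chunk):

```r
qtm(frost.sf, dots.col = "ZScore")
```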
Now it's your turn!
The other file in your Lab 2 folder should be called NewYork.xlsx. It contains real Airbnb price data across New York City for 2021, from this source (http://insideairbnb.com/get-the-data.html - I did some quality control).
Use what you have learned to read the data into R, explore it and tell me what you find.
DO NOT SPEND ALL DAY ON THIS!
Set a timer for 45 mins, turn off the internet and see how far you get. That is enough (if you do more, that's fine, but about 45 mins of work (MAX) is all I expect here).
Hint: follow the same steps as above, i.e. read the data into a variable called NewYork and then explore it with the commands you have learned. A sketch of the first step, assuming the file is in your Lab 2 project folder:
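```r
# read in the Airbnb data and get a first overview
NewYork <- read_excel("NewYork.xlsx")
skim(NewYork)
```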
This is brand new data that I have not looked at myself. There will be many interesting stories in it, so try to find your own path through the data (e.g. try to take a different route to anyone you work with closely)
Good luck!
Remember to save your work throughout and to spell check your writing (the button to the left of Knit). Now press the Knit button again. If you have not made any mistakes in the code, R should create an html file in your Lab 2 folder which includes your answers. If you look at your Lab 2 folder, you should see it there, complete with a very recent time-stamp.
In that folder, double click on the html file. This will open it in your browser. CHECK THAT THIS IS WHAT YOU WANT TO SUBMIT
Now go to Canvas and submit BOTH your html and your .Rmd file in Lab 2.
You lose points on a sliding scale for things that don't work or are missing.
HTML FILE SUBMISSION: 7 MARKS
RMD CODE SUBMISSION: 7 MARKS
MARKDOWN/CODE STYLE: 16 MARKS
MARKDOWN LEVEL-UP (YOU MADE YOUR YAML WORK): 10 MARKS
FROST DATE: ALL THE WORDS: 10 MARKS
FROST DATE: SUMMARY STATISTICS: 10 MARKS
FROST DATE: TABLES & SUMMARIES BY GROUP: 10 MARKS
FROST DATE: PLOTS: 10 MARKS
NEW YORK: EXPLORATION: 20 MARKS
[100 marks total]
Overall, here is what your lab should correspond to:
Grade | % Mark | Rubric |
---|---|---|
A* | 98-100 | Exceptional. Not only was it near perfect, but the graders learned something. THIS IS HARD TO GET. |
NA | 96+ | You went above and beyond |
A | 93+ | Everything asked for with high quality. Class example |
A- | 90+ | The odd minor mistake, e.g. all code done but not written up in full sentences. A little less care |
B+ | 87+ | More minor mistakes. Things like missing units, getting the odd question wrong, no workings shown |
B | 83+ | Solid work but the odd larger mistake or missing answer. Completely misinterpreted something, that type of thing |
B- | 80+ | Starting to miss entire questions/sections, or making multiple larger mistakes. Still a solid attempt. |
C+ | 77+ | You made a good effort and did some things well, but there were a lot of problems. (e.g. you wrote up the text well, but messed up the code) |
C | 70+ | It’s clear you tried and learned something. Just attending labs will get you this much as we can help you get to this stage |
D | 60+ | You attempt the lab and submit something. Not clear you put in much effort or you had real issues |
F | 0+ | Didn’t submit, or incredibly limited attempt. |