Chapter 3 Lab 2
3.1 General information
Welcome to STAT-462 lab 2. The aim of this week is to:
- Learn how to get help
- Create more professional markdown files
- Run some more statistical tests
- Conduct a more free ranging exploratory data analysis
The support you need to understand Lab 2 is in the tutorials - READ TUTORIALS 2A, 2B, 2C AND 2D.
General comments
You do not have to submit things you try from the tutorials; you will only be graded on the Lab 2 assignment.
Future labs will be less prescriptive about the formatting - but this is a detailed tutorial so that you learn how to set it all up well. Note, your answers should be written up in full sentences.
If running the labs is causing major problems for your computer, or you have any other computer issue, talk to Dr Greatrex. We can fix this, and there are other options for you to access R online. In general, please reach out to Dr Greatrex if you have any issue at all - we have likely seen the errors hundreds of times before and we are happy to help.
There is also a canvas discussion board for this lab which will be the fastest place to get an answer. https://psu.instructure.com/courses/2115020/discussion_topics/13706472
3.2 Tutorials
This lab contains questions on markdown formats, hypothesis tests and exploratory data analysis. I encourage you to go and look at Tutorials 2A, 2B, 2C and 2D. All the answers are in there (for a different dataset).
3.3 Setting up your code
Save all your work and, if you haven’t already, create a Lab 2 folder in your STAT-462 folder.
Create a new R-project called Lab 2 that is linked to your Lab 2 folder (instructions in section 1.2, “Lab Challenge 1”). It will probably open a new version of R-Studio.
Create a new Markdown file called “username_Lab2” (for me it will be hlg5155_Lab2)
Remove the “friendly text” (see Tutorial 1E, section 1.5.3 if you have no idea what I mean)
3.4 Markdown formatting and YAML
Read through tutorial 2B
Follow the instructions to edit your YAML code to look like my example in Section 1.7.1.
Choose a new theme for your lab script! Section 1.7.1 includes a link to different themes.
Leave a blank line under your YAML code, then create a level-1 Heading called “Introduction”. Save and make sure that it will preview/knit.
Leave a blank line under that then write something new you learned about R or statistics this week. Use bold and italic fonts in your answers.
Make a new code chunk. Inside here, you can add all the libraries you need (if you do not have them, you can install them using the tutorial from last week). Start by entering these, but if you need any more packages you can add them here and rerun the code chunk. Run the code chunk a few times to make sure they load with no errors.
3.5 Confidence intervals
Tutorial 2D and your lecture notes will be helpful here. Along with the tutorial, this is a good example of using R to calculate confidence intervals: https://www.cyclismo.org/tutorial/R/confidence.html
Leave a blank line after the code chunk, then make a new level-1 heading called “Confidence Intervals”.
Leave a blank space (I’ll stop now but you get the idea), and make a Level 2 heading called “Question A”.
Answer question A below, making sure to use full sentences in your conclusions. If it helps to make your report easier to read, feel free to include the question text.
Question A: A sample of 36 obese rock-hopper penguins in a zoo were put on a special diet for a year. The average weight loss was 11 oz and the standard deviation of the weight loss was 19 oz. (Note that a positive weight loss implies reduced weight over time.)
Either by hand or in R, calculate the 99% confidence interval for the true mean weight reduction. Make sure to show your workings or R code.
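If you choose to work in R, the general recipe for a confidence interval from summary statistics can be sketched as below. This is only a minimal sketch: `mean_ci` is a hypothetical helper name (not a built-in R function), it assumes the t-distribution is appropriate, and the numbers are made up for illustration rather than taken from the penguin question.

```r
# Confidence interval for a mean from summary statistics, using the t-distribution.
# mean_ci() is a hypothetical helper name, not part of base R.
mean_ci <- function(xbar, s, n, conf = 0.95) {
  alpha  <- 1 - conf
  se     <- s / sqrt(n)                     # standard error of the mean
  t_crit <- qt(1 - alpha / 2, df = n - 1)   # two-sided critical value
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Made-up illustrative numbers (NOT the penguin data)
example_ci <- mean_ci(xbar = 50, s = 10, n = 25, conf = 0.95)
example_ci
```

Changing `conf` changes the critical value, so a higher confidence level gives a wider interval for the same data.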
- Make a Level-2 heading called Question B and answer question B below.
Question B: Based on the interval you calculated above, do you have sufficient evidence at the 99% confidence level to believe that the weight-loss programme is working and the penguins are losing weight? The average penguin actually weighs about 3 kg. Is this diet something you would recommend for meaningful weight loss?
3.6 Hypothesis testing
Tutorial 2D will be helpful here along with your lecture notes.
Make a new level-1 heading called Hypothesis Testing, then make a Level 2 heading called Question C - and answer:
Question C: Tests are being carried out on a new drug designed to relieve the symptoms of the flu, specifically on the number of hours people can sleep. The new drug is given in tablet form one evening to a random sample of 16 people who have colds. The number of hours they sleep may be assumed to be Normally distributed and is recorded below.
There is also a very large control group of people who have colds but are not given the drug. The mean number of hours they sleep is 6.6 hrs.
Hours slept by people given the new drug:
2.3, 5, 6, 6.4, 6.7, 6.7, 6.9, 7.2, 7.2, 7.4, 7.6, 7.8, 7.9, 8.1, 8.1, 9.7
You can enter the sleep data into R using this code.
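One way to enter the 16 values listed above is as a numeric vector. The variable name `sleep_hours` is an assumption chosen for this sketch; any valid R name would work.

```r
# Hours slept by the 16 people given the new drug (values copied from the list above).
# The name sleep_hours is hypothetical; choose whatever name you prefer.
sleep_hours <- c(2.3, 5, 6, 6.4, 6.7, 6.7, 6.9, 7.2,
                 7.2, 7.4, 7.6, 7.8, 7.9, 8.1, 8.1, 9.7)

length(sleep_hours)   # check that all 16 values were entered
```

Checking the length straight away is a cheap way to catch a missed or duplicated value.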
By hand, carry out a hypothesis test at the 1% significance level of whether the drug has any impact on the length of time people sleep. You can use R as a calculator to get things like the mean. Make sure to include:
- Your H0 and H1
- The critical threshold
- Whether it is one sided or two sided
- Whether you choose to use the normal or t-distribution
- A diagram of the distribution split into the acceptance and critical/rejection zones
- The calculated test statistic
- and your conclusions
Include a screenshot of your [neat] workings in this report. If you can’t remember how to do this, see step 10 in section 2.6 (Lab-1 Challenge 3).
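If you would rather use R as a calculator than a printed table when looking up the critical threshold, the quantile functions `qt()` and `qnorm()` can help. The sketch below uses a made-up significance level and sample size, not the values from Question C, and the variable names are assumptions for illustration.

```r
# Critical values for a test at significance level alpha.
# alpha and n below are illustrative only (NOT the Question C values).
alpha <- 0.05
n     <- 10

t_two_sided <- qt(1 - alpha / 2, df = n - 1)   # two-sided test, t-distribution
t_one_sided <- qt(1 - alpha,     df = n - 1)   # one-sided (upper tail) test
z_two_sided <- qnorm(1 - alpha / 2)            # two-sided test, normal distribution

c(t_two_sided = t_two_sided, t_one_sided = t_one_sided, z_two_sided = z_two_sided)
```

For a two-sided test the rejection zone is split between both tails, which is why `alpha / 2` appears in the quantile.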
- Make a Level 2 heading called Question D and answer question D below.
Question D: Use R and the t.test command to calculate the t-test for the data above. Comment on whether your two results agree (e.g. did you make a mistake anywhere).
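As a rough guide to the syntax, a one-sample `t.test` call looks like the sketch below. The data vector `x` and the null value `mu0` are made up for illustration (they are not the sleep data), and the last lines show one possible way to compare R's test statistic with a by-hand calculation.

```r
# One-sample t-test on made-up illustrative data (NOT the sleep data).
x   <- c(5.1, 4.8, 6.2, 5.9, 5.5, 6.0, 4.9, 5.7)
mu0 <- 5   # hypothesised mean under H0 (illustrative value)

result <- t.test(x, mu = mu0, alternative = "two.sided", conf.level = 0.95)
result

# The same test statistic calculated "by hand"
t_by_hand <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# TRUE if the two approaches agree
all.equal(unname(result$statistic), t_by_hand)
```

The `alternative` and `conf.level` arguments should match the sidedness and significance level you chose in your by-hand test.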
3.7 Exploratory data analysis
Tutorial 2C will be useful here.
Make a new level-1 heading called Exploratory Data Analysis. You are now going to explore the Credit dataset, which is a simulated dataset used for educational purposes. To start out, make a new code chunk and include the following commands.
Question E: Use & interpret the output of R to describe the dataset to me. How much data is there and what does the dataset show? What variable names are there?
Note, the help file is incorrect for this dataset. There are not 10,000 rows of data so you’ll need to work out how big the dataset really is.
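The commands below are one possible way to check the real size and contents of a data frame. This sketch deliberately uses `ToothGrowth`, a dataset that ships with base R, as a stand-in; the same functions should work on the Credit data once it is loaded.

```r
# Inspecting a data frame, shown on the built-in ToothGrowth data (a stand-in, NOT Credit).
data("ToothGrowth")

dim(ToothGrowth)     # number of rows and columns
names(ToothGrowth)   # variable (column) names
str(ToothGrowth)     # type of each column plus the first few values
head(ToothGrowth)    # first six rows
```

`dim()` reports what is actually loaded in memory, so it is a safer guide to the dataset's size than the help file.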
Question F: Create a table of the number of students vs non-students (e.g. a table of the Student column - Section 2.3.6).
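A frequency table of a categorical column can be made with `table()`. The sketch below again uses the built-in `ToothGrowth` data as a stand-in, counting its `supp` column; for the lab you would point it at the Student column instead.

```r
# Counting the categories in one column (built-in ToothGrowth used as a stand-in).
supp_counts <- table(ToothGrowth$supp)
supp_counts

# The same counts expressed as proportions of all rows
prop.table(supp_counts)
```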
Question G: Choose a single numeric variable/column in the dataset (your choice). Tell me its summary statistics and units, and include a professional-looking plot of its distribution. Using this and a Shapiro-Wilk test (including your H0, H1, conclusions etc.), assess whether that variable is normally distributed.
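The workflow for a single numeric column might look like the sketch below, shown on the `len` column of the built-in `ToothGrowth` data as a stand-in. The plot title and axis label are assumptions for this example; you would replace them with text that describes your own variable and its units.

```r
# Summary statistics, a histogram and a Shapiro-Wilk test for one numeric column
# (built-in ToothGrowth used as a stand-in, NOT the Credit data).
len_values <- ToothGrowth$len

summary(len_values)   # min, quartiles, median, mean, max
sd(len_values)        # standard deviation

hist(len_values,
     main = "Distribution of tooth length",   # illustrative title
     xlab = "Tooth length",                   # illustrative axis label
     col  = "grey80")

# H0: the data come from a normal distribution; H1: they do not
sw_result <- shapiro.test(len_values)
sw_result
```

A small p-value from `shapiro.test()` is evidence against normality; a large one means you cannot reject H0, which is not the same as proving the data are normal.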
Question H (OPTIONAL): [BONUS 2%, lab capped at 100%] Subset/filter the data to select only rows containing students (Section 2.3.5). Find how the mean of your variable has changed.
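One way to filter rows in base R is logical indexing with square brackets. The sketch below keeps only the `"OJ"` rows of the built-in `ToothGrowth` data as a stand-in for selecting students; the object name `oj_only` is hypothetical.

```r
# Keeping only the rows that match a condition (built-in ToothGrowth as a stand-in).
oj_only <- ToothGrowth[ToothGrowth$supp == "OJ", ]   # note the comma: all columns are kept

nrow(oj_only)            # how many rows survived the filter
mean(ToothGrowth$len)    # mean of the variable for all rows
mean(oj_only$len)        # mean of the variable for the subset only
```

Comparing the two means side by side shows how much the subset differs from the full dataset.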
3.8 Submitting Lab 2
Remember to save your work throughout and to spell check your writing (next to the save button). Now, press the knit button again. If you have not made any mistakes in the code then R should create a new html file which includes your answers. This can be found in your Lab 2 folder (see Section 1.7 for where)
Check your html is complete by double-clicking on it to open it in your web browser.
Now go to Canvas and submit BOTH your html and your .Rmd file in Lab 2.
3.8.1 Lab 2 submission check
HTML FILE SUBMISSION - 5 marks
RMD CODE SUBMISSION - 5 marks
MARKDOWN/CODE STYLE - 5 MARKS
LOOK AT YOUR HTML FILE IN YOUR WEB BROWSER BEFORE YOU SUBMIT. You included the headings/sub-headings as requested. Your code and document are neat and easy to read. There is also a spell check next to the save button.
WRITING STYLE - 5 MARKS
You have written your answers below the relevant code chunk in full sentences in a way that is easy to find and grade. You included the bold/italic R fact.
QUESTION A - 10 MARKS Penguin confidence interval
You answered the question correctly, either by hand or in R, and you showed your workings/code.
QUESTION B - 10 MARKS Penguin confidence interval interpretation
You interpreted the questions correctly
QUESTION C - 20 MARKS
You accurately conducted the hypothesis test with all the requested steps included and managed to upload the screenshot.
QUESTION D - 5 MARKS
You reproduced your t-statistic, p-value & conclusion in R.
QUESTION E - 5 MARKS
You described the data in a way that someone who had never seen the data before would be able to understand what it shows and how big it is
QUESTION F - 5 MARKS
You created the summary table and interpreted the output
QUESTION G - 20 MARKS
You summarised a variable of your choice and included all the requested statistics/plots.
[100 marks total]