Learning objectives

When you finish your studies at Penn State, I hope you will be well equipped to face the world ahead. One of those skills will be choosing the best wine :) We are going to look at this problem as a class - although you will all have different training data.

The learning objectives are to put together what you have learned so far and to make your own reference guide on regression for the future. We will also look at model fitting.

There is a TEAMS discussion for lab help.

Here is the canvas page: https://psu.instructure.com/courses/2174925/assignments/

Assignment 8 has TWO deadlines.
Before the final lecture (April 29), you must upload your model parameters to Canvas.
The final report is due on the Wednesday of exams week.

I PROVIDE HELP UNTIL THE END OF THE SEMESTER (April 29th). Exams week is for your own finishing up.


RED wine prediction

This lab will examine the topic of Red wine quality from the Vinho Verde vineyard, from the Minho (northwest) region of Portugal. This wine accounts for 15% of total Portuguese production. Data was collected from May/2004 to February/2007 using only protected designation of origin samples that were tested at the official certification entity (CVRVV). The CVRVV is an inter-professional organization with the goal of improving the quality and marketing of Portuguese wine.

Most data was recorded by a computerized system (iLab), which automatically manages the process of wine sample testing from producer requests to laboratory and sensory analysis. Physicochemical laboratory tests routinely used to characterize wine include determination of density, alcohol or pH values. If you wish to know more about these tests you can see detailed info here: https://www.oiv.int/public/medias/2468/oiv-ma-as2-01b.pdf

Other tests rely on human experts. To assess wine quality, each sample was evaluated by a minimum of three sensory assessors (using blind tastings), who graded the wine on a scale from 0 (very bad) to 10 (excellent).

For more details on this (real) study, consult: https://www.vinhoverde.pt/en/statistics or the reference [Cortez et al,2009 - https://link.springer.com/chapter/10.1007/978-3-642-04747-3_8].

Overall task

I wish to understand what characteristics of red wines predict high quality. Each of you has been awarded an individual training sample of 300 Red wines.

  • Unit of analysis: A single sample of Red wine from the vineyard
  • Population: All Red wines from this vineyard
  • Sample: 300 wine samples taken between 2004 and 2007
  • Response variable: ‘quality’ (see above for details)

You also have access to these variables which could potentially be used to predict quality.

  • alcohol (% Vol)
  • fixed_acidity (\(g(tartaric.acid)/dm^3\))
  • citric_acid ( low | medium | high)
  • residual_sugar (\(g/dm^3\))
  • chlorides (\(g(sodium.chloride)/dm^3\))
  • free_sulfur_dioxide (\(mg/dm^3\))
  • total_sulfur_dioxide (\(mg/dm^3\))
  • density (\(g/cm^3\)) (mass per unit volume of wine or must at 20°C)
  • pH
  • sulphates (\(g(potassium.sulphate)/dm^3\))

Your task is to use your personal sample to predict wine quality. To the best of your ability, you need to find a model that is both valid (e.g. meets LINE) and ‘fits’ the best without over-fitting.

I have a ‘new’ dataset of 200 red wines with the same variables. Once you are happy with your model, you will predict the quality of those 200 wines and upload the model and your errors to Canvas. We will then see, as a class, the range of models that you come up with.

This is real data: there is no right answer, and you will get a different result from your classmates because you each have individual training data.


Specific instructions

Set up the study

Step 1: SET-UP

  • Create a new project for Lab 8 and copy your lab template to your lab 8 folder.
  • Rename the file and open it.
  • In the library section, add a new code chunk and use the code below to load the libraries. You WILL need to install some first.
  • Hint: NEVER put install.packages in your code chunk; run it in the console instead.
  • Finally, press knit to check that the html works and your theme works.

If you struggle with any of this step, see previous labs and the tutorials for detailed help. Or ask for help.

# Load libraries
library("tidyverse") # Lots of data processing commands
library("knitr")     # Helps make good output files
library("ggplot2")   # Output plots
library("rmarkdown") # Helps make good output files
library("lattice")   # Makes nice plots
library("RColorBrewer") # Makes nice color-scales
library("skimr")     # Summary statistics
library("Stat2Data") # Regression specific commands
library("corrplot")  # correlation plots
library("GGally")    # correlation plots
library("ggpubr")    # QQplots
library("olsrr")     # Regression specific commands
library("plotly")    # Interactive plots
library("readxl")    # Read from excel files
library("equatiomatic") # extract equations
library("ggstatsplot")
library("MASS")

# NEW
library(DescTools)   # Descriptive statistics tools
library(glmnet)      # LASSO / ridge / elastic net regression

## you may need additional libraries.  Just add them to this list if you get errors.


Step 2: READ IN THE DATA

Read both your personal training data AND the test data into R. Call the training dataset TrainData and the test dataset TestData.
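A minimal sketch, assuming your files are CSVs with the hypothetical names below (adjust the names/paths to the files you were given, and use read_excel() from readxl instead if they are Excel files):

# Hypothetical file names -- replace with your actual files
TrainData <- read_csv("RedWine_Train.csv") # personal training sample (n = 300)
TestData  <- read_csv("RedWine_Test.csv")  # shared test data (n = 200)

# Quick sanity check: dimensions and column names
dim(TrainData)
names(TrainData)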


Step 3: STUDY DESCRIPTION

In the Data Description and Study Aim sections, use the first part of the teaching notes and the overview to help you write about the following:

  • The background of the study
  • A short description of the data.
  • The object of analysis
  • A reasonable population
  • The response variable
  • A bullet-point list of the other variables that could potentially be predictors.
  • A description of your sample
  • Statements to answer the questions on page 3 of the teaching notes
  • A short paragraph to summarise the study (Example for penguins on page 3 of the teaching notes)

I have given you much of this, so you simply need to summarise it in your own words, but including more about wine and the problem will gain extra marks. Essentially, make this a report you would be happy showing in a job interview. You are welcome to change the template in any way you like, but you will be graded on clarity.


Exploratory analysis

Step 4: QUALITY CONTROL. YOU SHOULD BE USING THE TRAINING DATASET.

Create a section called quality control. There are issues and errors within your training dataset which you will find as you explore the data. If you choose to remove any data points, summarise which ones you removed and why. Run all your models below on the final clean data.

Note, your response variable is integer-valued (whole numbers on the 0-10 scale), which makes histograms look a little weird. There are a few ways to approach this dataset. One would be to treat the response as categorical ordinal and use a machine learning approach or ordinal logistic regression. To use our type of classical regression instead, we need to satisfy a few things related to the LINE assumptions.

  • There is equidistant spacing between the levels. E.g. “bad” / “medium” / “good” is hard to quantify, but a numeric scale from 0-10 is just about OK.
  • Second, the distribution of the ordinal responses may be severely non-normal (e.g. people often choose 1 or 10 but never 5). In this case, the data are approximately normal (they will fail the Shapiro-Wilk test but should look approximately normal).
  • There is approximately equal variance at each level. Essentially, checking the LINE assumptions is even more important.

In our case, it would be reasonable to imagine our response as a sliding scale (e.g. slide the slider between 0 and 10) which would give a continuous number and it reasonably satisfies the assumptions above, so I am OK using this for regression. Other approaches are covered in the Cortez paper.
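As a starting point for the quality control, here is a minimal sketch using skimr and ggplot2 (both loaded above); the pH check at the end is purely illustrative, and the real issues in your data may be elsewhere:

# Overview of every variable: ranges, missing values, rough distributions
skim(TrainData)

# Distribution of the response (integer scores make the histogram look spiky)
ggplot(TrainData, aes(x = quality)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "darkred") +
  labs(x = "Quality score", y = "Count", title = "Training data: quality scores")

# Illustrative check only: rows with physically impossible pH values
dplyr::filter(TrainData, pH < 0 | pH > 14)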


Step 5: EXPLORATORY ANALYSIS. YOU SHOULD BE USING THE TRAINING DATASET.

This is exactly what you were doing in labs 1-7. Just exploring the data. See above for why it looks weird. You can also include things like correlation matrices. Any plots should look professional and you should explain any plots/results (e.g. explain to me what a correlation matrix is and any interesting results, don’t just put it in).

It might overlap with quality control, that is 100% fine.
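For example, a correlation matrix plot of the numeric variables using corrplot (loaded above) is one option among many; remember to interpret it in the text rather than just including it:

# Correlations need numeric columns, so drop any categorical predictor
# (dplyr::select is written in full because MASS masks select())
num_vars <- dplyr::select(TrainData, where(is.numeric))
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "circle", type = "upper", tl.col = "black")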


Decide on your favorite model

Step 6: MODEL BUILDING. YOU SHOULD BE USING THE TRAINING DATASET.

I INCLUDE THE CODE BELOW

It is now up to you to find the “optimum” model to predict the quality of red wine in your training dataset. To choose models, you could use your understanding of the situation, the correlation matrix, best-subsets or stepwise regression, or anything else.

Your chosen model should:

  • meet the L.I.N.E (+ multicollinearity) assumptions as much as possible,
  • have had any outliers explored, and
  • have high model accuracy compared to your other models.

You do not have to go above and beyond here (unless you want to); e.g. you do not need GLMs etc. Your report should include details of at least two models. For each model,

  • Make a sub-section and name the model
  • Create the model and include its equation
  • Assess model accuracy / “goodness of fit”
  • Assess LINE and DISCUSS EACH L.I.N.E ASSUMPTION IN THE TEXT.
  • Discuss if there are any spurious/confounding/lurking variables. E.g. are there things in your model that you don’t need, or is there something missing.

As an example, a report might look like

Model 1:

  • I chose this model through my knowledge of the situation
  • model_code
  • Does it meet L.I.N.E.
    • If not, what can I change?
    • If yes, how accurate is it?
  • State the way forward: e.g. “This model failed linearity, so I will try a transformation for model 2” or “I wonder if there are variables I missed…”

Model 2:

  • I chose this model through [reason above]
  • model_code
  • Does it meet L.I.N.E.
    • If not, what can I change?
    • If yes, how accurate is it and how does it compare to previous models.
  • State way forward for model 3…

Repeat until you are happy. To help with this, here is a summary of the code I use for each model. There are many options available in your past labs and the tutorials, so feel free to edit/change.

# You need to edit this each time; I normally change "mymodel" to a name that makes sense.
mymodel <- lm(ResponseCol ~ PredictorCol1 + PredictorCol2, data = tablename)
# Add asis=TRUE in the r curly brackets to automatically make the equation as you knit.
equatiomatic::extract_eq(mymodel,use_coefs = TRUE,coef_digits = 4)
# LINE -----------------------
# Residuals vs fitted
ols_plot_resid_fit(mymodel)

# Equal variance test (lecture 24)
lmtest::bptest(mymodel)

# Normality of residuals
ols_plot_resid_hist(mymodel)
ols_test_normality(mymodel)

# Outliers
ols_plot_resid_lev(mymodel)

# Linearity (component + residual plots)
ols_plot_comp_plus_resid(mymodel)

# multicollinearity
car::vif(mymodel)
# Summary / goodness of fit -----------------------
#summary(mymodel)
ols_regress(mymodel)
anova(mymodel)
# Comparison  -----------------------
anova(mymodel1 , mymodel2)
AIC(mymodel1, mymodel2)
# Automated model building  -----------------------
ols_step_best_subset(mymodel)


Step 7: COMPARING MODELS. YOU SHOULD BE USING THE TRAINING DATASET.

This overlaps with the previous section, but include a small section comparing at least two models using AIC and an anova F-test, as in the sketch below. Remember to write out the test fully (e.g. include H0, H1, the test statistic, the p-value and your conclusion).
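A minimal sketch of the comparison, assuming two nested models with the hypothetical names model1 and model2 (replace the formulas with your own candidates):

# Hypothetical nested models: model2 adds one predictor to model1
model1 <- lm(quality ~ alcohol, data = TrainData)
model2 <- lm(quality ~ alcohol + sulphates, data = TrainData)

# Nested F-test: H0 = the extra predictor adds nothing, H1 = it improves the fit
anova(model1, model2)

# AIC comparison: the lower value is preferred
AIC(model1, model2)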

Step 8a: SUMMARISE FAVOURITE MODEL. YOU SHOULD BE USING THE TRAINING DATASET.

Finally, briefly summarise your final choice of model, including:

  • The model summary from R
  • The equation
  • Explain to me which predictors have the biggest impact on quality and which are most/least significant.
  • If there are categorical predictors in your model, explain what they mean.
  • A sentence or two to summarise whether it met the LINE assumptions.
  • The adjusted R^2

Step 8b: SEE HOW GOOD YOUR MODEL IS AGAINST ITSELF (e.g. another way of plotting residuals)

We can see how good your model is by plotting your predicted quality against the actual quality.

Add a column of the modelled values for your training data using the commands below, replacing ‘FINALMODEL’ with the name of your favorite model.

TrainData$quality_predicted  <- FINALMODEL$fitted.values
TrainData$residual           <- TrainData$quality - TrainData$quality_predicted

RMSE_training <- sqrt(mean(TrainData$residual^2))

Make a professional plot of quality vs predicted quality from your training data.
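One possible starting point using ggplot2 (loaded above); treat it as a sketch to polish, where the dashed 1:1 line shows where perfect predictions would fall:

ggplot(TrainData, aes(x = quality, y = quality_predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed quality", y = "Predicted quality",
       title = paste("Training data: RMSE =", round(RMSE_training, 3)))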


Predicting a completely independent dataset

Step 9: PREDICTING AN INDEPENDENT DATASET. YOU SHOULD NOW USE THE TEST DATASET.

I have withheld 200 wines from the full dataset and set them aside as my “test data”. They come from the same population. We are now going to see how your model performs on this new data.

As well as creating confidence and prediction intervals, we can use the predict command to predict many new values at once. We can summarise the size of the prediction errors using the Root Mean Squared Error of prediction (RMSE).

Assuming you read in your TestData in step 2:

# THIS IS FOR THE NEW TESTDATA
TestData$quality_predicted <- predict(FINALMODEL ,newdata=TestData)
TestData$Residual   <- TestData$quality - TestData$quality_predicted

RMSE_test <- sqrt(mean(TestData$residual^2))

Make a professional plot of quality vs predicted values and state the RMSE for your test dataset. Comment on how well this model appears to fit the test data.
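The same plotting pattern as in Step 8b works here; a minimal sketch:

ggplot(TestData, aes(x = quality, y = quality_predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Observed quality", y = "Predicted quality",
       title = paste("Test data: RMSE =", round(RMSE_test, 3)))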

Step 10: Upload values. GO TO CANVAS

See the Canvas quiz. For your favourite model, enter the value of each model coefficient, plus your final RMSE FOR THE TEST DATA. We will see the range of values obtained across the class.
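To read the coefficient values off your model, you can print them directly (assuming your final model is stored as FINALMODEL, as in the code above):

# Named vector of the intercept and slope estimates
coef(FINALMODEL)

# Or the full coefficient table with standard errors and p-values
summary(FINALMODEL)$coefficients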

“Show me something new”, beyond MLR

This is worth 5 marks as always.

One of the problems with classical linear regression is that it tends to overfit to the training sample. In this new world of powerful computers, there are now several techniques that go beyond classical linear regression, for example LASSO, ridge regression and elastic net. These penalise the model's coefficients so that variables with a small effect are shrunk towards zero (LASSO can remove them entirely).

Here are two great tutorials about LASSO, ridge and elastic net.

Here is the code to run LASSO on your training data. Explain what LASSO is, see if you can get this up and running and write how it differs from the “best” model you found.

library(glmnet)

# define response variable
y <- TrainData$quality

# define the predictor columns by name (the same columns exist in TestData)
pred_names <- c("alcohol", "fixed_acidity", "citric_acid", "residual_sugar",
                "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
                "density", "pH", "sulphates")

# define matrix of predictor variables
# (data.matrix converts any categorical column, e.g. citric_acid, to numeric codes)
x <- data.matrix(TrainData[, pred_names])

# perform k-fold cross-validation to find the optimal lambda value
cv_model <- cv.glmnet(x, y, alpha = 1)

# the optimal lambda value minimises the cross-validated MSE
best_lambda <- cv_model$lambda.min

# refit the model at the best lambda and inspect its coefficients
LASSO_best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)

print("LASSO model coefficients")
coef(LASSO_best_model)

# predict the test data
testx <- data.matrix(TestData[, pred_names])

TestData$LASSOpredict  <- as.numeric(predict(LASSO_best_model, s = best_lambda, newx = testx))
TestData$ResidualLASSO <- TestData$quality - TestData$LASSOpredict

RMSE_LASSO <- sqrt(mean(TestData$ResidualLASSO^2))
print("RMSE of test data")
RMSE_LASSO


Submitting your Lab

Press the knit button for the final time. If you have not made any mistakes in the code then R should create a html file in your lab 8 folder which includes your answers. If you look at your lab 8 folder, you should see this there - complete with a very recent time-stamp. In that folder, double click on the html file. This will open it in your browser. CHECK THAT THIS IS WHAT YOU WANT TO SUBMIT.


Grading Rubric/checklist

See the table below for what this means - 100% is hard to get!

HTML FILE SUBMISSION - 10 marks

RMD CODE SUBMISSION - 5 marks

Professional report 15 MARKS

Full marks for a report that you would be happy to take into a job interview. You have done things like: fully labelled plots using words, tried more sophisticated plots than just the basics, written full paragraphs/sentences, used equation formatting and sub-headings, used a spell checker, explained results in clear language, included units, and used a theme and a table of contents. You lose marks for each thing that makes the report look unprofessional.

Describe the data and EDA - 15 MARKS

You have explored the data using the guide, conducted quality control where you removed the observation that was clearly wrong, and written up your results clearly. You have created the correlation matrix plot and sensibly described the relationships between your response and your predictors.

Modelling - 30 MARKS

You clearly created, summarised and assessed each model. It was clear what your strategy was for a favourite.

Favorite and comparison - 20 MARKS

You properly compared the models, made your prediction and uploaded your results.

LASSO - 5 MARKS

You got the code running and explained what LASSO was and your results.


[100 marks total]

Overall, here is what your lab should correspond to:

Grade  % Mark  Rubric
A*     98-100  Exceptional. Not only was it near perfect, but the graders learned something. THIS IS HARD TO GET.
NA     96+     You went above and beyond.
A      94+     Everything asked for, with high quality. A class example.
A-     90+     The odd minor mistake; all code done but not written up in full sentences etc. A little less care.
B+     87+     More minor mistakes: things like missing units, getting the odd question wrong, no workings shown.
B      84+     Solid work, but the odd larger mistake or missing answer (completely misinterpreted something, that type of thing).
B-     80+     Starting to miss entire questions/sections, or multiple larger mistakes. Still a solid attempt.
C+     77+     You made a good effort and did some things well, but there were a lot of problems (e.g. you wrote up the text well but messed up the code).
C      70+     It is clear you tried and learned something. Just attending labs will get you this much, as we can help you get to this stage.
D      60+     You attempted the lab and submitted something, but it is not clear you put in much effort, or you had real issues.
F      0+      Didn't submit, or an incredibly limited attempt.