Project 1 - PART 1

Introduction

Guardian Health is a new start-up company that is preparing to enter the US health insurance market in 2026. They need to set competitive yet sustainable premiums for their plans and customers across the United States, but because they are new, they lack historical claims data. Instead, they have contracted our consulting firm, STAT462-Investigations, to use the Government’s MEPS longitudinal dataset (2022) to build a predictive model of total annual healthcare costs (TOTEXP22).

As a junior analyst in STAT462-Investigations, your role on our team is to

Conduct the background research on the dataset and explore the data
Use Simple Linear Regression to answer Guardian Health’s questions about the expected annual medical expenses for their customers

What will you submit?

You have already completed a quiz about Guardian Health and the project (worth 20 points)
You get 30 bonus points (previously the peer review) for participating in this roller coaster of a project
By 23:59 on WEDNESDAY 23rd April, I expect you to submit TWO reports for Guardian Health (worth 180 points):

Final Report (Client-Facing Document)
This is a professionally formatted consulting report, ensuring that all results are interpreted and explained clearly for a non-statistical audience, including
- An introduction to the MEPS-2022 dataset, explaining its purpose, structure, and limitations.
- A summary of exploratory data analysis (EDA) and data quality checks.
- A full description of your Simple Linear Regression (SLR) models to answer the client questions, including rationale, key assumptions, and results.
- A conclusion discussing limitations and caveats, ensuring Guardian Health understands the scope and reliability of these predictions for the company’s launch in 2026.
Modelling Logs (Technical appendix/background workings)
This is a record of your entire workflow and decision-making process, so I can see everything you tried, but you can keep the final report neat.
The logs can be written in a more casual style (including non-English languages if English is your second language) and are designed to include things like
- All your messy quality control and exploratory data analysis
- Trying a load of models, then choosing your favorite for the main report
- All your checking/double checking assumptions, fixing transformations etc

All the instructions to create these reports are included below, and there is a detailed rubric at the end.

Task 1. Create professional documents

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 1&2 WORK

1.1 Create your project

1.2 Create your Lab reports

1.3 Add code-chunk options

It’s useful to stop warnings and messages being displayed. Try putting a code chunk looking like this at the top of BOTH of your reports (below the YAML code at the top).

```{r, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
```

1.4 Add libraries to both reports

You will need to load all the apps/libraries/packages you need (all different names for the same thing). I suggest putting a code chunk, below your YAML code but at the top of your script, containing all the libraries you need. Here are a few I suggest along with the associated code chunk options, but please add more as needed.

```{r, include=FALSE, message=FALSE, warning=FALSE}
#-------------------------------------------------------------
# Libraries
#-------------------------------------------------------------
 library(tidyverse)
 library(remotes)
 library(ggplot2)
 library(ggstatsplot)
 library(dplyr)
 library(writexl)
 library(olsrr)
```

If you don’t have all these libraries on your computer, you might need to install them using the App-store accessed via the Package-Tab “Install” button.
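
If you prefer code to the Install button, a minimal sketch like the one below lists which of the suggested packages are not yet installed. The helper name `missing_packages` is my own invention, not part of any package, and the `install.packages()` call is left commented out so that nothing is installed unless you choose to run it.

```r
# Packages suggested in the libraries chunk above
wanted <- c("tidyverse", "remotes", "ggplot2", "ggstatsplot",
            "dplyr", "writexl", "olsrr")

# Hypothetical helper: return the requested packages that are not installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(wanted)
to_install

# Uncomment to install anything that is missing
# install.packages(to_install)
```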

1.5 Checklist

By now, you should have

Made your project, and downloaded the data from canvas into the project folder - along with any logos etc as you wish.
Created a nicely formatted main report file.
Created a nicely formatted background workings log file.
Added common libraries and code chunk options to both reports.

Task 2. Learn about the study

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 1&2 WORK

First, you need to learn more about our client and their needs. Read the text below and note what they want to achieve, their population of interest, and any sub-populations they might care about.

2.1 Learn about Guardian Health

Who are Guardian Health?

Guardian Health is a (fictional) new company preparing to enter the US health insurance market in 2026. They aim to redefine health insurance by offering transparent, data-driven, and customer-focused coverage. Their goal is to bridge the gap between affordability and quality care, ensuring that individuals and families receive the support they need without unnecessary financial burdens.

Guardian Health are particularly focused on reaching younger demographics, such as university students, and communities that have traditionally faced barriers to affordable healthcare. To better serve these populations, Guardian Health is also hoping to develop specialized policies tailored to their unique needs.

Innovation & Data-Driven Approach

Guardian Health distinguishes itself by integrating machine learning and actuarial science to personalize insurance plans. Their proprietary algorithms analyze consumer health trends, regional cost variations, and economic factors to create fair and adaptable pricing structures. Additionally, they emphasize preventative care incentives, rewarding policyholders who engage in wellness programs and routine health screenings.

Given the sensitivity of healthcare data, Guardian Health prioritizes ethical data use and privacy. They are committed to maintaining rigorous security standards and complying with HIPAA and other industry regulations. They are aiming to create their initial policies using the US Government MEPS dataset.

Future Goals

As a startup, Guardian Health need to set competitive yet sustainable premiums for their plans and customers across the United States while expanding their reach into underserved communities. By 2026, Guardian Health aims to establish itself as a reliable alternative to traditional insurers. Their roadmap includes:

Launching pilot programs in select states to test model effectiveness
Partnering with universities, community health organizations, and healthcare providers to expand coverage to students and low-income populations.
Expanding coverage options based on real-world data feedback.
Advocating for more transparent pricing in the insurance industry.

As Guardian Health moves closer to its market debut, they continue to refine their models and develop consumer education initiatives to empower individuals in making informed healthcare choices.

2.2 Learn about MEPS

2.3 Checklist

By now, you should

Understand more about our fictional client
Understand more about MEPS, ready to write up in your report (Task 4)

Task 3. Access your dataset

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK

Originally we had code to read in MEPS from scratch. It turns out that data had many complexities, so your senior analyst, Dr Greatrex, read in the data, did most of the quality control and saved it as an Excel file for you.

3.1 Read in the MEPS data

3.2 Details about the data

3.3 Finish the quality control

I did most of the quality control, but there are a few things left for you to do.

Again, I would work out how to do this in my log/background file, then when I’m happy, copy the code over to the main report.

A. In your background workings, create a section called Quality Control

B. Missing Data

For the sake of an undergrad project, I assumed that all quality control codes (see above) that were NOT -1 should be treated as missing data, NA. This means that R will ignore those values.

Unfortunately I forgot to assign -8 to NA. This code will do this for you, so add this to your quality control section and see how it changes the summary of the data.

 special_code <- -8
 
 health22 <- health22 %>% 
               mutate(across(everything(), ~ replace(.x, .x %in% special_code, NA)))
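
To see what this recode does before running it on the full data, here is a small sketch on a made-up data frame. The `toy` table and its values are invented for illustration; only the `-8` code comes from the text above. It uses base R indexing rather than `mutate(across(...))`, which should give the same result for numeric columns.

```r
# Invented example data: -8 stands in for the missing-data code
toy <- data.frame(AGE      = c(34, -8, 51),
                  TOTEXP22 = c(1200, 0, -8))

special_code <- -8

# Count how many values carry the special code in each column
colSums(toy == special_code)

# Replace every occurrence of the special code with NA
toy[toy == special_code] <- NA

# summary() now reports NA counts instead of treating -8 as a real number
summary(toy)
```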

C. Categorical data and Factors

Some of our data is grouped/categorical and we need to make sure R understands it as such.

Include this code to make your uninsured data and sex columns categorical:

# because it already has labels
health22$UNINS22 <- as.factor(health22$UNINS22)

health22$SEX <- factor(health22$SEX, levels = c(1,2), 
                             labels=c("Male", "Female"))


# set this to a descriptor
health22$PID <- as.character(health22$PID)
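
A quick way to confirm that a conversion like this worked is to look at the levels and a frequency table. The sketch below does so on an invented vector of sex codes; it assumes, as in the code above, that 1 means Male and 2 means Female.

```r
# Invented codes, using the same 1/2 coding as the SEX column above
sex_codes <- c(1, 2, 2, 1, 2, NA)

sex_factor <- factor(sex_codes, levels = c(1, 2),
                     labels = c("Male", "Female"))

# The factor now carries labels rather than numbers
levels(sex_factor)

# Counts per category; useNA = "ifany" also shows missing values
table(sex_factor, useNA = "ifany")
```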

3.4 Exploratory data analysis

3.5 Checklist

By now, you should

Have read your data into your log report
Finished some more data wrangling
Explored the data a little

Task 4. Write report introduction

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK

4.2 Checklist & rubric

If you did ANYTHING Guardian Health or I might find interesting on this dataset, then please keep it in your reports. You don’t have to delete any old analysis, and I regularly award bonus points for interesting things you have found out about the data.

Top introductions will:

Clearly introduce the background for Guardian Health. Identify the population Guardian Health are hoping to extrapolate their results to. (All humans who ever existed?)
Describe MEPS, its background and how it’s created - along with any relevant/interesting information.
Describe the actual excel dataset e.g. the
- the object-of-analysis,
- the dataset extent
- and key variables (you can copy/paste my list from above)
- explain if it’s a sample or a census and why.
Note any initial limitations e.g. does MEPS generalise to all populations? Do you have issues with individual variables/subpopulations?
Make sure the code runs in the main report to read in and quality control the data, then discuss any initial thoughts or potential limitations. Will our results help Guardian Health with all their populations? If not, why not?

Here’s how I am grading this section. Importantly, I am grading based on content, not on grammar/English skills.

Grade	Marks	Rubric
A	NA	Clear, professional intro (300–500 words). Fully explains who the client is, their goals, and what they want to learn. Clearly identifies the statistical population and thoughtfully discusses its scope and limitations (e.g. who is or isn’t covered by MEPS). Strong link between client needs and the MEPS dataset.
B	NA	Covers the main elements: client, goals, population, and dataset. Some discussion of the population, though may miss key nuances or assumptions. Writing is mostly clear and well structured.
C	NA	Basic coverage of client and data. Mentions population but lacks depth or clarity. May not connect dataset to client needs clearly. Writing is understandable but may be vague or loosely organized.
D	NA	One or more key pieces are missing or unclear (e.g., no population mentioned, unclear who the client is, vague use of the dataset). Ideas may be underdeveloped or hard to follow. Effort recognized, but revision needed. Clear thinking is more important than polished grammar.

Task 5. Exploratory Data Analysis

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK

5.1 Analyse TOTEXP22

5.2 Look at correlations with TOTEXP22

Now, in your log file, explore the relationships between TOTEXP22 and different predictor variables. This code makes the core correlation matrix:

cortable <- cor(health22_numeric,use="pairwise.complete.obs")
cortable
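
The full matrix can be hard to read, so one option is to pull out just the column for the response and sort it. This is a sketch on simulated data: `toy_numeric` and its columns `AGE` and `VISITS` are invented stand-ins for `health22_numeric`, not real MEPS variables.

```r
set.seed(462)

# Simulated stand-in for health22_numeric (all columns numeric)
n <- 200
toy_numeric <- data.frame(AGE = round(runif(n, 18, 85)))
toy_numeric$VISITS   <- rpois(n, lambda = 3)
toy_numeric$TOTEXP22 <- 50 * toy_numeric$AGE + 400 * toy_numeric$VISITS +
                        rnorm(n, sd = 1500)

cortable <- cor(toy_numeric, use = "pairwise.complete.obs")

# Keep only correlations with the response, drop its correlation with itself,
# then sort from the largest to the smallest value
cor_with_response <- cortable[, "TOTEXP22"]
cor_with_response <- cor_with_response[names(cor_with_response) != "TOTEXP22"]
sort(cor_with_response, decreasing = TRUE)
```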

You could use correlation matrices: https://psu-spatial.github.io/Stat462-2025/in_T13_Correlation.html
Or make individual scatterplots https://psu-spatial.github.io/Stat462-2025/in_T11_Plots.html#4_Scatterplots (only have to be neat in your final report)

If you want to see all the scatterplots in one place, you can use this code - where you change COLNAME to the name of the column you wish to explore e.g. TOTEXP22. Or you could just copy/paste the scatterplot code or change the variable name to take a look at them.

# THIS TAKES A MINUTE TO RUN!

# Rearrange the format
 health22_longformat <- health22 %>%
   pivot_longer(cols = -COLNAME,   #  KEEP THE -sign in front!  e.g. -TOTEXP22
                names_to = "predictor", 
                values_to = "value")
 
 # then plot using ggplot
 ggplot(health22_longformat, aes(x = value, y = COLNAME)) +
   geom_point(alpha = 0.3,size=1) +
   facet_wrap(~ predictor, scales = "free_x") +
   theme_minimal()

5.3 Look at correlations with ln_TOTEXP22

5.4 Look at correlations with non-zero health costs

Finally, it appears that many people paid zero for their health costs (e.g. they were healthy??). This skews our histogram, and it’s reasonable to assume that the predictors that determine whether someone has any healthcare costs at all are different from the predictors that determine how much they pay if they do.

This code will filter the data to ONLY positive healthcare costs.

  health22_TOTEXP22_gt0 <-  health22[which(health22$TOTEXP22> 0),]

Get this code running and repeat your quality control. Remember you can copy/paste code.
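
Before and after filtering, it helps to know how many rows are affected. Below is a minimal sketch on invented costs. It also shows one reason the zeros matter if you take a plain natural log of the response, since `log(0)` is `-Inf` in R.

```r
# Invented annual costs, including some zeros
toy_costs <- data.frame(TOTEXP22 = c(0, 0, 150, 2300, 0, 48000, 920))

# Share of people with zero recorded expenses
mean(toy_costs$TOTEXP22 == 0)

# Same filter as above, applied to the toy data
# (drop = FALSE keeps a one-column result as a data frame)
toy_gt0 <- toy_costs[which(toy_costs$TOTEXP22 > 0), , drop = FALSE]
nrow(toy_gt0)

# Taking logs of the unfiltered column produces -Inf for the zeros
log(toy_costs$TOTEXP22)
```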

5.5 Write up what you have found in the main report

Create a new section in your main report called Quality Control and EDA (or similar). This is where you’ll present the final version of your exploratory data analysis and quality control steps.

Style:

- Use clear subheadings, bullets, and brief comments so your client (Guardian Health) can follow along.
- Only include plots/tables that help tell a clear story. You can always refer clients back to your log files.

Code to include:

Code to read in the dataset (if not already in your main report)
Code for any quality control/cleaning steps (e.g. removing missing data or non-numeric columns)
Code for summary statistics and distributions for TOTEXP22 and key variables
Code for correlations and/or scatterplots with TOTEXP22 and ln_TOTEXP22
Code for your exploration of cases with TOTEXP22 > 0

What to write:

Explain what you did for quality control and why.
Explain what you did for exploratory data analysis.
Highlight any key insights or limitations (e.g. missing data, skewed variables, zero costs, strange variables).
Reflect on what this means for future modeling and Guardian Health’s goals.

You don’t need to include everything — focus on what’s useful and interesting.


5.6 Checklist & rubric

Top write-ups will:

Clearly explain QC and EDA steps taken
Include relevant summaries or visualizations
Note any data issues or surprises
Use neat, well-organized code
Reflect on what this means for the analysis and the client

Here’s how I am grading this section. Importantly, I am grading based on content, not on grammar/English skills.

# kable_classic_2() and kable_styling() come from the kableExtra package
library(kableExtra)

rubric <- readxl::read_excel("Table_ProjectRubric.xlsx")
knitr::kable(rubric) %>%
  kable_classic_2() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"))

Task 6 SLR question 1

6.1 Age and total expenses

6.2 “Best model”

If you want to compete, here are the best R2 and RMSE that I have found using a single predictor so far.
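
If you want to compare your own models on the same footing, R2 and RMSE can be read off any fitted `lm` object. The sketch below uses simulated data with placeholder variable names `x` and `y`. RMSE is computed here as the square root of the mean squared residual, which is one common convention; the residual standard error reported by `summary()` divides by n - 2 instead, so the two numbers differ slightly.

```r
set.seed(1)

# Simulated data with placeholder names
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(100, sd = 2)

fit <- lm(y ~ x, data = toy)

# Proportion of variance in y explained by the model
r_squared <- summary(fit)$r.squared

# Root mean squared error of the fitted values
rmse <- sqrt(mean(residuals(fit)^2))

c(R2 = r_squared, RMSE = rmse)
```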

6.3 Write up your results

In your main report, create a clearly labeled section called “Simple Linear Regression” or “SLR Models”. Underneath, copy over your neat code and write-up for the two models. Just like before, keep things tidy and only include plots and outputs that help tell your story.

Optional: try playing with code chunk options to show/hide code.

For each model, write clearly and professionally so someone at Guardian Health could understand your choices and results.

For AGE - your write-up should include:

A brief causal explanation (why the predictor might affect total annual expenses)
The regression equation, written out using variable names your client will recognize, making it clear if this is the sample regression or population regression line - and which variant of the response variable you chose.
Real-life meaning of the slope and intercept
Effect size (the real-life impact of the slope, see Lab 3) versus the statistical significance of the slope (e.g. confidence intervals/t-tests)
A check of model assumptions (LINE) and whether they hold alongside any strategies if they don’t.
A prediction (with confidence interval) for a specific example (e.g. a 40-year-old)
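
The pieces in this list map onto a handful of base R calls. Here is a sketch on simulated data, not the MEPS file: the column names `AGE` and `TOTEXP22` and the numbers used to simulate them are assumptions for illustration only, so do not read anything into the output values.

```r
set.seed(462)

# Simulated stand-in for the real data
toy <- data.frame(AGE = round(runif(300, 18, 85)))
toy$TOTEXP22 <- 500 + 90 * toy$AGE + rnorm(300, sd = 3000)

# Fit the sample regression line
model_age <- lm(TOTEXP22 ~ AGE, data = toy)

# Slope and intercept estimates, with standard errors and t-tests
summary(model_age)$coefficients

# 95% confidence intervals for the intercept and slope
slope_ci <- confint(model_age, level = 0.95)
slope_ci

# Expected cost for a 40-year-old, with a 95% confidence interval
new_person <- data.frame(AGE = 40)
pred_40 <- predict(model_age, newdata = new_person,
                   interval = "confidence", level = 0.95)
pred_40
```

Note that `interval = "confidence"` describes the uncertainty in the average cost for all 40-year-olds. Switching to `interval = "prediction"` gives the wider interval for a single new customer.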

For your best model - your write-up should include:

7. Conclusions

Finally, write a short summary/conclusion in your main report.

If you could choose several predictors, state what you would choose and why

BONUS questions for an A**

You don’t need to answer these for an A, but you do for a full 100% score.

It’s often said that BMI (body mass index) will impact how healthy someone is. Do you see evidence to support this in total healthcare costs?
Does mental health or physical health have more of an impact on total healthcare costs?
What do you think the influence of COVID is on these results and Guardian Health’s ambitions?
