Project 1 - PART 1.

Introduction

Guardian Health is a new start-up company that is preparing to enter the US health insurance market in 2026. They need to set competitive yet sustainable premiums for their plans and customers across the United States, but because they are new, they lack historical claims data. Instead, they have contracted our consulting firm, 462-Investigations, to use the Government’s MEPS longitudinal dataset (2022) to build a predictive model of total annual healthcare costs (TOTEXP22).

As a junior analyst in STAT462-Investigations, your role on our team is to

  1. Conduct the background research on the dataset and explore the data
  2. Use Simple Linear Regression to answer Guardian Health’s questions about the expected annual medical expenses for their customers

What will you submit?

  1. You have already completed a quiz about Guardian Health and the project (worth 20 points)

  2. You get 30 bonus points (previously the peer review) for participating in this roller coaster of a project

  3. By 23:59 on WEDNESDAY 23rd April, I expect you to submit TWO reports for Guardian Health (worth 180 points):

    Final Report (Client-Facing Document)
    This is a professionally formatted consulting report, ensuring that all results are interpreted and explained clearly for a non-statistical audience, including

    • An introduction to the MEPS-2022 dataset, explaining its purpose, structure, and limitations.

    • A summary of exploratory data analysis (EDA) and data quality checks.

    • A full description of your Simple Linear Regression (SLR) models to answer the client questions, including rationale, key assumptions, and results.

    • A conclusion discussing limitations and caveats, ensuring Guardian Health understands the scope and reliability of these predictions for the company’s launch in 2026.


    Modelling Logs (Technical appendix/background workings)
    This is a record of your entire workflow and decision-making process, so I can see everything you tried, but you can keep the final report neat.
    The logs can be written in a more casual style (including languages other than English if English is your second language) and are designed to include things like

    • All your messy quality control and exploratory data analysis

    • Trying a load of models, then choosing your favorite for the main report

    • All your checking/double checking assumptions, fixing transformations etc


All the instructions to create these reports are included below, and there is a detailed rubric at the end.


Task 1. Create professional documents

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 1&2 WORK

1.1 Create your project

1.2 Create your Lab reports


1.3 Add code-chunk options


1.4 Add libraries to both reports


1.5 Checklist

By now, you should have

  1. Made your project and downloaded the data from Canvas into the project folder, along with any logos etc. you wish to use.
  2. Created a nicely formatted main report file.
  3. Created a nicely formatted background workings log file.
  4. Added common libraries and code chunk options to both reports.



Task 2. Learn about the study

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 1&2 WORK

First, you need to learn more about our client and their needs. Read the text below and note what they want to achieve, their population of interest, and any sub-populations they might care about.

2.1 Learn about Guardian Health

2.2 Learn about MEPS

2.3 Checklist

By now, you should

  • Understand more about our fictional client
  • Understand more about MEPS, ready to write up in your report (Task 4)



Task 3. Access your dataset

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK

Originally we had code to read in MEPS from scratch. It turns out that the data had many complexities, so your senior analyst, Dr Greatrex, read in the data, did most of the quality control, and saved it as an Excel file for you.

3.1 Read in the MEPS data


3.2 Details about the data


3.3 Finish the quality control

I did most of the quality control, but there are a few things left for you to do.


3.4 Exploratory data analysis

3.5 Checklist

By now, you should

  • Have read your data into your log report
  • Finished some more data wrangling
  • Explored the data a little
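
The remaining quality-control steps might look something like the sketch below. It uses a small made-up data frame rather than the real MEPS file, and the column names other than TOTEXP22 (AGE, REGION) are placeholders, so treat it as a pattern to adapt rather than the required approach.

```r
# Toy stand-in for the MEPS extract; values and most column names are invented.
meps_toy <- data.frame(
  TOTEXP22 = c(0, 1250, 8300, NA, 420, 15600),
  AGE      = c(23, 45, 67, 31, NA, 58),
  REGION   = c("South", "West", "South", "Midwest", "West", "Northeast")
)

# 1. Count missing values in each column.
missing_per_column <- colSums(is.na(meps_toy))
print(missing_per_column)

# 2. Keep only the rows with no missing values.
meps_complete <- meps_toy[complete.cases(meps_toy), ]

# 3. Keep only the numeric columns (needed later for correlations).
is_numeric_col <- vapply(meps_complete, is.numeric, logical(1))
meps_numeric <- meps_complete[, is_numeric_col, drop = FALSE]

str(meps_numeric)
```

Dropping every row with a missing value is the bluntest option. In your logs, note how many rows you lose at each step so you can judge whether the cleaned data still represents the people Guardian Health cares about.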



Task 4. Write report introduction

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK

4.2 Checklist & rubric

If you did ANYTHING Guardian Health or I might find interesting on this dataset, then please keep it in your reports. You don’t have to delete any old analysis - and I regularly award bonus points for interesting things you have found out about the data.

Top introductions will:

  1. Clearly introduce the background for Guardian Health. Identify the population Guardian Health are hoping to extrapolate their results to. (All humans who ever existed?)

  2. Describe MEPS, its background and how it’s created - along with any relevant/interesting information.

  3. Describe the actual Excel dataset, e.g.

    • the object-of-analysis,
    • the dataset extent,
    • the key variables (you can copy/paste my list from above),
    • whether it’s a sample or a census, and why.
  4. Note any initial limitations e.g. does MEPS generalise to all populations? Do you have issues with individual variables/subpopulations?

  5. Make sure the code runs in the main report to read in and quality control the data, then discuss any initial thoughts or potential limitations. Do you have issues with individual variables/subpopulations? Will our results help Guardian Health with all their populations? If not, why not?

Here’s how I am grading this section. Importantly, I am grading based on content, not on grammar/English skills.

Grade | Marks | Rubric
----- | ----- | ------
A | NA | Clear, professional intro (300–500 words). Fully explains who the client is, their goals, and what they want to learn. Clearly identifies the statistical population and thoughtfully discusses its scope and limitations (e.g. who is or isn’t covered by MEPS). Strong link between client needs and the MEPS dataset.
B | NA | Covers the main elements: client, goals, population, and dataset. Some discussion of the population, though may miss key nuances or assumptions. Writing is mostly clear and well structured.
C | NA | Basic coverage of client and data. Mentions population but lacks depth or clarity. May not connect dataset to client needs clearly. Writing is understandable but may be vague or loosely organized.
D | NA | One or more key pieces are missing or unclear (e.g., no population mentioned, unclear who the client is, vague use of the dataset). Ideas may be underdeveloped or hard to follow. Effort recognized, but revision needed. Clear thinking is more important than polished grammar.



Task 5. Exploratory Data Analysis

THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK

5.1 Analyse TOTEXP22


5.2 Look at correlations with TOTEXP22


5.3 Look at correlations with ln_TOTEXP22


5.4 Look at correlations with non-zero health costs



5.5 Write up what you have found in the main report

Create a new section in your main report called Quality Control and EDA (or similar). This is where you’ll present the final version of your exploratory data analysis and quality control steps.

Style:

  • Use clear subheadings, bullets, and brief comments so your client (Guardian Health) can follow along.
  • Only include plots/tables that help tell a clear story. You can always refer clients back to your log files.

Code to include:

  • Code to read in the dataset (if not already in your main report)
  • Code for any quality control/cleaning steps (e.g. removing missing data or non-numeric columns)
  • Code for summary statistics and distributions for TOTEXP22 and key variables
  • Code for correlations and/or scatterplots with TOTEXP22 and ln_TOTEXP22
  • Code for your exploration of cases with TOTEXP22 > 0
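
As a rough illustration of the kind of code listed above, here is a sketch on simulated data. Everything in it is invented: the data values, the AGE column name, and in particular the use of log(1 + x) for the logged response, which is an assumption and may not match how ln_TOTEXP22 is defined in your dataset.

```r
set.seed(462)  # reproducible toy data

# Simulated stand-in: right-skewed costs with some zeros (all values invented).
n <- 500
AGE <- sample(18:85, n, replace = TRUE)
TOTEXP22 <- round(rlnorm(n, meanlog = 6 + 0.02 * AGE, sdlog = 1.2))
TOTEXP22[sample(n, 60)] <- 0  # some people have no expenses at all
meps_toy <- data.frame(AGE, TOTEXP22)

# ASSUMPTION: log(1 + x) keeps zero costs finite; the course data may
# define ln_TOTEXP22 differently.
meps_toy$ln_TOTEXP22 <- log1p(meps_toy$TOTEXP22)

# Summary statistics and distribution of the response.
summary(meps_toy$TOTEXP22)
hist(meps_toy$TOTEXP22, breaks = 40,
     main = "TOTEXP22 (toy data)", xlab = "Total annual expenses")

# Correlations with the raw and the logged response.
cor_raw <- cor(meps_toy$AGE, meps_toy$TOTEXP22)
cor_log <- cor(meps_toy$AGE, meps_toy$ln_TOTEXP22)

# Cases with non-zero costs only.
meps_nonzero <- subset(meps_toy, TOTEXP22 > 0)
cor_nonzero <- cor(meps_nonzero$AGE, meps_nonzero$ln_TOTEXP22)

round(c(raw = cor_raw, log = cor_log, nonzero_log = cor_nonzero), 2)
```

Comparing the three correlations side by side is one simple way to show your client how much the skew and the zero-cost cases affect the apparent relationship.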

What to write:

  1. Explain what you did for quality control and why.
  2. Explain what you did for exploratory data analysis.
  3. Highlight any key insights or limitations (e.g. missing data, skewed variables, zero costs, strange variables).
  4. Reflect on what this means for future modeling and Guardian Health’s goals

You don’t need to include everything — focus on what’s useful and interesting.


5.6 Checklist & rubric

Top write-ups will:

  • Clearly explain QC and EDA steps taken

  • Include relevant summaries or visualizations

  • Note any data issues or surprises

  • Use neat, well-organized code

  • Reflect on what this means for the analysis and the client

Here’s how I am grading this section. Importantly, I am grading based on content, not on grammar/English skills.

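# kableExtra provides kable_classic_2(), kable_styling(), and the %>% pipe
library(kableExtra)
# Read the rubric from the Excel file in the project folder and render it as a styled table
rubric <- readxl::read_excel("Table_ProjectRubric.xlsx")
knitr::kable(rubric) %>%
  kable_classic_2() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"))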



Task 6. SLR question 1

6.1 Age and total expenses


6.2 “Best model”

If you want to compete, here are the best R2 and RMSE that I have found using a single predictor so far.
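
If you are unsure how to get these two numbers out of a fitted model, the sketch below shows one way on simulated data. The data values and the AGE column name are invented, and RMSE is computed here as the root mean squared residual, which is one common convention and may differ slightly from the one used for the benchmark.

```r
set.seed(462)

# Toy data: values and the AGE column name are invented for illustration.
n <- 200
AGE <- runif(n, 18, 85)
ln_TOTEXP22 <- 5 + 0.03 * AGE + rnorm(n, sd = 1.5)
toy <- data.frame(AGE, ln_TOTEXP22)

fit <- lm(ln_TOTEXP22 ~ AGE, data = toy)

# R-squared: proportion of variance in the response explained by the model.
r_squared <- summary(fit)$r.squared

# RMSE: square root of the mean squared residual, in the response's units.
rmse <- sqrt(mean(residuals(fit)^2))

c(R2 = r_squared, RMSE = rmse)
```

Note that `sigma(fit)` gives the residual standard error, which divides by n - 2 rather than n, so it will be slightly larger. Also remember that an RMSE on the log scale is not directly comparable with one on the raw dollar scale.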


6.3 Write up your results

In your main report, create a clearly labeled section called “Simple Linear Regression” or “SLR Models”. Underneath, copy over your neat code and write-up for the two models. Just like before, keep things tidy and only include plots and outputs that help tell your story.

Optional: try playing with code chunk options to show/hide code.

For each model, write clearly and professionally so someone at Guardian Health could understand your choices and results.

For AGE - your write-up should include:

  • A brief causal explanation (why the predictor might affect total annual expenses)

  • The regression equation, written out using variable names your client will recognize, making it clear if this is the sample regression or population regression line - and which variant of the response variable you chose.

  • Real-life meaning of the slope and intercept

  • Effect size (the real-life impact of the slope, see Lab 3) versus the statistical significance of the slope (e.g. confidence intervals/T-tests)

  • A check of the model assumptions (LINE) and whether they hold, along with strategies to address any that don’t.

  • A prediction (with confidence interval) for a specific example (e.g. a 40-year-old)
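
The items above could be produced with code along these lines. This is only a sketch on simulated data: the values, the AGE column name, and the choice of a logged response are all assumptions for illustration, and your own model may reasonably use a different variant of the response.

```r
set.seed(462)

# Toy data (invented): logged annual expenses rising gently with age.
n <- 300
AGE <- runif(n, 18, 85)
ln_TOTEXP22 <- 5 + 0.03 * AGE + rnorm(n, sd = 1.5)
toy <- data.frame(AGE, ln_TOTEXP22)

# Fit the sample regression line: predicted ln_TOTEXP22 = b0 + b1 * AGE
fit_age <- lm(ln_TOTEXP22 ~ AGE, data = toy)

# Intercept and slope estimates, with standard errors and t-tests.
summary(fit_age)$coefficients

# 95% confidence intervals for the intercept and slope.
ci <- confint(fit_age, level = 0.95)
ci

# LINE checks: residuals vs fitted, normal Q-Q, scale-location, leverage.
par(mfrow = c(2, 2))
plot(fit_age)
par(mfrow = c(1, 1))

# Prediction for a 40-year-old.
new_person <- data.frame(AGE = 40)
pred_mean <- predict(fit_age, new_person, interval = "confidence", level = 0.95)
pred_one  <- predict(fit_age, new_person, interval = "prediction", level = 0.95)
pred_mean
pred_one
```

The confidence interval describes the average response across all 40-year-olds, while the prediction interval describes one individual and is always wider. If you model a logged response, these numbers are on the log scale, so explain to your client how you convert them back to dollars.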


For your best model - your write-up should include:



Task 7. Conclusions

Finally, write a short summary/conclusion in your main report.

If you could choose several predictors, state which you would choose and why.

BONUS questions for an A**

You don’t need to answer these for an A, but you do for a top 100% score.

  • It’s often said that BMI (body mass index) will impact how healthy someone is. Do you see evidence to support this in total healthcare costs?

  • Does mental health or physical health have more of an impact on total healthcare costs?

  • What do you think the influence of COVID is on these results and Guardian Health’s ambitions?

Grading rubric