Guardian Health is a new start-up company preparing to enter the US health insurance market in 2026. They need to set competitive yet sustainable premiums for their plans and customers across the United States, but because they are new, they lack historical claims data. Instead, they have contracted our consulting firm, 462-Investigations, to use the Government’s MEPS longitudinal dataset (2022) to build a predictive model of total annual healthcare costs (TOTEXP22).
As a junior analyst in STAT462-Investigations, your role on our team is to
What will you submit?
You have already completed a quiz about Guardian Health and the project (worth 20 points).
You get 30 bonus points (previously the peer review) for participating in this roller coaster of a project.
By 23:59 on WEDNESDAY 23rd April, I expect you to submit TWO reports for Guardian Health (worth 180 points):
Final Report (Client-Facing Document)
This is a professionally formatted consulting report, ensuring that all results are interpreted and explained
clearly for a non-statistical audience, including
An introduction to the MEPS-2022 dataset, explaining its purpose, structure, and limitations.
A summary of exploratory data analysis (EDA) and data quality checks.
A full description of your Simple Linear Regression (SLR) models to answer the client questions, including rationale, key assumptions, and results.
A conclusion discussing limitations and caveats, ensuring Guardian Health understands the scope and reliability
of these predictions for the company’s launch in 2026.
Modelling Logs (Technical appendix/background workings)
This is a record of your entire workflow and decision-making process, so I can see everything you tried, but you
can keep the final report neat.
The logs can be written in a more casual style (including non-English languages if English is your second language)
and are designed to include things like
All your messy quality control and exploratory data analysis
Trying a load of models, then choosing your favorite for the main report
All your checking/double checking assumptions, fixing transformations etc
All the instructions to create these reports are included below, and there is a detailed rubric at the end.
THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 1&2 WORK
By now, you should have
THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 1&2 WORK
First, you need to learn more about our client and their needs. Read the text below and note what they want to achieve, their population of interest and any sub-populations they might care about.
By now, you should
THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK
Originally we had code to read in MEPS from scratch. It turns out that data had many complexities, so your senior analyst, Dr Greatrex, read in the data, did most of the quality control and saved it as an Excel file for you.
I did most of the quality control, but there are a few things left for you to do.
By now, you should
THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK
If you did ANYTHING Guardian Health or I might find interesting on this dataset, then please keep it in your reports. You don’t have to delete any old analysis, and I regularly award bonus points for interesting things you have found out about the data.
Top introductions will:
Clearly introduce the background for Guardian Health. Identify the population Guardian Health are hoping to extrapolate their results to. (All humans who ever existed?)
Describe MEPS, its background and how it’s created - along with any relevant/interesting information.
Describe the actual Excel dataset e.g. the
Note any initial limitations e.g. does MEPS generalise to all populations? Do you have issues with individual variables/subpopulations?
Make sure the code runs in the main report to read in and quality control the data, then discuss any initial thoughts or potential limitations. Do you have issues with individual variables/subpopulations? Will our results help Guardian Health with all their populations? If not, why not?
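Here is a minimal sketch of the kind of read-in check meant above. The file name `MEPS_2022.xlsx` and every column except `TOTEXP22` are hypothetical placeholders, and a small made-up data frame stands in for the real file so the snippet runs on its own. Treating negative values as codes for unusable answers is an assumption you should confirm against the MEPS codebook for each variable.

```r
# In your report the data would come from the Excel file, e.g.
# meps <- readxl::read_excel("MEPS_2022.xlsx")   # hypothetical file name
# A tiny made-up data frame stands in for it here.
meps <- data.frame(
  TOTEXP22 = c(0, 1250, 98000, 430, NA, 2210),   # response named in the brief
  AGE      = c(34, 61, -1, 45, 29, 80),          # hypothetical column; -1 plays the role of a code
  REGION   = c("South", "West", NA, "Midwest", "South", "Northeast")  # hypothetical column
)

# Basic structure: rows, columns, and column types
dim(meps)
str(meps)

# Missing values per column
n_missing <- colSums(is.na(meps))

# Negative values per numeric column (assumed to be reserved codes, not real values)
is_num <- vapply(meps, is.numeric, logical(1))
n_negative <- vapply(meps[is_num], function(x) sum(x < 0, na.rm = TRUE), integer(1))

qc_summary <- list(missing = n_missing, negative = n_negative)
print(qc_summary)
```

Counts like these give you something concrete to discuss: which variables lose observations, and whether the people dropped differ from the population Guardian Health cares about.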
Here’s how I am grading this section. Importantly, I am grading based on content, not on grammar/English skills.
| Grade | Marks | Rubric |
|---|---|---|
| A | NA | Clear, professional intro (300–500 words). Fully explains who the client is, their goals, and what they want to learn. Clearly identifies the statistical population and thoughtfully discusses its scope and limitations (e.g. who is or isn’t covered by MEPS). Strong link between client needs and the MEPS dataset. |
| B | NA | Covers the main elements: client, goals, population, and dataset. Some discussion of the population, though may miss key nuances or assumptions. Writing is mostly clear and well structured. |
| C | NA | Basic coverage of client and data. Mentions population but lacks depth or clarity. May not connect dataset to client needs clearly. Writing is understandable but may be vague or loosely organized. |
| D | NA | One or more key pieces are missing or unclear (e.g., no population mentioned, unclear who the client is, vague use of the dataset). Ideas may be underdeveloped or hard to follow. Effort recognized, but revision needed. Clear thinking is more important than polished grammar. |
THIS WOULD HAVE BEEN DONE AS PART OF YOUR WEEK 2 WORK
Create a new section in your main report called Quality Control and EDA (or similar). This is where you’ll present the final version of your exploratory data analysis and quality control steps.
Style

- Use clear subheadings, bullets, and brief comments so your client (Guardian Health) can follow along.
- Only include plots/tables that help tell a clear story. You can always refer clients back to your log files.
You don’t need to include everything — focus on what’s useful and interesting.
Top write-ups will:
Clearly explain QC and EDA steps taken
Include relevant summaries or visualizations
Note any data issues or surprises
Use neat, well-organized code
Reflect on what this means for the analysis and the client
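One check that often matters for cost data is the shape of the response. The sketch below uses simulated values (not MEPS data) to show how you might summarise `TOTEXP22`, count zero-cost people, and compare the raw and log-transformed scales. `log1p()` is one possible transformation, used here as an assumption for illustration rather than a required choice.

```r
set.seed(462)

# Simulated stand-in for TOTEXP22: mostly modest costs, a few very large, some zeros
n <- 500
totexp <- round(rlnorm(n, meanlog = 7, sdlog = 1.5))
totexp[sample(n, 60)] <- 0

# Summary statistics: a mean far above the median signals right skew
summary(totexp)

# Share of people with no recorded spending
prop_zero <- mean(totexp == 0)

# log1p(x) = log(1 + x), so zeros stay defined after the transformation
log_totexp <- log1p(totexp)

# Side-by-side histograms for the report
op <- par(mfrow = c(1, 2))
hist(totexp, breaks = 40, main = "Raw scale", xlab = "Total expenses ($)")
hist(log_totexp, breaks = 40, main = "log(1 + x) scale", xlab = "log(1 + total expenses)")
par(op)

# Ratio of mean to median on each scale (closer to 1 means more symmetric)
c(raw = mean(totexp) / median(totexp),
  logged = mean(log_totexp) / median(log_totexp))
```

If you model a transformed response, say so in the report, because the slope and any predictions are then on the transformed scale.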
Here’s how I am grading this section. Importantly, I am grading based on content, not on grammar/English skills.
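# Read the rubric table and display it as a styled table
library(kableExtra)  # provides kable_classic_2(), kable_styling() and the %>% pipe
rubric <- readxl::read_excel("Table_ProjectRubric.xlsx")
knitr::kable(rubric) %>%
  kable_classic_2() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"))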
If you want to compete, here are the best R2 and RMSE that I have found with a single predictor so far.
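If you want to compare your own models on the same two metrics, here is one way they could be computed for a single-predictor model. The data are simulated and the predictor name `age` is a placeholder. RMSE is taken here as the root mean squared residual on the fitted data, which is an assumption about the definition being used.

```r
set.seed(1)

# Simulated data: 'age' is a placeholder predictor, 'totexp' stands in for TOTEXP22
n <- 200
age <- runif(n, 18, 85)
totexp <- 500 + 90 * age + rnorm(n, sd = 2500)
dat <- data.frame(age, totexp)

# Fit a simple linear regression with one predictor
fit <- lm(totexp ~ age, data = dat)

# R-squared: share of the variation in the response explained by the model
r2 <- summary(fit)$r.squared

# RMSE: typical size of a residual, in the units of the response
rmse <- sqrt(mean(residuals(fit)^2))

round(c(R2 = r2, RMSE = rmse), 3)
```

RMSE is measured in the units of the response, so values from a model of raw costs and a model of transformed costs are not directly comparable.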
In your main report, create a clearly labeled section called “Simple Linear Regression” or “SLR Models”. Underneath, copy over your neat code and write-up for the two models. Just like before, keep things tidy and only include plots and outputs that help tell your story.
Optional: try playing with code chunk options to show/hide code.
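As a sketch of what that could look like: `echo`, `message` and `warning` are standard knitr chunk options, and setting them once in a setup chunk changes the default for every later chunk. Whether you hide code globally or chunk by chunk is a style choice.

```r
# Place in a setup chunk near the top of the report.
# echo = FALSE hides the code but still shows its output (plots, tables).
knitr::opts_chunk$set(
  echo    = FALSE,  # do not print the R code
  message = FALSE,  # suppress package start-up messages
  warning = FALSE   # suppress warnings in the rendered report
)

# Inspect the current default to confirm the setting took effect
knitr::opts_chunk$get("echo")
```

An individual chunk can still override the default by setting the same option in its own header, for example `echo=TRUE` on a chunk whose code you want the client to see.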
For each model, write clearly and professionally so someone at Guardian Health could understand your choices and results.
A brief causal explanation (why the predictor might affect total annual expenses)
The regression equation, written out using variable names your client will recognize, making it clear if this is the sample regression or population regression line - and which variant of the response variable you chose.
Real-life meaning of the slope and intercept
Effect size (the real-life impact of the slope, see Lab 3) vs statistical significance of the slope (e.g. confidence intervals/t-tests)
A check of model assumptions (LINE) and whether they hold, alongside any strategies if they don’t.
A prediction (with confidence interval) for a specific example (e.g. a 40-year-old)
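The checklist above can be sketched in a few lines of base R. Everything here is illustrative: the data are simulated, `age` is a placeholder predictor, and the response is modelled on the raw scale only to keep the example short. Both interval types are shown for the 40-year-old, since a confidence interval describes the average cost at that age while a prediction interval describes one individual.

```r
set.seed(2026)

# Simulated stand-in data: 'age' is a placeholder, 'totexp' mimics TOTEXP22
n <- 300
age <- round(runif(n, 18, 85))
totexp <- 1000 + 120 * age + rnorm(n, sd = 3000)
dat <- data.frame(age, totexp)

# Sample regression line: estimated from this sample, totexp-hat = b0 + b1 * age
fit <- lm(totexp ~ age, data = dat)
b <- coef(fit)

# Statistical significance: t-test and 95% confidence interval for the slope
slope_test <- summary(fit)$coefficients["age", ]
slope_ci <- confint(fit, "age", level = 0.95)

# Effect size: estimated change in cost for a 10-year age difference
effect_10yr <- 10 * b[["age"]]

# LINE checks: residuals vs fitted (linearity, equal variance) and Q-Q plot (normality)
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 2)
par(op)

# Prediction for a 40-year-old
new_person <- data.frame(age = 40)
ci_mean <- predict(fit, new_person, interval = "confidence", level = 0.95)
pi_indiv <- predict(fit, new_person, interval = "prediction", level = 0.95)

print(slope_test); print(slope_ci); print(effect_10yr)
print(ci_mean); print(pi_indiv)
```

The two plots do not cover independence, which is mostly a question about how the data were collected and is better argued in words.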
Finally, write a short summary/conclusion in your main report.
If you could choose several predictors, state what you would choose and why
You don’t need to answer these for an A, but you do for a top 100% score
It’s often said that BMI (body mass index) will impact how healthy someone is. Do you see evidence to support this on total healthcare costs?
Does mental health or physical health have more of an impact on total healthcare costs?
What do you think the influence of COVID is on these results and Guardian Health’s ambitions?