It’s said that the best way to learn a topic is to teach it… so in this project, you are explaining logistic regression to medical professionals, focusing on a case study to understand what causes low baby birth weights.
Style
Your style should be clear and understandable, explaining any jargon as
simply as you can. You are also welcome to include hand-drawn
sketches or CITED diagrams from the internet.
What to submit
You will submit your main report (RmD and Html). You are optionally also
allowed to submit a second report containing background workings
etc.
Your project will be based on the following case study. It’s simulated data but I think it closely reflects actual studies.
Your task is to write an article explaining the analysis of this data-set, in the style of a science magazine or say “Nurses Weekly”, following my regression prompts.
Download the data from canvas here: https://psu.instructure.com/courses/2381686/assignments/17157475
“Understanding birth weight: A Study at Hereford Hospital (1980–1984)”
Between 1980 and 1984, staff at Hereford Hospital, a single mid-sized regional hospital in the west of England, conducted a study to explore why some babies are born with low birthweight — defined as less than 2,500 grams. Low birthweight is an important public health concern, as it increases the risk of infant illness, developmental problems, and early death.
The hospital’s maternity unit recorded detailed information about 189 full-term births during this period. The study focused on mothers’ health, age, habits, and access to prenatal care, to understand which factors might be linked to low birth-weight. Hospital staff collected data directly from patient records, birth registries, and prenatal visit notes. The goal was to inform better care guidelines and identify patients who might benefit from more targeted support during pregnancy.
Birth_weight - birth weight in grams.
LowBW - indicator of birth weight less than 2.5 kg (0 = normal, 1 = low birthweight).
Age - mother’s age in years.
Weight - mother’s weight (lb).
Smoke - smoking status during pregnancy (TRUE or FALSE).
Prev_labor - number of previous premature labors.
Heart - history of hypertension or heart issues (0 = no, 1 = yes).
Num_DrVisits - number of doctor visits during the first trimester.
This part is almost identical to project 1, but there are a few more libraries.
This is where you will write up your tutorial about regression. You want it to be easy to read, engaging and with all code/statistics/concepts explained. At minimum:
It should be clear that the report is written by you - give it a good title!
It should have a table of contents and use any theme
It should look professional, with no unneeded text (like the “welcome to R” text)
Unneeded output like library loading, or long data printouts should be suppressed.
You should use headings and sub-headings to guide the reader through the document
It should be saved as a filename including your email id e.g. hlg5155 and the word REPORT
Like project 1, it can be easier to work out your code and what to do, then write it up neatly in a main report. So you are welcome to create/use a background workings report if you choose to.
This is your background work; it doesn’t need to be neat or in English
Unneeded code output like library loading text printouts should be suppressed.
It should be saved as a filename including your email id e.g. hlg5155 and the word WORKINGS
Somewhere near the top of each of your RmD files, make a code chunk and load the following packages. If you don’t have them on your computer, you can install them using the app store. Remember to suppress the code chunk output. If you don’t remember how to do this, see previous lab reports and the instructions above.
tidyverse
ggstatsplot
ggplot2
olsrr
openintro
statsr
broom
blorr
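As a rough sketch, the package chunk could look something like the block below. The chunk header shown in the comment uses the standard knitr options `message=FALSE` and `warning=FALSE` to hide start-up text; the object names `packages` and `installed` are just illustrative. The check for missing packages is an optional extra, not something the assignment requires.

```r
# In the RmD file, the chunk header would look something like:
#   {r setup, message=FALSE, warning=FALSE}
# which hides the start-up messages that packages print when they load.

packages <- c("tidyverse", "ggstatsplot", "ggplot2", "olsrr",
              "openintro", "statsr", "broom", "blorr")

# Check which packages are already installed (TRUE/FALSE for each)
installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# Load the installed ones quietly
for (pkg in packages[installed]) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}

# Any names listed here still need installing before the report will knit
packages[!installed]
```

If the last line prints `character(0)`, everything is installed and loaded.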
Understand the study.
If it’s useful, read in the data first and come back to this
In your report, first describe why medical professionals might be concerned about low birth weights (you can google this!).
We are going to start by loading in your data, playing with it, getting everything working and making sure you understand everything.
Download your data from canvas and put it in your project folder https://psu.instructure.com/courses/2381686/assignments/17157475
Use the read_excel() command to get it into R.
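A minimal sketch of this step is below. It assumes the file was saved as `birthweight.xlsx` in the project folder, which is a hypothetical name, so swap in whatever the Canvas download is actually called. Note that `read_excel()` lives in the readxl package, which is installed with the tidyverse but is not attached by `library(tidyverse)`, so it is loaded separately here.

```r
library(readxl)  # provides read_excel()

# Hypothetical file name: replace with the name of the file you downloaded
data_file <- "birthweight.xlsx"

if (file.exists(data_file)) {
  birth <- read_excel(data_file)
  print(head(birth))  # quick look at the first few rows
  str(birth)          # variable names and types
} else {
  message("File not found: check the name and that it is in the project folder")
}
```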
Now look at summaries, make sure that categorical data is correct, and look for strange output and missing values. Basically, get the data neat and tidy, explaining any R output in the report.
Look at the summary statistics of your dataset
Look for strange values (-99) that could be mistakes, or need to be assigned as missing
Make sure that all categorical variables have been turned into factors
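The checks listed above can be sketched on a tiny made-up data frame. The values below are invented purely to show the steps (they are not the Hereford data), and `toy` is an illustrative name; the column names follow the variable list in the case study.

```r
# A tiny invented data frame with some of the case-study column names
toy <- data.frame(
  LowBW  = c(0, 1, 0, 0, 1, 0),
  Age    = c(24, 19, 31, -99, 22, 28),
  Weight = c(130, 105, -99, 150, 98, 141),
  Smoke  = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Heart  = c(0, 0, 1, 0, 0, 0)
)

summary(toy)  # a minimum of -99 for Age or Weight is a red flag

# Recode the -99 placeholder as missing everywhere it appears
toy[toy == -99] <- NA

# Turn the categorical variables into factors with readable labels
toy$LowBW <- factor(toy$LowBW, levels = c(0, 1), labels = c("normal", "low"))
toy$Smoke <- factor(toy$Smoke, levels = c(FALSE, TRUE),
                    labels = c("non-smoker", "smoker"))
toy$Heart <- factor(toy$Heart, levels = c(0, 1), labels = c("no", "yes"))

colSums(is.na(toy))  # number of missing values in each column
summary(toy)         # factors now show counts instead of means
```

After the factor conversion, `summary()` reports counts per category for LowBW, Smoke and Heart, which is a quick way to confirm the conversion worked.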
Explore the data
Write up what you did.
For each variable
Explain why each variable might lead to differing birth weights. Refer to plots/tables/exploratory analysis to assess whether you see this pattern in your data.
Predict which variable you think will have the largest impact on low birth-weight
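One possible way to look at a single variable against the outcome is sketched below, using base R only. The data are simulated with an invented relationship (they are not the study data), so the numbers themselves mean nothing; only the pattern of commands is the point. The names `toy` and `counts` are illustrative.

```r
set.seed(1)
n <- 189

# Simulated stand-in data with an invented smoking effect
toy <- data.frame(
  Smoke  = sample(c(TRUE, FALSE), n, replace = TRUE),
  Weight = round(rnorm(n, mean = 130, sd = 30))
)
toy$LowBW <- rbinom(n, size = 1, prob = ifelse(toy$Smoke, 0.40, 0.25))

# Categorical predictor: proportion of low-birth-weight babies in each group
counts <- table(Smoke = toy$Smoke, LowBW = toy$LowBW)
prop.table(counts, margin = 1)  # each row sums to 1

# Numeric predictor: compare mothers' weight across the two outcomes
tapply(toy$Weight, toy$LowBW, mean)
boxplot(Weight ~ LowBW, data = toy,
        xlab = "LowBW (0 = normal, 1 = low)",
        ylab = "Mother's weight (lb)")
```

A row-wise proportion table suits the TRUE/FALSE and 0/1 variables, while a boxplot or group mean suits the numeric ones such as Age and Weight.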
All the steps needed are explained in the Titanic Logistic tutorial and in the class notes (and online) https://psu-spatial.github.io/Stat462-2025/in_T20_Logistic.html or even better, the sleep one here on canvas https://psu.instructure.com/courses/2381686/files/177167369?module_item_id=44758669
Fit the model using the glm() function with family = binomial(link = "logit") and the data = ... argument.
Note, these are only selected aspects of logistic regression to keep the project short. Also, remember the class R notes where I explain many of these things in detail.
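The glm() call described above might look something like the sketch below. It uses simulated data with an invented relationship between mother's weight and low birth weight, so the fitted numbers are not the study's results. The names `sim`, `model_weight` and `new_mothers` are illustrative, and the Wald interval from `confint.default()` is just one of several reasonable ways to get a confidence interval.

```r
set.seed(462)
n <- 189

# Simulated stand-in data: heavier mothers are given slightly lower odds
sim <- data.frame(Weight = round(rnorm(n, mean = 130, sd = 30)))
sim$LowBW <- rbinom(n, size = 1, prob = plogis(1 - 0.015 * sim$Weight))

# Logistic regression of low birth weight on mother's weight
model_weight <- glm(LowBW ~ Weight, data = sim,
                    family = binomial(link = "logit"))
summary(model_weight)

coef(model_weight)                  # slope: change in log-odds per extra lb
exp(coef(model_weight))             # odds ratio per extra lb
exp(confint.default(model_weight))  # Wald 95% interval on the odds-ratio scale

# Predicted probability of low birth weight for mothers of 100, 130, 160 lb
new_mothers <- data.frame(Weight = c(100, 130, 160))
predict(model_weight, newdata = new_mothers, type = "response")
```

The slope in the summary is on the log-odds scale; exponentiating it gives an odds ratio, and `type = "response"` converts predictions back to probabilities between 0 and 1, which is usually the easiest scale to explain to a non-statistical reader.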
You do not need to write out all the steps above unless it’s useful, e.g. you can just fit the model using the glm command and look at the summary. Then explain to the reader…
Out of the two models, explain whether age or weight has the larger effect size.
Out of the two models, explain whether age or weight is more likely to have a statistically significant and reproducible effect at the 5% level.
Compare the AIC values across the two models (you can read/compare them directly from the model summaries) and explain which is ‘best’.
Given everything above, which model would you personally recommend and why?
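One way to line up the two models for the comparison questions above is sketched here. Again the data are simulated with invented effects, so the output says nothing about the real study; `sim`, `model_age`, `model_weight` and `comparison` are illustrative names.

```r
set.seed(2025)
n <- 189

# Simulated stand-in data with invented effects for Age and Weight
sim <- data.frame(
  Age    = round(runif(n, min = 15, max = 45)),
  Weight = round(rnorm(n, mean = 130, sd = 30))
)
sim$LowBW <- rbinom(n, size = 1,
                    prob = plogis(1.5 - 0.02 * sim$Age - 0.015 * sim$Weight))

# One single-predictor logistic model for each variable
model_age    <- glm(LowBW ~ Age,    data = sim, family = binomial(link = "logit"))
model_weight <- glm(LowBW ~ Weight, data = sim, family = binomial(link = "logit"))

# Pull the slope row (row 2) from each coefficient table
comparison <- rbind(
  Age    = summary(model_age)$coefficients[2, ],
  Weight = summary(model_weight)$coefficients[2, ]
)

# Add the odds ratio and the AIC of each model alongside
comparison <- cbind(comparison,
                    OddsRatio = exp(comparison[, "Estimate"]),
                    AIC = c(AIC(model_age), AIC(model_weight)))
comparison
```

When comparing effect sizes, keep in mind that the Age slope is per year and the Weight slope is per pound, so the raw numbers are in different units. For AIC, lower values indicate the better trade-off between fit and complexity.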
What was the most challenging part of this project for you — and how did you work through it?
If you had more time or more data, what would be the next question you’d want to explore?
Congratulations! Finished!
Originally, this was either going to be peer reviewed or supported via a quiz.
However, the length of time we spent on Project 1 meant that I had to cut this project short, so instead of the quiz, I am again awarding 50 points for participation, i.e. you are being graded out of 180 for this project (as promised) and I will give you 50 points for free.
Overall, here is what your lab should correspond to:
| Topic | Points | Explanation |
|---|---|---|
| Clarity and Communication | 30 | Clear, engaging explanations suitable for a medical/professional audience. Avoids jargon or explains it well. Organized with headings, good writing style. |
| Code Quality & Reproducibility | 25 | Well-organized, commented code with suppressed unnecessary output. Report knits successfully. |
| Exploratory Data Analysis | 40 | Correct use of summary statistics, data cleaning, and visualization. Demonstrates understanding of variable types and dataset limitations. |
| Modeling & Interpretation: Weight | 40 | Correct logistic model setup; thoughtful interpretation of log-odds, odds, and predicted probabilities. Uses confidence intervals and fit measures appropriately. |
| Modeling & Comparison: Age vs Weight | 30 | Accurate comparison of model outputs (effect size, p-values, AIC). Clear explanation of which model is better and why. |
| Reflection | 15 | Thoughtful and honest reflection on the project process and learning. |
| PLUS 50 POINTS PARTICIPATION | 50 | For getting this far |
| TOTAL | 230 | |
And.. finished!