1 Overview

It’s said that the best way to learn a topic is to teach it… so in this project, you are explaining logistic regression to medical professionals, focusing on a case study to understand what causes low baby birth weights.

Style

Your style should be clear, understandable, explaining any jargon as simply as you can. You are also welcome to include in hand-drawn sketches, or CITED diagrams from the internet.

What to submit

You will your main report (RmD and Html). You are optionally also allowed to submit a second report containing background workings etc.

2 Details about your case study

Your project will be based on the following case study. It’s simulated data but I think it closely reflects actual studies.

Your task is to write an article explaining the analysis of this data-set, in the style of a science magazine or say “Nurses Weekly”, following my regression prompts.

Download the data from canvas here: https://psu.instructure.com/courses/2381686/assignments/17157475

“Understanding birth weight: A Study at Hereford Hospital (1980–1984)”

Between 1980 and 1984, staff at Hereford Hospital, a single mid-sized regional hospital in the west of England, conducted a study to explore why some babies are born with low birthweight — defined as less than 2,500 grams. Low birthweight is an important public health concern, as it increases the risk of infant illness, developmental problems, and early death.

The hospital’s maternity unit recorded detailed information about 189 full-term births during this period. The study focused on mothers’ health, age, habits, and access to prenatal care, to understand which factors might be linked to low birth-weight. Hospital staff collected data directly from patient records, birth registries, and prenatal visit notes. The goal was to inform better care guidelines and identify patients who might benefit from more targeted support during pregnancy.

Birth_weight - birth weight in grams.
LowBW - Indicator of birth weight less than 2.5 kg. (0 = normal, 1 = low birthweight)
Age - mother’s age in years.
Weight - mother’s weight (lb)
Smoke - smoking status during pregnancy (TRUE or FALSE)
Prev_labor - number of previous premature labors.
Heart - history of hypertension, or heart issues (0 = no, 1=yes)
Num_DrVisits - number of doctors visits during the first trimester.

3 Getting Started

This part is almost identical to project 1, but there are a few more libraries.

Create a new project for Project 2, in order to store your data, analysis & reports.
Instructions here https://psu-spatial.github.io/Stat462-2025/CH2_EACHLAB.html#1_R-Projects

Create your main report RmD.
Instructions here: https://psu-spatial.github.io/Stat462-2025/CH2_EACHLAB.html#33_Creating_R-Markdown_Files

This is where you will write up your tutorial about regression. You want it to be easy to read, engaging and with all code/statistics/concepts explained. At minimum:
- It should be clear that the report is written by you - give it a good title!
- It should have a table of contents and use any theme
- It should look professional, with no unneeded text (like the “welcome to R” text)
- Unneeded output like library loading, or long data printouts should be suppressed.
- You should use headings and sub-headings to guide the reader through the document
- It should be saved as a filename including your email id e.g. hlg5155 and the word REPORT\

Optional! Create an RmD for your background workings.
Instructions here: https://psu-spatial.github.io/Stat462-2025/CH2_EACHLAB.html#33_Creating_R-Markdown_Files

Like project 1, it can be easier to work out your code and what to do, then write it up neatly in a main report. So you are welcome to create/use a background workings report if you choose to.
- This is your background work, it doesn’t need to be neat or in English
- Unneeded code output like library loading text printouts should be suppressed.
- It should be saved as a filename including your email id e.g. hlg5155 and the word WORKINGS\

Load packages
Instructions here: https://psu-spatial.github.io/Stat462-2025/CH2_EACHLAB.html#2_Packages

Somewhere near the top of each of your RmD files make a code chunk and load the following packages. If you don’t have them on your computer, you can load them using the app store. Remember to suppress the code chunk output. If you don’t remember how to do this, see previous lab reports and the instructions above.
- tidyverse
- ggstatsplot
- ggplot2
- olsrr
- openintro
- statsr
- broom
- blorr

Understand the study.

If it’s useful, read in the data first and come back to this

In your report first describe why medical professionals might be concerned about low birth weights (you can google this!) alongside.
- The background of the dataset, how it was collected (and why)
- The object-of-analysis, dataset extent, each variable & units.
- The larger population our sample is reasonably representative of in time & space. Does the dataset reflect broader populations, or only a specific group? Remember you can google Hereford’s demographics!
- Note down any potential limitations/fallacies/issues with the data.

4 Basic data analysis

We are going to start by loading in your data, playing with it, getting everything working and making sure you understand everything.

Read in your data

Download your data from canvas and put it in your project folder https://psu.instructure.com/courses/2381686/assignments/17157475
Use the read_excel() command to get it into R.

Quality control/explore the data

Now look at summaries, make sure that categorical data is correct, look for strange output, missing values. Basically, get the data neat and tidy, explaining any R output in the report.
- Look at the summary statistics of your dataset
- Look for strange values (-99) that could be mistakes, or need to be assigned as missing
- Make sure that all categorical variables have been turned into factors,
- Explore the data
- Write up what you did.

Consider causal chains

For each variable
- Explain why might each variable lead to differing birth rates. Refer to plots/tables/exploratory analysis to assess if you think you see this output in your data.
- Predict which variable that you think will have the largest impact on low birth-weight

5 Logistic regression vs mother’s weight

Use logistic regression and a logit transform to examine whether MOTHER’S WEIGHT is likely to lead to low birth-weight babies

All the steps needed are explained in the Titanic Logistic tutorial and in the class notes (and online) https://psu-spatial.github.io/Stat462-2025/in_T20_Logistic.html or even better, the sleep one here on canvas https://psu.instructure.com/courses/2381686/files/177167369?module_item_id=44758669

1. Fit a logistic regression model
- Use the glm() function with family = binomial(link = "logit").
- Specify your dataset in the data = ... argument.

2. Write down the model equation using symbols
- Express the model using notation like \(\beta_1\), \(b_1\) etc.
- Clearly state whether you are describing a sample model (using \(b_0\), \(b_1\)..) or the full population model (using \(\beta_0\) , \(\beta_1\)).

3. Rewrite the model using your fitted coefficients
- Replace the symbols with the numerical estimates from your model output.

4. Interpret the intercept and slope
- Interpret the intercept and slope in terms of both Log-odds and Odds
  - Include 99% confidence intervals for each coefficient
  - Discuss the Wald test from the summary output
  - Make sure to explain everything above in terms of mothers/babies etc.

5. Calculate a probability change
- Calculate the change in probability of having a low-weight baby for a mother who is 100lb compared to a mother who is 110lb.

6. Evaluate model fit
- Choose EITHER ONE of the following:
  - Perform a Hosmer-Lemeshow test, or
  - Report a Pseudo R² value.
- In a few sentences, explain what this statistic tells you in general, AND what the output says about the fit of this specific model (e.g. a good fit?). Explain your reasoning.

Note, these are only selected aspects of logistic regression to keep the project short. Also, remember the class R notes where I explain many of these things in detail.

6 Logistic regression vs Age

Now use logistic regression and a logit transform to examine whether MOTHER’S AGE is likely to lead to low birth-weight babies and look at the summary

You do not need to look write out all the steps above unless it’s useful E.g. you can just fit the model using the glm command and look at the summary. Then explain to the reader…
- Out of the two models, explain whether age or weight has the larger effect size.
- Out of the two models, explain whether age or weight is more likely to have a statistically significant and reproducible effect at the 5% level?
- Compare the AIC values across the two models (you can read/compare them directly from the model summaries) and explain which is ‘best’.
- Given everything above, which model would you personally recommend and why?

7 Reflection

Think back to the goals of this project: to explain a statistical concept clearly to a non-statistical audience. In 3–5 sentences, respond to one or more of the following prompts:

What was the most challenging part of this project for you — and how did you work through it?
If you had more time or more data, what would be the next question you’d want to explore?

8 What to submit?

Congratulations! Finished!

Knit your final report and submit the html and the rmd file.
see below for the rubric

8.1 Final 50 point quiz?

Originally, this was either going to be peer reviewed or supported via a quiz.

However the length of time we spent on Project 1 meant that I had to cut this project short, so instead of the quiz, I am again awarding 50 points for participation. e.g. you are being graded out of 180 for this project (as promised) - and I will give you 50 points for free

8.2 Grading rubric

Overall, here is what your lab should correspond to:

Topic	Points	Explanation
Clarity and Communication	30	Clear, engaging explanations suitable for a medical/professional audience. Avoids jargon or explains it well. Organized with headings, good writing style.
Code Quality & Reproducibility	25	Well-organized, commented code with suppressed unnecessary output. Report knits successfully.
Exploratory Data Analysis	40	Correct use of summary statistics, data cleaning, and visualization. Demonstrates understanding of variable types and dataset limitations.
Modeling & Interpretation: Weight	40	Correct logistic model setup; thoughtful interpretation of log-odds, odds, and predicted probabilities. Uses confidence intervals and fit measures appropriately.
Modeling & Comparison: Age vs Weight	30	Accurate comparison of model outputs (effect size, p-values, AIC). Clear explanation of which model is better and why.
Reflection	15	Thoughtful and honest reflection on the project process and learning.

PLUS 50 POINTS PARTICIPATION	50	For getting this far
TOTAL	230

And.. finished!

Project 2: A regression teaching guide