7 Missing Data In Depth

This tutorial goes deeper on missing data than the QC steps covered in Tutorial 5. It covers how to identify why data is missing, how to visualise it, and how to handle it appropriately depending on your situation — including how missing data interacts with regression models.

Before you start: Make sure you have the following packages installed and loaded in your library code chunk: skimr, dplyr, naniar, tidyr, and tidyverse.

We will use a small example dataset throughout so you can see exactly what each command does:

testdata <- data.frame(
  x = c(1, 2, 3, NA, 5),
  y = c("A", "B", "C", "D", NA),
  z = c(10, 15, NA, 20, 25),
  randomnotes = c(NA, NA, NA, "hello", NA)
)
print(testdata)

##    x    y  z randomnotes
## 1  1    A 10        <NA>
## 2  2    B 15        <NA>
## 3  3    C NA        <NA>
## 4 NA    D 20       hello
## 5  5 <NA> 25        <NA>

7.1 What is missing data?

In R, missing values are stored as NA. You may also encounter NaN (Not a Number), which appears when a calculation produces an undefined result (e.g. dividing zero by zero).

7.1.0.1 Missing values in external datasets

When reading in data from outside R, missing values are not always stored as NA. Common placeholders include:

-1 (common in government datasets)
99, -99, 9999, 999 (arbitrary placeholders)
Blank cells in spreadsheets

If you know your dataset uses non-standard codes for missing values, you can tell R what to look for when reading in the file. Always check your data carefully afterwards to make sure it worked correctly.

For CSV files:

mydata <- read.csv("data.csv", na.strings = c("", "NA", "-1", "99", "9999"))

For Excel files:

mydata <- read_excel("data.xlsx", na = c("", "NA", "-1", "999"))

na.strings (CSV) and na (Excel) both tell R which values to treat as NA on import.

7.1.0.2 Turning existing values into NA

Sometimes you only notice a missing data code after reading the data in. You can overwrite specific values with NA directly. For example, to replace any impossible negative prices in a house price dataset:

HousesNY$Price[HousesNY$Price < 0] <- NA

7.2 Why is data missing?

Understanding why data is missing matters — it affects which approach to handling it is appropriate.

Missing Completely at Random (MCAR) The probability of missingness is unrelated to any variable in the dataset. Example: a sensor randomly fails. If data is MCAR, removing missing cases is usually unbiased.

Missing at Random (MAR) Missingness is related to other observed variables, but not to the missing value itself. Example: younger participants are less likely to report their income, but within each age group, missingness is random. Multiple imputation methods can help here.

Missing Not at Random (MNAR) Missingness depends on the value of the missing data itself. Example: people with higher incomes are less likely to report their income. This is the most problematic — the missingness mechanism itself needs to be modelled.

Key takeaway: Only remove missing data casually when you have good reason to believe it is MCAR. For MAR or MNAR, seek specialist statistical advice.

7.3 Identifying missing data

7.3.0.1 Counting NAs

To count missing values per column:

colSums(is.na(testdata))

##           x           y           z randomnotes 
##           1           1           1           4

To count missing values per row:

rowSums(is.na(testdata))

## [1] 1 1 2 1 2

To count NAs in a single column:

sum(is.na(testdata$x))

## [1] 1

To include missing values in a frequency table:

table(testdata$y, useNA = "ifany")

## 
##    A    B    C    D <NA> 
##    1    1    1    1    1

7.3.0.2 Summary functions

summary() and skim() both report missing value counts automatically:

summary(testdata)

##        x             y                   z         randomnotes       
##  Min.   :1.00   Length:5           Min.   :10.00   Length:5          
##  1st Qu.:1.75   Class :character   1st Qu.:13.75   Class :character  
##  Median :2.50   Mode  :character   Median :17.50   Mode  :character  
##  Mean   :2.75                      Mean   :17.50                     
##  3rd Qu.:3.50                      3rd Qu.:21.25                     
##  Max.   :5.00                      Max.   :25.00                     
##  NA's   :1                         NA's   :1

skim(testdata)

Table 7.1: Data summary
Name	testdata
Number of rows	5
Number of columns	4
_______________________
Column type frequency:
character	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
y	1	0.8	1	1	0	4	0
randomnotes	4	0.2	5	5	0	1	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
x	1	0.8	2.75	1.71	1	1.75	2.5	3.50	5	▇▇▇▁▇
z	1	0.8	17.50	6.45	10	13.75	17.5	21.25	25	▇▇▁▇▇

7.3.0.3 Visualising missing data — `naniar`

The naniar package provides helpful visualisations for understanding the pattern of missing data across your dataset.

gg_miss_var(testdata)   # count of missing values per variable

vis_miss(testdata)      # heatmap showing where NAs are across the whole dataset

vis_miss() is particularly useful for spotting whether missingness clusters in particular rows or columns — which can give clues about whether data is MCAR, MAR, or MNAR.

7.4 Handling missing data

7.4.1 Remove rows with any NA — `na.omit()`

Listwise deletion removes any row that contains an NA in any column:

data_complete <- na.omit(testdata)
data_complete

## [1] x           y           z           randomnotes
## <0 rows> (or 0-length row.names)

Warning: If your dataset has columns with lots of NAs that are irrelevant to your analysis (like a free-text notes column), na.omit() will still remove rows because of them. Always check how many rows you lose:
nrow(testdata)       # before
nrow(data_complete)  # after
If you’ve lost more than expected, use the targeted approach below instead.

7.4.2 Remove rows where specific columns are NA

To remove rows only where particular columns are missing, use filter():

data_filtered <- testdata |>
  filter(!is.na(x) & !is.na(z))

data_filtered

##   x    y  z randomnotes
## 1 1    A 10        <NA>
## 2 2    B 15        <NA>
## 3 5 <NA> 25        <NA>

!is.na(x) means “keep rows where x is NOT missing”. This is more precise than na.omit() and is usually the better choice.

You can also use complete.cases() for the same result:

data_filtered <- testdata[complete.cases(testdata$x), ]

Or drop_na() from tidyr:

data_filtered <- testdata |>
  drop_na(x, z)

7.4.3 Replace NAs with a value

Sometimes you want to keep the row but fill in a specific value — for example, replacing NA with 0 or "Unknown".

Using ifelse():

testdata$x <- ifelse(is.na(testdata$x), 99, testdata$x)
testdata

##    x    y  z randomnotes
## 1  1    A 10        <NA>
## 2  2    B 15        <NA>
## 3  3    C NA        <NA>
## 4 99    D 20       hello
## 5  5 <NA> 25        <NA>

Using replace_na() from tidyr (cleaner for multiple columns at once):

testdata <- testdata |>
  replace_na(list(x = 0, y = "Unknown"))
testdata

##    x       y  z randomnotes
## 1  1       A 10        <NA>
## 2  2       B 15        <NA>
## 3  3       C NA        <NA>
## 4 99       D 20       hello
## 5  5 Unknown 25        <NA>

Note: Replacing NAs with a fixed value is a form of imputation. Only do this if you have a principled reason — for example, replacing NA counts with 0 when you know the true value is zero, not unknown.

7.5 Missing data in common functions

You don’t always need to remove missing data in advance. Many R functions have a built-in option to ignore NAs.

7.5.0.1 `na.rm = TRUE`

By default, most summary functions return NA if any values are missing:

mean(testdata$z)

## [1] NA

Add na.rm = TRUE to ignore missing values:

mean(testdata$z, na.rm = TRUE)

## [1] 17.5

sd(testdata$z,   na.rm = TRUE)

## [1] 6.454972

na.rm = TRUE works in mean(), median(), sd(), sum(), min(), max(), and most other summary functions.

7.5.0.2 Correlation and missing data

The cor() function doesn’t use na.rm — it has its own argument:

cor(testdata$x, testdata$z, use = "complete.obs")

## [1] 0.2919672

For a full correlation matrix, pairwise deletion uses all available data for each pair of variables, rather than removing entire rows:

cor(testdata, use = "pairwise.complete.obs")

Pairwise deletion retains more data than listwise deletion, but can result in different sample sizes per correlation — worth noting when interpreting results.

7.6 Missing data in regression models

Most modelling functions in R remove rows with missing values automatically. The question is how they do it — and this matters when you look at predictions.

model_omit    <- lm(z ~ x, data = testdata, na.action = na.omit)
model_exclude <- lm(z ~ x, data = testdata, na.action = na.exclude)

Method	What it does	When to use
`na.omit`	Removes rows with missing data; drops them from predictions	When you don’t need predictions for missing rows
`na.exclude`	Removes rows with missing data; keeps `NA` in predictions at the original row positions	When you need predictions to align with your original dataset

I recommend na.exclude unless you have a specific reason not to.

The difference becomes clear when you look at predictions:

predict(model_omit)     # predictions only for complete cases — shorter vector

##        1        2        4        5 
## 16.49309 16.53220 20.32520 16.64951

predict(model_exclude)  # predictions with NA preserved at missing rows — same length as original data

##        1        2        3        4        5 
## 16.49309 16.53220       NA 20.32520 16.64951

na.exclude keeps the structure of your dataset intact, which matters in time series or longitudinal analyses where the position of missing values is meaningful.

7.7 Command reference

A quick-copy summary of all commands covered in this tutorial. Replace testdata with your dataset name.

# -------------------------------------------
# Reading in data with non-standard NA codes
# -------------------------------------------
mydata <- read.csv("data.csv", na.strings = c("", "NA", "-1", "99", "9999"))
mydata <- read_excel("data.xlsx", na = c("", "NA", "-1", "999"))

# Force specific values to NA after reading in
mydata$column[mydata$column < 0] <- NA

# -------------------------------------------
# Identifying missing data
# -------------------------------------------
colSums(is.na(testdata))          # NAs per column
rowSums(is.na(testdata))          # NAs per row
sum(is.na(testdata$x))            # NAs in one column
table(testdata$y, useNA = "ifany") # frequency table including NAs
summary(testdata)                  # summary with NA counts
skim(testdata)                     # detailed summary with NA counts
gg_miss_var(testdata)              # naniar: NAs per variable (plot)
vis_miss(testdata)                 # naniar: NA heatmap (plot)

# -------------------------------------------
# Removing rows with missing data
# -------------------------------------------
data_complete <- na.omit(testdata)                          # remove any row with any NA
data_filtered <- testdata |> filter(!is.na(x) & !is.na(z)) # targeted: specific columns only
data_filtered <- testdata |> drop_na(x, z)                  # tidyr equivalent

# -------------------------------------------
# Replacing NAs with a value
# -------------------------------------------
testdata$x <- ifelse(is.na(testdata$x), 99, testdata$x)         # replace with 99
testdata <- testdata |> replace_na(list(x = 0, y = "Unknown"))   # replace multiple columns

# -------------------------------------------
# Handling NAs inside functions
# -------------------------------------------
mean(testdata$x, na.rm = TRUE)
sd(testdata$x,   na.rm = TRUE)
cor(testdata$x, testdata$z, use = "complete.obs")
cor(testdata, use = "pairwise.complete.obs")

# -------------------------------------------
# Regression with missing data
# -------------------------------------------
lm(z ~ x, data = testdata, na.action = na.omit)    # drops missing rows from predictions
lm(z ~ x, data = testdata, na.action = na.exclude) # preserves NA positions in predictions (recommended)

7.8 Troubleshooting

Function returned NA instead of a number Your column contains missing values. Add na.rm = TRUE:

mean(testdata$x, na.rm = TRUE)

na.omit() removed far more rows than expected You probably have a column with many NAs that isn’t relevant to your analysis (e.g. a notes column). Use filter(!is.na(columnname)) or drop_na(col1, col2) to target only the columns that matter.

cor() returned NA even with na.rm = TRUE cor() doesn’t use na.rm. Use use = "complete.obs" or use = "pairwise.complete.obs" instead.

Predictions from lm() are a different length to my dataset You used na.action = na.omit (the default). Switch to na.action = na.exclude to preserve NA positions in the prediction vector so it matches your original data length.

vis_miss() or gg_miss_var() not found The naniar package isn’t loaded. Add library(naniar) to your library code chunk.

6 Summarising Data

8 Plots