This tutorial is all about exploratory data analysis.
Most of the data we will look at is in “data.frame” format. This is a table, just like an Excel spreadsheet, with one row for each observation and one column for each variable. Each column has a column name.
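To make the row/column idea concrete, here is a minimal sketch that builds a tiny data.frame by hand. The name crew and all of its values are made up purely for illustration; they are not part of any real dataset.

```r
# A tiny, made-up data.frame: one row per observation, one column per variable
crew <- data.frame(
  name   = c("Anne", "Bart", "Calico"),   # invented values
  age    = c(28, 31, 26),
  parrot = c(TRUE, FALSE, TRUE)
)

crew          # print the whole table
names(crew)   # the column names
nrow(crew)    # number of observations (rows)
ncol(crew)    # number of variables (columns)
```

Each vector passed to data.frame() becomes one column, so all of them need to be the same length.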
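In this tutorial, I will focus on datasets that come built into R and its packages.
Let’s choose one now. I’m going to work with the pirates dataset from the yarrr package. The code below loads the packages, opens the help page for the dataset and saves a copy called piratedataset.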
library(yarrr)
library(tidyverse)
?pirates
piratedataset <- pirates
To have a look at the data, there are many options. You can:
- Print it by typing its name (e.g. type piratedataset into the console or a code chunk)
- View(variable_name) to open it in a spreadsheet-style viewer (View is a base R command from the utils package)
- head(variable_name) to see the first 6 rows (good for quick checks)
- glimpse(variable_name) to get a nice summary (glimpse is loaded with the tidyverse)
- names(variable_name) to get the column names
For example:
# Note: the table is wide, so the remaining columns wrap onto the lines below
head(piratedataset)
##   id    sex age height weight headband college tattoos tchests parrots
## 1  1   male  28 173.11   70.5      yes   JSSFP       9       0       0
## 2  2   male  31 209.25  105.6      yes   JSSFP       9      11       0
## 3  3   male  26 169.95   77.1      yes    CCCC      10      10       1
## 4  4 female  31 144.29   58.5       no   JSSFP       2       0       2
## 5  5 female  41 157.85   58.4      yes   JSSFP       9       6       4
## 6  6   male  26 190.20   85.4      yes    CCCC       7      19       0
##   favorite.pirate sword.type eyepatch sword.time beard.length
## 1    Jack Sparrow    cutlass        1       0.58           16
## 2    Jack Sparrow    cutlass        0       1.11           21
## 3    Jack Sparrow    cutlass        1       1.44           19
## 4    Jack Sparrow   scimitar        1      36.11            2
## 5            Hook    cutlass        1       0.11            0
## 6    Jack Sparrow    cutlass        1       0.59           17
##             fav.pixar grogg
## 1      Monsters, Inc.    11
## 2              WALL-E     9
## 3          Inside Out     7
## 4          Inside Out     9
## 5          Inside Out    14
## 6 Monsters University     7
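If 6 rows is not what you want, head() takes an n argument, and tail() does the same job for the end of the table. This is a small sketch using iris, a dataset that ships with R, so it runs without installing anything; I am assuming the same calls behave the same way on piratedataset.

```r
# iris comes with base R, so no extra packages are needed here
first_rows <- head(iris, n = 3)   # first 3 rows instead of the default 6
last_rows  <- tail(iris, n = 3)   # last 3 rows of the table

first_rows
last_rows
```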
To see what the column names are, you can use the names(dataset) command:
names(piratedataset)
##  [1] "id"              "sex"             "age"             "height"         
##  [5] "weight"          "headband"        "college"         "tattoos"        
##  [9] "tchests"         "parrots"         "favorite.pirate" "sword.type"     
## [13] "eyepatch"        "sword.time"      "beard.length"    "fav.pixar"      
## [17] "grogg"
Or the glimpse command:
glimpse(piratedataset)
## Rows: 1,000
## Columns: 17
## $ id              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ sex             <chr> "male", "male", "male", "female", "female", "male", "f…
## $ age             <dbl> 28, 31, 26, 31, 41, 26, 31, 31, 28, 30, 25, 20, 24, 26…
## $ height          <dbl> 173.11, 209.25, 169.95, 144.29, 157.85, 190.20, 158.05…
## $ weight          <dbl> 70.5, 105.6, 77.1, 58.5, 58.4, 85.4, 59.6, 74.5, 68.7,…
## $ headband        <chr> "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes",…
## $ college         <chr> "JSSFP", "JSSFP", "CCCC", "JSSFP", "JSSFP", "CCCC", "J…
## $ tattoos         <dbl> 9, 9, 10, 2, 9, 7, 9, 5, 12, 12, 10, 14, 8, 9, 14, 8, …
## $ tchests         <dbl> 0, 11, 10, 0, 6, 19, 1, 13, 37, 69, 1, 5, 6, 12, 70, 3…
## $ parrots         <dbl> 0, 0, 1, 2, 4, 0, 7, 7, 2, 4, 3, 3, 0, 3, 0, 1, 0, 3, …
## $ favorite.pirate <chr> "Jack Sparrow", "Jack Sparrow", "Jack Sparrow", "Jack …
## $ sword.type      <chr> "cutlass", "cutlass", "cutlass", "scimitar", "cutlass"…
## $ eyepatch        <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, …
## $ sword.time      <dbl> 0.58, 1.11, 1.44, 36.11, 0.11, 0.59, 3.01, 0.06, 0.74,…
## $ beard.length    <dbl> 16, 21, 19, 2, 0, 17, 1, 1, 1, 25, 1, 27, 0, 19, 0, 1,…
## $ fav.pixar       <chr> "Monsters, Inc.", "WALL-E", "Inside Out", "Inside Out"…
## $ grogg           <dbl> 11, 9, 7, 9, 14, 7, 9, 12, 16, 9, 7, 8, 12, 7, 9, 10, …
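If the tidyverse is not loaded, base R has a rough equivalent of glimpse() called str(). The sketch below uses the built-in iris dataset as a stand-in; col_classes is just a name I picked for the example.

```r
# str() is the base R cousin of glimpse(): one line per column,
# showing its type and the first few values
str(iris)

# To get only the type of each column, apply class() to every column
col_classes <- sapply(iris, class)
col_classes
```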
To see how many rows and columns there are, you can use the nrow() and ncol() commands:
nrow(piratedataset)
## [1] 1000
ncol(piratedataset)
## [1] 17
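You can also get both numbers in one go with dim(), which returns the row count followed by the column count. A quick sketch, again on the built-in iris dataset rather than the pirates data:

```r
# dim() returns c(number of rows, number of columns)
size <- dim(iris)
size

size[1]   # same value as nrow(iris)
size[2]   # same value as ncol(iris)
```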
To look at the summaries, there are a load of options. Choose your favourites:
- summary(dataset)
- skim(dataset) in the skimr package
- summarize(dataset) in the papeR package. This looks pretty powerful; I’m just learning it
Here is the basic summary:
summary(piratedataset) 
##        id             sex                 age            height     
##  Min.   :   1.0   Length:1000        Min.   :11.00   Min.   :129.8  
##  1st Qu.: 250.8   Class :character   1st Qu.:24.00   1st Qu.:161.4  
##  Median : 500.5   Mode  :character   Median :27.00   Median :169.9  
##  Mean   : 500.5                      Mean   :27.36   Mean   :170.2  
##  3rd Qu.: 750.2                      3rd Qu.:31.00   3rd Qu.:178.5  
##  Max.   :1000.0                      Max.   :46.00   Max.   :209.2  
##      weight         headband           college             tattoos      
##  Min.   : 33.00   Length:1000        Length:1000        Min.   : 0.000  
##  1st Qu.: 62.08   Class :character   Class :character   1st Qu.: 7.000  
##  Median : 69.55   Mode  :character   Mode  :character   Median :10.000  
##  Mean   : 69.74                                         Mean   : 9.429  
##  3rd Qu.: 76.90                                         3rd Qu.:12.000  
##  Max.   :105.60                                         Max.   :19.000  
##     tchests          parrots       favorite.pirate     sword.type       
##  Min.   :  0.00   Min.   : 0.000   Length:1000        Length:1000       
##  1st Qu.:  6.00   1st Qu.: 1.000   Class :character   Class :character  
##  Median : 15.00   Median : 2.000   Mode  :character   Mode  :character  
##  Mean   : 22.69   Mean   : 2.819                                        
##  3rd Qu.: 30.00   3rd Qu.: 4.000                                        
##  Max.   :147.00   Max.   :27.000                                        
##     eyepatch       sword.time        beard.length    fav.pixar        
##  Min.   :0.000   Min.   :  0.0000   Min.   : 0.00   Length:1000       
##  1st Qu.:0.000   1st Qu.:  0.2175   1st Qu.: 0.00   Class :character  
##  Median :1.000   Median :  0.5850   Median : 9.00   Mode  :character  
##  Mean   :0.658   Mean   :  2.5427   Mean   :10.38                     
##  3rd Qu.:1.000   3rd Qu.:  1.3300   3rd Qu.:20.00                     
##  Max.   :1.000   Max.   :169.8800   Max.   :40.00                     
##      grogg      
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :10.00  
##  Mean   :10.14  
##  3rd Qu.:12.00  
##  Max.   :21.00
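Notice that summary() only reports Length, Class and Mode for character columns such as sex or college. One way to get something more useful is to count the categories with table(). The toy data frame below is made up to stand in for the pirates columns, so the counts are not real results.

```r
# Made-up example data, standing in for a character column and a numeric column
toy <- data.frame(
  college = c("JSSFP", "CCCC", "JSSFP", "JSSFP", "CCCC"),
  age     = c(28, 31, 26, 31, 41),
  stringsAsFactors = FALSE
)

summary(toy$college)           # character data: only Length / Class / Mode
counts <- table(toy$college)   # number of rows in each category
counts

summary(toy$age)               # summary() also works on a single numeric column
```

The same pattern, for example table(piratedataset$college), should work on any character column.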
The skimr package provides the skim command, which summarises each column by type:
library(skimr)
skim(piratedataset) 
| Name | piratedataset |
|---|---|
| Number of rows | 1000 |
| Number of columns | 17 |
| Column type frequency: | |
| character | 6 |
| numeric | 11 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace | 
|---|---|---|---|---|---|---|---|
| sex | 0 | 1 | 4 | 6 | 0 | 3 | 0 | 
| headband | 0 | 1 | 2 | 3 | 0 | 2 | 0 | 
| college | 0 | 1 | 4 | 5 | 0 | 2 | 0 | 
| favorite.pirate | 0 | 1 | 4 | 12 | 0 | 6 | 0 | 
| sword.type | 0 | 1 | 5 | 8 | 0 | 4 | 0 | 
| fav.pixar | 0 | 1 | 2 | 19 | 0 | 15 | 0 | 
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | 
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 500.50 | 288.82 | 1.00 | 250.75 | 500.50 | 750.25 | 1000.00 | ▇▇▇▇▇ | 
| age | 0 | 1 | 27.36 | 5.79 | 11.00 | 24.00 | 27.00 | 31.00 | 46.00 | ▁▅▇▃▁ | 
| height | 0 | 1 | 170.23 | 12.39 | 129.83 | 161.36 | 169.86 | 178.54 | 209.25 | ▁▅▇▅▁ | 
| weight | 0 | 1 | 69.74 | 10.82 | 33.00 | 62.08 | 69.55 | 76.90 | 105.60 | ▁▃▇▅▁ | 
| tattoos | 0 | 1 | 9.43 | 3.37 | 0.00 | 7.00 | 10.00 | 12.00 | 19.00 | ▁▃▇▃▁ | 
| tchests | 0 | 1 | 22.69 | 24.46 | 0.00 | 6.00 | 15.00 | 30.00 | 147.00 | ▇▂▁▁▁ | 
| parrots | 0 | 1 | 2.82 | 3.21 | 0.00 | 1.00 | 2.00 | 4.00 | 27.00 | ▇▁▁▁▁ | 
| eyepatch | 0 | 1 | 0.66 | 0.47 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ | 
| sword.time | 0 | 1 | 2.54 | 9.33 | 0.00 | 0.22 | 0.58 | 1.33 | 169.88 | ▇▁▁▁▁ | 
| beard.length | 0 | 1 | 10.38 | 10.31 | 0.00 | 0.00 | 9.00 | 20.00 | 40.00 | ▇▂▅▂▁ | 
| grogg | 0 | 1 | 10.14 | 3.07 | 0.00 | 8.00 | 10.00 | 12.00 | 21.00 | ▁▅▇▃▁ | 
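If you cannot install skimr, a few of the columns in these tables can be approximated in base R. This is only a rough sketch in the spirit of skim(), not how skimr itself is implemented. It uses the built-in airquality dataset because, unlike the pirates data, it contains some missing values.

```r
# airquality ships with R and has missing values in some columns
n_missing     <- sapply(airquality, function(col) sum(is.na(col)))
complete_rate <- 1 - n_missing / nrow(airquality)

# Number of distinct non-missing values in each column
n_unique <- sapply(airquality, function(col) length(unique(col[!is.na(col)])))

# Combine into one small table, one row per column of airquality
data.frame(n_missing, complete_rate, n_unique)
```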
 
Website created and maintained by Helen Greatrex. Website template by Noli Brazil