Pierrette Lo 7/10/2020
- Chapter 7.1-7.4
library(tidyverse)
-
The {skimr} package is super useful for EDA, with more summary statistics in a nicer format than with
summary()
. -
Bonus: check out the new Palmer Penguins dataset - packaged by Alison Horst, who also created some fun artwork to go with it.
# install {palmerpenguins} before using for the first time using the command below:
# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)
# take a look at first few rows of the data
head(penguins) %>% view()
Then get summary statistics with {skimr}
library(skimr)
## Warning: package 'skimr' was built under R version 3.6.3
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 4 |
________________________ | |
Group variables | None |
Data summary
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
- categorical variable = bar chart
- continuous variable = histogram
- use
binwidth
inggplot()
andcut_width()
incount()
to bin a continuous variable into discrete intervals.
Example:
# without cut_width - get count for every possible value of carat
diamonds %>%
count(carat)
## # A tibble: 273 x 2
## carat n
## <dbl> <int>
## 1 0.2 12
## 2 0.21 9
## 3 0.22 5
## 4 0.23 293
## 5 0.24 254
## 6 0.25 212
## 7 0.26 253
## 8 0.27 233
## 9 0.28 198
## 10 0.290 130
## # ... with 263 more rows
# with cut_width - more manageable number of groups
diamonds %>%
count(cut_width(carat, 0.5))
## # A tibble: 11 x 2
## `cut_width(carat, 0.5)` n
## <fct> <int>
## 1 [-0.25,0.25] 785
## 2 (0.25,0.75] 29498
## 3 (0.75,1.25] 15977
## 4 (1.25,1.75] 5313
## 5 (1.75,2.25] 2002
## 6 (2.25,2.75] 322
## 7 (2.75,3.25] 32
## 8 (3.25,3.75] 5
## 9 (3.75,4.25] 4
## 10 (4.25,4.75] 1
## 11 (4.75,5.25] 1
geom_freqpoly()
gives you a histogram with lines instead of bars -
better for comparing different groups
# bars
diamonds %>%
ggplot(aes(x = carat, fill = cut)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# lines
diamonds %>%
ggplot(aes(x = carat, colour = cut)) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- Explore the distribution of each of the
x
,y
, andz
variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
A simple summary:
diamonds %>%
select(x:z) %>%
summary()
## x y z
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :10.740 Max. :58.900 Max. :31.800
A more detailed summary using the {skimr} package:
skimr::skim(diamonds, x:z)
Name | diamonds |
Number of rows | 53940 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Data summary
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
x | 0 | 1 | 5.73 | 1.12 | 0 | 4.71 | 5.70 | 6.54 | 10.74 | ▁▁▇▃▁ |
y | 0 | 1 | 5.73 | 1.14 | 0 | 4.72 | 5.71 | 6.54 | 58.90 | ▇▁▁▁▁ |
z | 0 | 1 | 3.54 | 0.71 | 0 | 2.91 | 3.53 | 4.04 | 31.80 | ▇▁▁▁▁ |
Observations:
- According to the help (
?diamonds
),x
= length,y
= width,z
= depth - Looking at the summary,
x
andy
are very similar dimensions, whereasz
is smaller (makes sense when you imagine the shape of a typical diamond) - Both
y
andz
have max values (58.9 and 31.8, respectively) that look like they could be errors - The min values for all 3 variables are 0, which is probably also an error.
I also looked at the data graphically (see Chapter 12 for info on “pivoting” a table into a longer format):
diamonds %>%
pivot_longer(x:z, names_to = "dimension", values_to = "dim_in_mm") %>%
ggplot(aes(x = dim_in_mm)) +
geom_histogram(binwidth = 0.1) +
facet_wrap(~dimension, scales = "free")
Observations:
- The x-axis for
y
andz
also indicate the presence of large outliers.
Let’s zoom in on the x-axis to get a better idea of variability between these groups:
diamonds %>%
pivot_longer(x:z, names_to = "dimension", values_to = "dim_in_mm") %>%
ggplot(aes(x = dim_in_mm)) +
geom_histogram(binwidth = 0.1) +
facet_wrap(~dimension, scales = "free") +
coord_cartesian(xlim = c(0, 10))
diamonds %>%
arrange(desc(z))
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Very Good E VS1 61.8 54.7 1970 5.12 5.15 31.8
## 2 2 Premium H SI2 58.9 57 12210 8.09 58.9 8.06
## 3 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
## 4 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
## 5 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
## 6 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 7 4 Very Good I I1 63.3 58 15984 10.0 9.94 6.31
## 8 3.4 Fair D I1 66.8 52 15964 9.42 9.34 6.27
## 9 4.01 Premium J I1 62.5 62 15223 10.0 9.94 6.24
## 10 4.01 Premium I I1 61 61 15223 10.1 10.1 6.17
## # ... with 53,930 more rows
- It looks like
z
has the least variability, and againx
andy
are very similar once you remove the outliers. - The data look “spiky” - maybe because people tend to round to certain numbers instead of being precise?
- Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
With smaller binwidths, you start to notice a gap around $1500 (this is not apparent with larger binwidths).
diamonds %>%
ggplot(aes(x = price)) +
geom_histogram(binwidth = 500)
diamonds %>%
ggplot(aes(x = price)) +
geom_histogram(binwidth = 50)
diamonds %>%
ggplot(aes(x = price)) +
geom_histogram(binwidth = 10) +
coord_cartesian(xlim = c(0, 2500))
I dont know why there are no diamonds that cost $1500 - another error?
Also, if you decrease the binwidth further, the data again look “spiky”. My theory is that there are certain numbers that people use for price (e.g. ending in ‘99’) because it “looks better”.
- How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>%
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.01) +
coord_cartesian(xlim = c(0.99, 1))
There are almost no diamonds that are 0.99 carats. I assume this is because people have a tendency to round up. It’s interesting to see how a dataset consisting of hard numbers can still be so subjective!
- Compare and contrast
coord_cartesian()
vsxlim()
orylim()
when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
-
Default binwidth is 30 if you don’t specify
-
xlim()
andylim()
= “crop” - zoom in and remove data that fall outside the limits (you will get a warning message indicating how many data points were removed) -
coord_cartesian()
= “zoom” - zooms in without removing data -
coord_cartesian()
calculates binwidths and counts first, then zooms in, so some bars may be cut off -
xlim()
andylim()
remove data before calculating binwidths and counts, so bars are never cut off
diamonds %>%
ggplot(aes(x = carat)) +
geom_histogram() +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 5000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
diamonds %>%
ggplot(aes(x = carat)) +
geom_histogram() +
xlim(0, 1) +
ylim(0, 5000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 17502 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).
- What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
I used the penguins
dataset here because it contains missing values
(diamonds
does not).
In a bar chart – which is used for categorical variables – NAs are considered to be their own category.
penguins %>%
ggplot(aes(x = sex)) +
geom_bar()
In a histogram, the continuous x
variable is first grouped into bins,
then the number of observations in each bin is counted. NAs don’t belong
in any of the bins, so they get dropped (note warning message telling
you how many data points were removed).
penguins %>%
ggplot(aes(x = flipper_length_mm)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
- What does na.rm = TRUE do in mean() and sum()?
This argument is needed to ignore NAs when calculating the mean or sum.
Without na.rm = TRUE
:
mean(penguins$flipper_length_mm)
## [1] NA
With na.rm = TRUE
:
mean(penguins$flipper_length_mm, na.rm = T)
## [1] 200.9152
As an aside, I just learned that using T
and F
instead of TRUE
or
FALSE
is bad practice! TRUE
(1) and FALSE
(0) are “reserved” in
R, i.e. set in stone - you can’t edit them to have other values. You can
overwrite T and F and get some wacky results if you’ve forgotten you did
this. So always write out TRUE
and FALSE
. Sorry for misleading you
all before!
T <- 5 # this works
T <- FALSE # this also works
TRUE <- 5 # this doesn't
## Error in TRUE <- 5: invalid (do_set) left-hand side to assignment