R4DS Study Group - Week 14

Pierrette Lo 7/10/2020

This week’s assignment
Ch 7:3 Variation
Ch 7:4 Missing values

This week’s assignment

Chapter 7.1-7.4

library(tidyverse)

Notes

The {skimr} package is super useful for EDA, with more summary statistics in a nicer format than with summary().
Bonus: check out the new Palmer Penguins dataset - packaged by Alison Horst, who also created some fun artwork to go with it.

# install {palmerpenguins} before using for the first time using the command below:
# remotes::install_github("allisonhorst/palmerpenguins")

library(palmerpenguins)

# take a look at first few rows of the data

head(penguins) %>% view()

Then get summary statistics with {skimr}

library(skimr)

## Warning: package 'skimr' was built under R version 3.6.3

skim(penguins)


Name	penguins
Number of rows	344
Number of columns	7
_______________________
Column type frequency:
factor	3
numeric	4
________________________
Group variables	None

Data summary

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂

Ch 7:3 Variation

Notes

categorical variable = bar chart
continuous variable = histogram
use binwidth in ggplot() and cut_width() in count() to bin a continuous variable into discrete intervals.

Example:

# without cut_width - get count for every possible value of carat

diamonds %>% 
  count(carat)

## # A tibble: 273 x 2
##    carat     n
##    <dbl> <int>
##  1 0.2      12
##  2 0.21      9
##  3 0.22      5
##  4 0.23    293
##  5 0.24    254
##  6 0.25    212
##  7 0.26    253
##  8 0.27    233
##  9 0.28    198
## 10 0.290   130
## # ... with 263 more rows

# with cut_width - more manageable number of groups

diamonds %>% 
  count(cut_width(carat, 0.5))

## # A tibble: 11 x 2
##    `cut_width(carat, 0.5)`     n
##    <fct>                   <int>
##  1 [-0.25,0.25]              785
##  2 (0.25,0.75]             29498
##  3 (0.75,1.25]             15977
##  4 (1.25,1.75]              5313
##  5 (1.75,2.25]              2002
##  6 (2.25,2.75]               322
##  7 (2.75,3.25]                32
##  8 (3.25,3.75]                 5
##  9 (3.75,4.25]                 4
## 10 (4.25,4.75]                 1
## 11 (4.75,5.25]                 1

geom_freqpoly() gives you a histogram with lines instead of bars - better for comparing different groups

# bars

diamonds %>% 
  ggplot(aes(x = carat, fill = cut)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# lines

diamonds %>% 
  ggplot(aes(x = carat, colour = cut)) +
  geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercises

Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

A simple summary:

diamonds %>% 
  select(x:z) %>% 
  summary()

##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800

A more detailed summary using the {skimr} package:

skimr::skim(diamonds, x:z)


Name	diamonds
Number of rows	53940
Number of columns	10
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Data summary

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
x	1	5.73	1.12	4.71	5.70	6.54	10.74	▁▁▇▃▁
y	1	5.73	1.14	4.72	5.71	6.54	58.90	▇▁▁▁▁
z	1	3.54	0.71	2.91	3.53	4.04	31.80	▇▁▁▁▁

Observations:

According to the help (?diamonds), x = length, y = width, z = depth
Looking at the summary, x and y are very similar dimensions, whereas z is smaller (makes sense when you imagine the shape of a typical diamond)
Both y and z have max values (58.9 and 31.8, respectively) that look like they could be errors
The min values for all 3 variables are 0, which is probably also an error.

I also looked at the data graphically (see Chapter 12 for info on “pivoting” a table into a longer format):

diamonds %>% 
  pivot_longer(x:z, names_to = "dimension", values_to = "dim_in_mm") %>% 
  ggplot(aes(x = dim_in_mm)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~dimension, scales = "free")

Observations:

The x-axis for y and z also indicate the presence of large outliers.

Let’s zoom in on the x-axis to get a better idea of variability between these groups:

diamonds %>% 
  pivot_longer(x:z, names_to = "dimension", values_to = "dim_in_mm") %>% 
  ggplot(aes(x = dim_in_mm)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~dimension, scales = "free") +
  coord_cartesian(xlim = c(0, 10))

diamonds %>% 
  arrange(desc(z))

## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8 
##  2  2    Premium   H     SI2      58.9  57   12210  8.09 58.9   8.06
##  3  5.01 Fair      J     I1       65.5  59   18018 10.7  10.5   6.98
##  4  4.5  Fair      J     I1       65.8  58   18531 10.2  10.2   6.72
##  5  4.13 Fair      H     I1       64.8  61   17329 10     9.85  6.43
##  6  3.65 Fair      H     I1       67.1  53   11668  9.53  9.48  6.38
##  7  4    Very Good I     I1       63.3  58   15984 10.0   9.94  6.31
##  8  3.4  Fair      D     I1       66.8  52   15964  9.42  9.34  6.27
##  9  4.01 Premium   J     I1       62.5  62   15223 10.0   9.94  6.24
## 10  4.01 Premium   I     I1       61    61   15223 10.1  10.1   6.17
## # ... with 53,930 more rows

It looks like z has the least variability, and again x and y are very similar once you remove the outliers.
The data look “spiky” - maybe because people tend to round to certain numbers instead of being precise?

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

With smaller binwidths, you start to notice a gap around $1500 (this is not apparent with larger binwidths).

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(binwidth = 500)

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(binwidth = 50)

diamonds %>% 
  ggplot(aes(x = price)) + 
  geom_histogram(binwidth = 10) +
  coord_cartesian(xlim = c(0, 2500))

I dont know why there are no diamonds that cost $1500 - another error?

Also, if you decrease the binwidth further, the data again look “spiky”. My theory is that there are certain numbers that people use for price (e.g. ending in ‘99’) because it “looks better”.

How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.01) +
  coord_cartesian(xlim = c(0.99, 1))

There are almost no diamonds that are 0.99 carats. I assume this is because people have a tendency to round up. It’s interesting to see how a dataset consisting of hard numbers can still be so subjective!

Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

Default binwidth is 30 if you don’t specify
xlim() and ylim() = “crop” - zoom in and remove data that fall outside the limits (you will get a warning message indicating how many data points were removed)
coord_cartesian() = “zoom” - zooms in without removing data
coord_cartesian() calculates binwidths and counts first, then zooms in, so some bars may be cut off
xlim() and ylim() remove data before calculating binwidths and counts, so bars are never cut off

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 5000))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram() +
  xlim(0, 1) +
  ylim(0, 5000)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 17502 rows containing non-finite values (stat_bin).

## Warning: Removed 3 rows containing missing values (geom_bar).

Ch 7:4 Missing values

Exercises

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

I used the penguins dataset here because it contains missing values (diamonds does not).

In a bar chart – which is used for categorical variables – NAs are considered to be their own category.

penguins %>% 
  ggplot(aes(x = sex)) +
  geom_bar()

In a histogram, the continuous x variable is first grouped into bins, then the number of observations in each bin is counted. NAs don’t belong in any of the bins, so they get dropped (note warning message telling you how many data points were removed).

penguins %>% 
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bin).

What does na.rm = TRUE do in mean() and sum()?

This argument is needed to ignore NAs when calculating the mean or sum.

Without na.rm = TRUE:

mean(penguins$flipper_length_mm)

## [1] NA

With na.rm = TRUE:

mean(penguins$flipper_length_mm, na.rm = T)

## [1] 200.9152

As an aside, I just learned that using T and F instead of TRUE or FALSE is bad practice! TRUE (1) and FALSE (0) are “reserved” in R, i.e. set in stone - you can’t edit them to have other values. You can overwrite T and F and get some wacky results if you’ve forgotten you did this. So always write out TRUE and FALSE. Sorry for misleading you all before!

T <- 5      # this works

T <- FALSE  # this also works

TRUE <- 5   # this doesn't

## Error in TRUE <- 5: invalid (do_set) left-hand side to assignment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

r4ds_week14.md

r4ds_week14.md

R4DS Study Group - Week 14

This week’s assignment

Notes

Ch 7:3 Variation

Notes

Exercises

Ch 7:4 Missing values

Exercises

Files

r4ds_week14.md

Latest commit

History

r4ds_week14.md

File metadata and controls

R4DS Study Group - Week 14

This week’s assignment

Notes

Ch 7:3 Variation

Notes

Exercises

Ch 7:4 Missing values

Exercises