Pierrette Lo 8/7/2020
- Chapter 7.5-7.8
library(tidyverse)
- Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
cut_width() = specify the width of each bin
cut_number() = specify the number of bins; each bin holds roughly the same number of observations
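A quick way to see the difference is to tabulate the bins each function produces. This is just a sketch; the width of 5000 and the 5 bins are values I picked for illustration.

```r
library(ggplot2)  # provides cut_width(), cut_number() and the diamonds data

# cut_width(): every bin is 5000 wide, so the counts per bin are very uneven
width_bins <- cut_width(diamonds$price, 5000, boundary = 0)
table(width_bins)

# cut_number(): 5 bins with roughly equal counts, so the bin widths are uneven
number_bins <- cut_number(diamonds$price, 5)
table(number_bins)
```

With cut_width() you control the boundaries but not how many diamonds land in each bin; with cut_number() it is the other way around.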
Things to consider:
- number of overlapping lines
- number of different colours for the eye to distinguish
- whether the bin boundaries are “nice”
E.g. too many lines, too many different colours:
diamonds %>%
  ggplot(aes(x = carat)) +
  geom_freqpoly(aes(color = cut_width(price, 1000)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bin boundaries not easy to interpret:
diamonds %>%
  ggplot(aes(x = carat)) +
  geom_freqpoly(aes(color = cut_number(price, 5)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
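One possible way around both problems (my own sketch, not from the book or the solutions manual) is to use a few wide price bins with round boundaries, and to set binwidth for carat explicitly so the stat_bin() message goes away. The values 5000 and 0.1 are assumptions chosen for illustration.

```r
library(ggplot2)

# Four price bins with round boundaries: few lines, few colours
p <- ggplot(diamonds, aes(x = carat)) +
  geom_freqpoly(
    aes(color = cut_width(price, 5000, boundary = 0)),
    binwidth = 0.1  # bin width along the carat axis
  ) +
  labs(color = "price, binned")
p
```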
- Visualise the distribution of carat, partitioned by price.
Note that the bins on the y axis are labelled using interval notation.
- Square bracket = number included
- Round bracket = number excluded
Use boundary = 0 to make sure 0 is included in the first interval.
diamonds %>%
  ggplot(aes(x = carat, y = cut_width(price, 2000, boundary = 0))) +
  geom_boxplot() +
  ylab("price, binned")
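To convince myself how the brackets work, here is a tiny example on a made-up vector (the four values are hypothetical, picked to sit on and just past a bin edge):

```r
library(ggplot2)

# Hypothetical prices sitting on and just past the 2000 bin edge
prices <- c(0, 1000, 2000, 2001)
bins <- cut_width(prices, 2000, boundary = 0)

levels(bins)      # the interval labels
as.integer(bins)  # which bin each price fell into
```

The value 2000 lands in the first bin because that bin is closed on the right (square bracket), while 2001 moves into the second bin, which is open on the left (round bracket).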
- How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
Plot price distribution within bins of carat:
diamonds %>%
  ggplot(aes(x = price, y = cut_width(carat, 1, boundary = 0))) +
  geom_boxplot() +
  ylab("carat, binned")
It looks like the largest diamonds have the least variation in price. This might be because they are rarer, so maybe they’re always expensive? Smaller diamonds might have more variation in other factors (cut, clarity, color) that also impact price.
Note that the solutions manual might be wrong - I think they were looking at the previous plot of carat distribution by price when they said there was more variation in the largest diamonds (I think they were looking at the largest price bin).
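One way to check this numerically rather than by eye is to summarise price within each carat bin. This is only a sketch: the column names carat_bin, median_price and iqr_price are my own, and IQR is just one possible measure of variation.

```r
library(dplyr)
library(ggplot2)  # for cut_width() and the diamonds data

price_by_carat <- diamonds %>%
  mutate(carat_bin = cut_width(carat, 1, boundary = 0)) %>%
  group_by(carat_bin) %>%
  summarise(
    n = n(),                       # how many diamonds are in the bin
    median_price = median(price),
    iqr_price = IQR(price)         # spread of the middle 50% of prices
  )
price_by_carat
```

The n column matters here: with very few diamonds in a bin, a narrow boxplot may say more about the sample size than about the prices.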
- Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
Not sure if this is combining 2 techniques, but it looks cool:
library(hexbin)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex(aes(fill = cut), color = "grey")
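Another option that more clearly combines two techniques from the chapter (binning a continuous variable, then drawing boxplots coloured by a categorical one) might look like this. It is a sketch of one possible answer, and the choice of 5 carat bins is arbitrary.

```r
library(ggplot2)

# Bin carat into 5 groups of roughly equal size, then draw one boxplot
# of price per cut within each carat bin
p <- ggplot(diamonds, aes(x = cut_number(carat, 5), y = price, color = cut)) +
  geom_boxplot() +
  xlab("carat, binned")
p
```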
- Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
Remember that the x and y the question refers to are the dimensions of a diamond in mm.
A binned plot doesn’t highlight the outliers or the strong relationship between x and y.
ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
As the text mentions, you can leave out the x = and y = (in addition to data = and mapping =) to make your code more concise. However, I like to leave in the x and y as I think it makes the code more readable (which to me is more important than just being concise).
Chapter 10 and first half of Chapter 11.