---
title: "Introduction to the tidyverse"
subtitle: "The tidyverse way"
---
## Intro to the tidyverse
::: {.alert .alert-info}
The [Tidyverse](https://www.tidyverse.org/) is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy.
### Advantages of the tidyverse
- Consistent functions.
- Workflow coverage.
- A parsimonious approach to the development of data science tools.
:::


### Tidyverse Principles
- `tibbles` as main data structures.
- Tidy data where rows are single observations and columns the variables observed.
- Piping the outputs of tidyverse functions as inputs to subsequent functions.
```{r}
#| eval: false
install.packages(c("tibble", "dplyr"))
```
------------------------------------------------------------------------
## `tibbles`
> `tibbles` are one of the unifying features of the tidyverse, and are the tidyverse version of a data.frame (I will use the two terms interchangeably in the rest of the text).
### Features
- Better printing behaviour.
- Never coerces characters to factors.
- More robust error handling.
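
A minimal sketch of the last two points, assuming the `tibble` package is installed. The `species` and `height` columns are made-up toy data, not part of this tutorial's data set.

```{r}
# Hypothetical toy data, for illustration only
df <- data.frame(species = c("oak", "elm"), height = c(20, 25),
                 stringsAsFactors = TRUE)
tbl <- tibble::tibble(species = c("oak", "elm"), height = c(20, 25))

class(df$species)   # factor, because coercion was requested
class(tbl$species)  # character: tibble() leaves strings alone

# `$` on a data.frame partially matches column names;
# a tibble returns NULL (with a warning) instead of guessing
df$spec
suppressWarnings(tbl$spec)
```

Since R 4.0.0, `data.frame()` only converts strings to factors when asked, which is why `stringsAsFactors = TRUE` is set explicitly here.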
### Creating tibbles
#### Coercing data.frames
You can coerce a data.frame to a tibble with `as_tibble()`:
```{r}
tree_tbl <- tibble::as_tibble(trees)
tree_tbl
```
As you can see, printed tibbles are much tidier and more informative, and the print method is designed so that you don’t accidentally overwhelm your console when you print large data.frames.
#### Creating new tibbles
You can create a new tibble from individual vectors with `tibble()`. `tibble()` will automatically recycle inputs of length 1, and allows you to refer to variables that you just created:
```{r}
tibble::tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
```
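
The recycling rule is stricter than in `data.frame()`. A short sketch of the difference, assuming the `tibble` package is installed (the object names `ok` and `res` and the marker string are illustrative only):

```{r}
# Length-1 inputs are recycled to match the longest column
ok <- tibble::tibble(x = 1:4, y = 0)
nrow(ok)

# Any other length mismatch is an error in tibble();
# the string below is just a marker returned when the error is caught
res <- tryCatch(
  tibble::tibble(x = 1:4, y = 1:2),
  error = function(e) "error: incompatible column sizes"
)
res

# data.frame() silently recycles y to 1, 2, 1, 2 instead
data.frame(x = 1:4, y = 1:2)
```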
## Subsetting tibbles
### Base R subsetting
We can use all the tools we learnt to subset data.frames to subset tibbles.
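
As a quick sketch (the object names `big`, `one_col` and `height_vec` are illustrative), the usual `[`, `[[` and `$` operators work as before. One difference worth knowing: single-column `[` on a tibble returns a tibble rather than a vector.

```{r}
tree_tbl <- tibble::as_tibble(trees)

# Rows by logical condition, columns by name
big <- tree_tbl[tree_tbl$Girth > 14, c("Height", "Volume")]

# Single-column `[` keeps the tibble; `[[` and `$` extract the vector
one_col <- tree_tbl[, "Height"]
height_vec <- tree_tbl[["Height"]]

class(one_col)
class(height_vec)
```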
### Subsetting using the tidyverse
You can also subset `tibbles` using tidyverse functions from package `dplyr`. `dplyr` verbs are inspired by SQL vocabulary and designed to be more intuitive.
Let's load `dplyr`:
```{r}
#| message: false
#| warning: false
library(dplyr)
```
The first argument of the main `dplyr` functions is a `tibble` (or data.frame).
#### Filtering rows with `filter()`
`filter()` allows us to subset observations (rows) based on their values. The first argument is the name of the data frame. The **second and subsequent arguments are the expressions that filter the data frame.**
```{r}
filter(tree_tbl, Girth > 14)
```
`dplyr` executes the filtering operation by generating a logical vector and returns a new `tibble` containing the rows that match the filtering conditions. You can therefore use any of the logical operators we learnt when subsetting with `[`.
Note as well that `dplyr` lets us use bare column names, which makes conditional statements much clearer and more concise.
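
A short sketch of combining conditions, with the base R equivalent for comparison. The thresholds are arbitrary and the object names are illustrative.

```{r}
library(dplyr, warn.conflicts = FALSE)
tree_tbl <- tibble::as_tibble(trees)

# Comma-separated conditions are combined with AND
both <- filter(tree_tbl, Girth > 14, Height < 80)

# The same rows, written with an explicit `&`
both_amp <- filter(tree_tbl, Girth > 14 & Height < 80)

# `|` keeps rows that satisfy either condition
either <- filter(tree_tbl, Girth > 14 | Height < 70)

# Base R equivalent of `both`, using `[` and `$`
both_base <- tree_tbl[tree_tbl$Girth > 14 & tree_tbl$Height < 80, ]
```

Compared with the base R line, the `filter()` calls never repeat `tree_tbl$`.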
#### Slicing rows with `slice()`
`slice()` selects rows by position, much like subsetting with row indices in base R.
```{r}
slice(tree_tbl, 2)
slice(tree_tbl, 2:5)
```
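
A few variations, sketched under the assumption that `dplyr` 1.0.0 or later is installed (needed for `slice_head()` and `slice_tail()`). The object names are illustrative.

```{r}
library(dplyr, warn.conflicts = FALSE)
tree_tbl <- tibble::as_tibble(trees)

# Negative positions drop rows, as with `[`
no_first <- slice(tree_tbl, -1)

# slice_head() and slice_tail() take the first or last n rows
first3 <- slice_head(tree_tbl, n = 3)
last3 <- slice_tail(tree_tbl, n = 3)

# Base R equivalent of slice(tree_tbl, 2:5)
rows_2_5 <- tree_tbl[2:5, ]
```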
#### Selecting columns with `select()`
`select()` allows us to subset columns in tibbles using operations based on the names of the variables.
In `dplyr` we use **unquoted column names** (i.e. `Volume` rather than `"Volume"`).
```{r}
select(tree_tbl, Height, Volume)
```
Behind the scenes, `select()` matches any variable arguments to column names, creating a vector of column indices. This is then used to subset the `tibble`.
As such, we can create ranges of variables using their names and `:`.
```{r}
select(tree_tbl, Height:Volume)
```
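
Because names are translated to positions, a selection by position gives the same result as a selection by name. A small sketch (object names are illustrative):

```{r}
library(dplyr, warn.conflicts = FALSE)
tree_tbl <- tibble::as_tibble(trees)

# Height and Volume are columns 2 and 3, so these select the same columns
by_name <- select(tree_tbl, Height:Volume)
by_pos <- select(tree_tbl, 2:3)

# A minus sign drops a column, as with negative indices in `[`
no_girth <- select(tree_tbl, -Girth)

names(by_name)
names(by_pos)
names(no_girth)
```

Selecting by name is usually the safer choice, since it keeps working if the column order changes.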
There are also a number of helper functions to make selections easier. For example, we can use `all_of()` to provide a character vector of column names to select. It supersedes the older `one_of()`, which still works.
```{r}
select(tree_tbl, all_of(c("Height", "Volume")))
```
[Find out more about `dplyr` helper functions](https://tidyselect.r-lib.org/reference/select_helpers.html)
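
A few more helpers, sketched on the same data. `"Age"` is a made-up column name that does not exist in `trees`, used to show how `any_of()` handles missing names.

```{r}
library(dplyr, warn.conflicts = FALSE)
tree_tbl <- tibble::as_tibble(trees)

# Columns whose names start with "H"
h_cols <- select(tree_tbl, starts_with("H"))

# Columns whose names contain "i"
i_cols <- select(tree_tbl, contains("i"))

# any_of() skips names that do not exist instead of raising an error
some_cols <- select(tree_tbl, any_of(c("Height", "Age")))

names(h_cols)
names(i_cols)
names(some_cols)
```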
## The pipe operator `%>%`
> **Pipes** are a powerful tool for **clearly expressing a sequence of multiple operations.**
> They help us write code in a way that is **easier to read and understand**. They also **remove the need for creating intermediate objects**.
::: {.alert .alert-info}
Pipes take the output of the preceding expression and pass it as the first argument to the next expression.
:::
Suppose we want to get the first two rows and only columns `Girth` and `Volume`. We can chain the two operations together using the pipe.
```{r}
tree_tbl %>%
select(Girth, Volume) %>%
slice(1:2)
```
This form is very understandable because it focuses on intuitive verbs, not nouns. You can read this series of function compositions like it’s a set of imperative actions.
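
For comparison, here is a sketch of the same two operations written without the pipe, first as nested calls and then with an intermediate object (the names `nested`, `tree_cols`, `stepwise` and `piped` are illustrative):

```{r}
library(dplyr, warn.conflicts = FALSE)
tree_tbl <- tibble::as_tibble(trees)

# Nested calls: read from the inside out
nested <- slice(select(tree_tbl, Girth, Volume), 1:2)

# Intermediate object: one extra name to keep track of
tree_cols <- select(tree_tbl, Girth, Volume)
stepwise <- slice(tree_cols, 1:2)

# Piped: read top to bottom
piped <- tree_tbl %>%
  select(Girth, Volume) %>%
  slice(1:2)

identical(nested$Girth, piped$Girth)
identical(stepwise$Volume, piped$Volume)
```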
As mentioned, the **default behaviour** of the pipe is to pipe objects through **as the first argument** of the next expression. However, we can **pipe the object into a different argument by using `.` as a placeholder**.
```{r}
tree_tbl %>%
lm(Girth ~ Height, data = .)
```
*Note: The pipe, `%>%`, comes from the `magrittr` package by Stefan Milton Bache. Packages in the tidyverse load `%>%` for you automatically, so you don’t usually load `magrittr` explicitly.*
## The base R pipe operator
**Before R version 4.1.0, the only pipe operator available was the `magrittr` pipe `%>%`,** hence I've introduced the pipe as part of the tidyverse section.
However, **R version 4.1.0 introduced a native base R pipe operator, `|>`**, which behaves like the `magrittr` pipe `%>%` in most common cases and is available without loading any packages!
::: callout-tip
One difference between the **base R pipe operator `|>`** and `%>%` is that `|>` has **no `.` placeholder**: by default it only passes the piped object as the first argument. Since R 4.2.0 you can use `_` as a placeholder instead, but only with a named argument (e.g. `data = _`), so it is still not as flexible as `%>%`.
:::
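
A sketch of the native pipe in use. The first two calls assume R 4.1.0 or later. The last one uses the `_` placeholder and assumes R 4.2.0 or later, so it will not parse on older versions. The object names are illustrative.

```{r}
library(dplyr, warn.conflicts = FALSE)
tree_tbl <- tibble::as_tibble(trees)

# The select-then-slice chain, written with the native pipe (R >= 4.1.0)
native <- tree_tbl |>
  select(Girth, Volume) |>
  slice(1:2)

# Piping into a non-first argument via an anonymous function (R >= 4.1.0)
fit_fn <- tree_tbl |> (\(d) lm(Girth ~ Height, data = d))()

# The `_` placeholder needs R >= 4.2.0 and a named argument
fit_ph <- tree_tbl |> lm(Girth ~ Height, data = _)

coef(fit_fn)
coef(fit_ph)
```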
## Data Transformation with `dplyr` Cheat Sheet
```{r}
#| echo: false
knitr::include_url("assets/cheatsheets/data-transformation.pdf")
```