---
title: "Data types, structures and classes"
---
## Base types
> Every object has a base type. Overall, there are 25 different base object types that form the core of the R language.
## Base data types
Base **data types** form the building blocks of all data structures and objects in R.
> There are **5 base data types: `double`, `integer`, `complex`, `logical`, `character`** as well as `NULL`.
No matter how complicated your analyses become, all data in R is interpreted as one of these basic data types.
You can inspect the type of a value or object with the `typeof()` function.
```{r}
typeof(3.14)
typeof(1L) # The L suffix makes the number an integer; by default R stores numbers as doubles
typeof(TRUE)
typeof("banana")
typeof(NULL)
```
::: callout-note
## `NA` values
In R, NA stands for "Not Available" and is used to represent missing or undefined values. It serves as a placeholder to indicate that a value is not present or cannot be determined for a particular observation in a dataset.
There are different types of `NA` in R, including `NA_integer_`, `NA_real_`, `NA_complex_`, `NA_character_`, and `NA_logical_`, corresponding to different data types. These are used to represent missing values in specific data types. The default `NA` data type is `logical`.
```{r}
typeof(NA)
typeof(NA_character_)
```
:::
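A plain `NA` takes on the type of the vector it is placed in, which is why it rarely needs to be typed explicitly. A minimal sketch (the variable names are illustrative only):

```{r}
# NA adopts the type of the other values in the vector
int_with_na <- c(1L, NA)
chr_with_na <- c("a", NA)

typeof(int_with_na)
typeof(chr_with_na)

# is.na() finds missing values regardless of the vector's type
is.na(chr_with_na)
```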
## Data Structures
### Arrays and type coercion
> The distinguishing feature of arrays is that all values are of the same data type.
Arrays can take values of any base data type and span any number of dimensions. However, **all values must be of the same base data type**. This allows for efficient calculation and matrix mathematics. The strictness also has some really important consequences, which introduce another key concept in R: **type coercion**.
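As a small illustration, here is a three-dimensional array built with `array()`. The values and dimensions are arbitrary choices for this sketch:

```{r}
# 24 integers arranged as 2 rows x 3 columns x 4 slices
array_example <- array(1:24, dim = c(2, 3, 4))

dim(array_example)     # length of each dimension
typeof(array_example)  # all values share a single base data type
```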
### Vectors and Type Coercion
#### Vectors
> Vectors are one dimensional arrays.
To better understand the importance of data types and coercion, let's meet a special case of an array, the **vector**.
One way to create a new vector is to use function `vector()`. You can specify the length of the vector with argument `length` and the base data type through argument `mode`.
```{r}
my_vector <- vector(length = 3)
my_vector
```
A vector in R is essentially an ordered collection of things, with the special condition that *everything in the vector must be the same basic data type*.
If you don't choose the data type, it'll default to `logical`.
```{r}
typeof(my_vector)
```
Otherwise, you can declare an empty vector of whatever type you like using argument `mode`.
```{r}
another_vector <- vector(mode = "character", length = 3)
another_vector
```
You can also create a vector of **a series of numbers:**
```{r}
1:10
seq(10)
seq(1, 10, by = 0.1)
```
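Note that these approaches do not all produce the same data type. A quick check, using illustrative variable names:

```{r}
int_seq <- 1:10                  # the colon operator gives integers here
dbl_seq <- seq(1, 10, by = 0.1)  # a fractional step gives doubles

typeof(int_seq)
typeof(dbl_seq)
length(dbl_seq)
```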
You can also create vectors by **combining individual elements using function `c()`** (for combine).
```{r}
c(2, 6, 3)
```
#### Type coercion
Q: Given what we've learned so far, what do you think the following will produce?
```{r}
c(2, 6, "3")
```
This is something called ***type coercion***, and it is the **source of many surprises** and the reason why we need to be aware of the basic data types and how R will interpret them.
When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type.
**Not all types can be coerced into another**; rather, **R has a coercion hierarchy rule**. All values are converted to the type that sits furthest along the hierarchy below among those present, i.e. the most flexible one.
::: {.alert .alert-success}
##### R coercion rules:
**`logical` -\> `integer` -\> `numeric` -\> `complex` -\> `character`**
*where `->` can be read as "are transformed into".*
:::
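We can walk along this hierarchy one step at a time by combining two values of neighbouring types and checking the result with `typeof()`:

```{r}
typeof(c(TRUE, 1L))    # logical + integer   becomes integer
typeof(c(1L, 1.5))     # integer + double    becomes double
typeof(c(1.5, 1+0i))   # double + complex    becomes complex
typeof(c(1+0i, "a"))   # complex + character becomes character
```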
In our case, the `2` and `6` values (doubles, since they were typed without the `L` suffix) were converted to character.
Some other examples:
```{r}
c("a", TRUE)
c("FALSE", TRUE)
c(0, TRUE)
```
You can try to force coercion against this flow using the `as.*()` functions:
```{r}
chars <- c("0", "2", "4")
as.numeric(chars)
as.logical(chars)
as.logical(as.numeric(chars))
as.logical(c(0, TRUE))
as.logical(c("FALSE", TRUE))
as.numeric(c("FALSE", TRUE))
as.numeric(as.logical(c("FALSE", TRUE)))
```
As you can see, some surprising things can happen when R forces one basic data type into another!
**If your data isn't the data type you expected, type coercion may well be to blame;** make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!
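One way to track down the culprit is to convert the values explicitly and see which entries turn into `NA`. This is only a sketch, and `raw_values` is a made-up input:

```{r}
# Hypothetical input: numbers read in as text, with one invalid entry
raw_values <- c("1", "2", "three")

# as.numeric() warns and returns NA for entries it cannot convert
converted <- suppressWarnings(as.numeric(raw_values))
converted

# Entries that became NA during conversion are the ones forcing character type
raw_values[is.na(converted) & !is.na(raw_values)]
```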
#### Inspecting vectors
We can ask a few questions about vectors:
```{r}
sequence_example <- seq(10)
head(sequence_example)
tail(sequence_example, n = 4)
length(sequence_example)
str(sequence_example)
```
The somewhat cryptic output from `str()` indicates the basic data type found in this vector (in this case `int`, integer); the range of indexes of the vector (in this case `[1:10]`), which tells us how many things it holds; and a few examples of what's actually in the vector (in this case ascending integers).
#### Naming vectors
Finally, you can give names to elements in your vector:
```{r}
my_example <- 5:8
names(my_example) <- c("a", "b", "c", "d")
my_example
names(my_example)
```
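Names can also be supplied when the vector is created, and can then be used to pick out elements. A brief sketch with an illustrative vector:

```{r}
named_example <- c(a = 5L, b = 6L, c = 7L, d = 8L)

named_example["b"]     # returns the element together with its name
named_example[["b"]]   # returns the bare value
unname(named_example)  # drops all names
```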
[Find out more about vectors](https://r4ds.had.co.nz/vectors.html)
### Matrices
> Matrices are 2 dimensional arrays
The lengths of each dimension are defined by the number of rows and columns.
We can declare a matrix full of zeros:
```{r}
matrix_example <- matrix(0, ncol = 6, nrow = 3)
matrix_example
```
We can get the length of each dimension of a matrix (or of any array) with `dim()`. The number of dimensions is the length of the result.
```{r}
dim(matrix_example)
```
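To see that a matrix really is a two-dimensional array, we can build one from a vector and inspect it. The values here are arbitrary:

```{r}
# Values fill the matrix column by column by default
filled_matrix <- matrix(1:6, nrow = 2)
filled_matrix

nrow(filled_matrix)
ncol(filled_matrix)
length(dim(filled_matrix))  # number of dimensions
is.array(filled_matrix)
```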
## Lists
> **Lists can store objects of any data type and class**
Another key data structure is the `list`. Lists are the **most flexible data structure** because each element can hold any object, of any data type and dimension, including other lists.
Create lists using `list()` or coerce other objects using `as.list()`.
```{r}
list(1, "a", TRUE)
```
```{r}
as.list(1:4)
```
We can name list elements:
```{r}
a_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
a_list
```
Lists are a base type:
```{r}
typeof(a_list)
```
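Because list elements can themselves be lists, we can nest them. A minimal sketch, with element names chosen only for illustration:

```{r}
nested_list <- list(
  title = "Numbers",
  numbers = 1:10,
  extra = list(flag = TRUE, label = "a")  # a list inside a list
)

nested_list$title         # access an element by name
nested_list[["numbers"]]  # equivalent access with [[ ]]
nested_list$extra$label   # reach into the nested list
str(nested_list)
```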
## Data.frames
### S3, S4 and R6 objects
Arrays and lists are all built on base types, which are defined by R itself. However, there are other types of objects in R.
These are S3, S4 and R6 objects, with S3 being the most common. S3 and S4 ship with R, while R6 is provided by the R6 package.
Such objects have **a class attribute** (base types can have a class attribute too), enabling class-specific functionality, a characteristic of object-oriented programming. New classes can be created by users, allowing greater flexibility in the types of data structures available for analyses.
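As a rough sketch of what creating a class looks like in S3, we can attach a class attribute to a list and define a `print()` method for it. The class name `temperature` and its fields are hypothetical, chosen only for this example:

```{r}
# Attach a (hypothetical) class attribute to an ordinary list
reading <- structure(list(value = 21.5, unit = "C"), class = "temperature")

# print() dispatches to this method for objects of class "temperature"
print.temperature <- function(x, ...) {
  cat(x$value, "degrees", x$unit, "\n")
  invisible(x)
}

class(reading)   # the class attribute
typeof(reading)  # still a list under the hood
print(reading)
```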
[Learn more about object types](https://adv-r.hadley.nz/base-types.html)
### Data.frames
> **The most important S3 object class in R is the data.frame**.
>
> **Data.frames are special types of lists.**
Data.frames are **used to store tabular data** and are special types of lists where **each element is a vector**, **each of equal length**. So each column of a data.frame contains values of consistent data type but the data type can vary between columns (i.e. along rows).
```{r}
df <- data.frame(
  id = 1:3,
  treatment = c("a", "b", "b"),
  complete = c(TRUE, TRUE, FALSE)
)
df
```
We can check that our data.frame is a list under the hood:
```{r}
typeof(df)
```
As an S3 object, it also has a class attribute:
```{r}
class(df)
```
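Since a data.frame is a list of equal-length vectors, the usual list tools work on it column by column. A short sketch, using a separate illustrative object `df_demo` and assuming R 4.0.0 or later (so that character columns stay character):

```{r}
df_demo <- data.frame(
  id = 1:3,
  treatment = c("a", "b", "b"),
  complete = c(TRUE, TRUE, FALSE)
)

is.list(df_demo)
length(df_demo)          # number of list elements, i.e. columns
sapply(df_demo, typeof)  # each column has its own base data type
df_demo$treatment        # a column is just a vector
```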
We can check the dimensions of a data.frame:
```{r}
dim(df)
```
We can get a certain number of rows from the top or bottom:
```{r}
head(df, 1)
```
```{r}
tail(df, 1)
```
Importantly, we can display the structure of a data.frame.
```{r}
str(df)
```
### A note on factors
Note that the default behaviour of `data.frame()` used to be to convert character vectors to factors (this default changed as of R 4.0.0). Factors are another important data structure for handling categorical data, which have particular statistical properties. They can be useful during modelling and plotting, but in the interest of time we will not discuss them further here.
In versions of R before 4.0.0, you can suppress this default behaviour using `stringsAsFactors = FALSE` (from R 4.0.0 onwards this is already the default):
```{r}
df <- data.frame(
  id = 1:3,
  treatment = c("a", "b", "b"),
  complete = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
str(df)
```
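To see what the argument changes, we can compare both settings side by side. The object names below are illustrative only:

```{r}
df_chr <- data.frame(treatment = c("a", "b", "b"), stringsAsFactors = FALSE)
df_fct <- data.frame(treatment = c("a", "b", "b"), stringsAsFactors = TRUE)

class(df_chr$treatment)   # stays a character vector
class(df_fct$treatment)   # converted to a factor
levels(df_fct$treatment)  # the distinct categories
typeof(df_fct$treatment)  # factors are stored as integer codes
```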
[Find out more about factors.](https://r4ds.had.co.nz/factors.html)