---
title: "R in Data Science - Breakout Session - Campus DevCon Summit"
author: "Jose Marie Antonio Minoza"
date: '2023-09-08'
output: html_document
---
# Introduction to R
## What is R?
R is a programming language and environment for statistical computing and data analysis. It was created by statisticians and is used for a wide range of data-related tasks, from data manipulation and visualization to advanced statistical modeling.
## Why use R?
R is widely used in academia, industry, and research for several reasons:
- Open-source: R is free and open-source, making it accessible to anyone.
- Active community: A large and active community of users and developers provides support and contributes packages.
- Extensive libraries: R has a rich ecosystem of packages for various data analysis and visualization tasks.
- Data manipulation: R excels at data manipulation and transformation.
- Statistical analysis: R offers a wide range of statistical techniques and models.
- Data visualization: R has excellent tools for creating informative data visualizations.
# Data Types in R
## Numeric
Numeric data types represent numbers. By default R stores both whole numbers and decimals as double-precision floating-point values; an integer is written with an `L` suffix, as in `5L`.
```{r}
# Numeric variables
x <- 5
y <- 3.14
```
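To check how R actually stores a number, the value can be inspected with `class()` and `typeof()`. The sketch below uses illustrative names (`count_dbl`, `count_int`) that are not part of the original session.

```{r}
# A bare number is stored as a double, even without a decimal point
count_dbl <- 5
# The L suffix asks R for an integer instead
count_int <- 5L

class(count_dbl)   # "numeric"
typeof(count_dbl)  # "double"
class(count_int)   # "integer"
typeof(count_int)  # "integer"

# Both kinds of value still count as numeric
is.numeric(count_dbl)
is.numeric(count_int)
```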
## Character
Character data types are used to store text and strings.
```{r}
# Character variables
name <- "John"
message <- "Hello, world!"
```
## Logical
Logical data types represent binary values - either TRUE or FALSE.
```{r}
# Logical variables
is_raining <- TRUE
is_sunny <- FALSE
```
## Factors
Factors are used to represent categorical data with predefined levels.
```{r}
# Creating a factor
gender <- factor(c("Male", "Female", "Male", "Female"))
```
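A factor stores its categories as a set of levels plus an integer code for each observation. As a minimal sketch (the names `gender_example` and `size` are illustrative), the levels and codes can be inspected like this:

```{r}
# Rebuilt here so the chunk runs on its own
gender_example <- factor(c("Male", "Female", "Male", "Female"))

levels(gender_example)      # levels are sorted alphabetically: "Female" "Male"
table(gender_example)       # number of observations per level
as.integer(gender_example)  # underlying integer codes: 2 1 2 1

# An explicit level order can be supplied when alphabetical order is not wanted
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"))
levels(size)                # "low" "medium" "high"
```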
# Operators
```{r}
a <- 1 + 1 # addition
b <- 2 * 9 # multiplication
c <- 3 - 1 # subtraction
d <- a / c # division
e <- b %% 5 # modulus
```
# Special Values
```{r}
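NA    # missing value
NaN   # "not a number", an undefined numeric result
NULL  # the empty object
```
```{r}
1/0        # Inf: a positive number divided by zero
90000^200  # Inf: the result is too large to store as a double
```
```{r}
Inf * -2   # -Inf
Inf - Inf  # NaN: the difference is undefined
```
```{r}
# Any finite number divided by an infinite one gives zero
-2023/Inf
2023/Inf
-2023/-Inf
2023/-Inf
```
```{r}
-2023/0  # -Inf
2023/0   # Inf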
```
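Comparing a value with `NA` returns `NA` rather than `TRUE` or `FALSE`, so these special values are usually detected with helper functions instead of `==`. A short sketch, using a made-up vector named `special_values`:

```{r}
special_values <- c(1, NA, NaN, Inf, -Inf)

is.na(special_values)      # TRUE for NA and also for NaN
is.nan(special_values)     # TRUE only for NaN
is.finite(special_values)  # TRUE only for ordinary numbers
is.null(NULL)              # TRUE

NA == 1                    # NA: a comparison with a missing value is itself missing

# Keep only the usable numbers before summarising
mean(special_values[is.finite(special_values)])
```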
## Data Frames
Data frames are used to store tabular data with rows and columns.
```{r}
# Creating a data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 22)
)
```
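Once a data frame exists, its columns and rows can be inspected and filtered. The following sketch rebuilds the same table under the illustrative name `people`; the `AgeNextYear` column is a made-up example of a derived column.

```{r}
# Rebuilt here so the chunk runs on its own
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)

str(people)                  # column types and a preview of the values
people$Age                   # one column as a vector
people[people$Age > 23, ]    # rows where Age is above 23

# Add a derived column
people$AgeNextYear <- people$Age + 1
nrow(people)                 # 3
ncol(people)                 # 3
```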
# Matrices in R
## Creating Matrices
Matrices are two-dimensional arrays in R.
```{r}
# Creating a matrix
matrix_data <- matrix(1:6, nrow = 2, ncol = 3)
```
## Matrix Operations
You can perform various operations on matrices, including addition, multiplication, and transposition.
```{r}
# Matrix operations
mat_a <- matrix(1:6, nrow = 2, ncol = 3)
mat_b <- matrix(7:12, nrow = 2, ncol = 3)
mat_sum <- mat_a + mat_b        # element-wise addition
mat_prod <- mat_a %*% t(mat_b)  # matrix product of a 2 x 3 and a 3 x 2 matrix
```
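The `*` operator and the `%*%` operator do different things, which is easy to overlook. A minimal sketch with illustrative names `m1` and `m2` that compares the two and checks the resulting dimensions:

```{r}
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m2 <- matrix(7:12, nrow = 2, ncol = 3)

dim(m1)       # 2 3
dim(t(m1))    # 3 2: transposition swaps rows and columns

m1 * m2       # element-wise product, same shape as the inputs
m1 %*% t(m2)  # matrix product: (2 x 3) times (3 x 2) gives a 2 x 2 result
```

Calling `m1 %*% m2` directly would fail, because the number of columns of the first matrix must match the number of rows of the second.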
## Indexing and Subsetting
You can access elements and subsets of matrices using indexing.
```{r}
# Matrix indexing and subsetting
first_row <- matrix_data[1, ]
second_col <- matrix_data[, 2]
subset_matrix <- matrix_data[1:2, 2:3]
```
```{r}
A <- matrix(data = c(1,2,3,4,5,6,7,8),nrow = 2, ncol = 4)
A
```
```{r}
A[1,2] # A[row,column]
A[,2] # return all rows of Column 2
A[1,] # return all columns of Row 1
```
```{r}
C <- matrix(data = 1:100,nrow = 5, ncol = 20)
C
```
```{r}
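C[1,1:20][C[1,1:20]<10]  # values in row 1 that are below 10
C[,1][C[,1]>10]          # values in column 1 above 10 (none here, so the result is empty)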
```
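The conditional subsetting above relies on logical vectors. As a sketch under illustrative names (`big_matrix`, `row_one`, `mask`), the intermediate steps can be written out separately:

```{r}
big_matrix <- matrix(1:100, nrow = 5, ncol = 20)

row_one <- big_matrix[1, ]  # matrices fill column by column: 1, 6, 11, ...
mask <- row_one < 10        # logical vector, one entry per element
row_one[mask]               # values that satisfy the condition: 1 6
which(mask)                 # their positions within the row: 1 2

# The same idea selects whole rows: rows whose first-column value exceeds 3
big_matrix[big_matrix[, 1] > 3, 1:3]
```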
# Functions in R
## Built-in Functions
R provides numerous built-in functions for various purposes.
```{r}
# Built-in functions
mean_value <- mean(c(1, 2, 3, 4, 5))
sqrt_value <- sqrt(16)
```
```{r}
# Logarithms
log(x=144,base=12)
log(x=216,base=6)
# Exponential and natural logarithm
exp(x=3)
log(x=20.17)  # with no base given, log() uses base e
```
```{r}
x <- c(2,0,2,3) # Creates a vector with elements 2, 0, 2, 3
y <- c(1,6,2,9)
x + y # Adds the two vectors element by element
```
```{r}
x * 4 # Multiplies each element by 4
```
```{r}
sqrt(x) # Function applies to each element
```
```{r}
vector <- c(1,2,3,4,5)
length(vector)
```
## Creating User-defined Functions
You can define your own functions in R using the `function` keyword.
```{r}
# User-defined function
my_function <- function(x, y) {
  result <- x^2 + y^2
  return(result)
}
```
## Function Arguments
Functions can have arguments that allow you to pass values to them.
```{r}
# Function with arguments
calculate_area <- function(length, width) {
  area <- length * width
  return(area)
}
```
```{r}
calculate_area(4, 4)
```
```{r}
l <- seq(1,4)
w <- rep(5,4)
l
w
```
```{r}
calculate_area(l, w)
```
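Arguments can also be given default values and matched by name. The variant below, `calculate_area_default`, is hypothetical and only illustrates the mechanism; it is not part of the original session.

```{r}
# Hypothetical variant: `width` defaults to `length`, which gives a square
calculate_area_default <- function(length, width = length) {
  length * width  # the last evaluated expression is returned automatically
}

calculate_area_default(4)                      # 16: width falls back to length
calculate_area_default(width = 2, length = 5)  # 10: arguments matched by name
calculate_area_default(1:4, 5)                 # 5 10 15 20: works element-wise
```

Because `*` is vectorised, the function accepts vectors without any extra code, just as `calculate_area(l, w)` does.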
# The Apply Family of Functions
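The apply family of functions in R applies a function repeatedly without an explicit loop. `apply()` works over the rows or columns of a matrix or data frame, `lapply()` and `sapply()` work over the elements of a list or vector, and `mapply()` works over several vectors in parallel.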
## apply()
The apply() function applies a specified function to either rows or columns of a matrix.
```{r}
# Example of apply() to calculate row sums
row_sums <- apply(matrix_data, 1, sum)
```
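The second argument of `apply()` selects the direction: `1` for rows and `2` for columns. A brief sketch with a small made-up matrix named `scores`:

```{r}
scores <- matrix(1:6, nrow = 2, ncol = 3)

apply(scores, 1, sum)   # one result per row: 9 12
apply(scores, 2, mean)  # one result per column: 1.5 3.5 5.5

# Any function can be supplied, including an anonymous one
apply(scores, 2, function(col) max(col) - min(col))  # 1 1 1

# For plain sums and means, base R also offers dedicated helpers
rowSums(scores)
colMeans(scores)
```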
## lapply()
The lapply() function applies a function to each element of a list or vector and returns a list.
```{r}
# Example of lapply() to square each element of a list
numbers <- list(a = 1, b = 2, c = 3)
squared_numbers <- lapply(numbers, function(x) x^2)
```
## sapply()
The sapply() function is similar to lapply() but simplifies the result to a vector or matrix if possible.
```{r}
# Example of sapply() to square each element of a list
numbers <- list(a = 1, b = 2, c = 3)
squared_numbers <- sapply(numbers, function(x) x^2)
```
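The difference between `lapply()` and `sapply()` shows up in the type of the result. A minimal sketch with illustrative names (`nums`, `as_list`, `as_vector`); `vapply()` is included as a stricter alternative that is not covered above:

```{r}
nums <- list(a = 1, b = 2, c = 3)

as_list <- lapply(nums, function(x) x^2)
as_vector <- sapply(nums, function(x) x^2)

class(as_list)    # "list"
class(as_vector)  # "numeric"
as_vector[["b"]]  # 4: element names are carried over

# vapply() requires the expected type and length of each result up front
vapply(nums, function(x) x^2, numeric(1))
```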
## mapply()
The mapply() function is used for multiple input vectors and applies a function element-wise.
```{r}
# Example of mapply() to calculate the sum of two vectors element-wise
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
sum_vector <- mapply(function(x, y) x + y, vector1, vector2)
```
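For simple arithmetic such as the sum above, `vector1 + vector2` gives the same result because `+` is already vectorised. `mapply()` is more useful when the function itself is not vectorised over its arguments. A small sketch with made-up inputs (`starts`, `counts`):

```{r}
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
v1 + v2  # 5 7 9, same as mapply(function(x, y) x + y, v1, v2)

# Sum of n consecutive numbers beginning at `from`, for each pair of inputs
starts <- c(1, 10, 100)
counts <- c(1, 2, 3)
run_totals <- mapply(function(from, n) sum(seq(from, length.out = n)),
                     starts, counts)
run_totals  # 1 21 303
```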
# Data Science in R
## Visualization
### Basic Graphics (Base R)
Basic graphics in R allow you to create simple plots using functions like plot, hist, and boxplot. Let's create some basic visualizations.
#### Scatterplot
```{r}
head(iris, 5)
```
```{r}
# Scatterplot of Sepal Length vs. Sepal Width
plot(iris$Sepal.Length,
     iris$Sepal.Width,
     xlab = "Sepal Length",
     ylab = "Sepal Width",
     main = "Scatterplot of Sepal Length vs. Sepal Width")
```
```{r}
# Load ggplot2, installing it first if needed (requires the pacman package)
pacman::p_load('ggplot2')
# Scatterplot using ggplot2
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  labs(x = "Sepal Length", y = "Sepal Width", title = "Scatterplot of Sepal Length vs. Sepal Width")
```
```{r}
# Histogram of Petal Length
hist(iris$Petal.Length, xlab = "Petal Length", main = "Histogram of Petal Length")
```
```{r}
# Histogram of Petal Length using ggplot2
ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.2, fill = "blue", color = "black") +
  labs(x = "Petal Length", title = "Histogram of Petal Length")
```
```{r}
# Boxplot of Sepal Length by Species
boxplot(Sepal.Length ~ Species,
        data = iris,
        xlab = "Species",
        ylab = "Sepal Length",
        main = "Boxplot of Sepal Length by Species")
```
```{r}
# Boxplot of Sepal Length by Species using ggplot2
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(fill = "green", color = "black") +
  labs(x = "Species", y = "Sepal Length", title = "Boxplot of Sepal Length by Species")
```
## Regression Analysis
In this tutorial, we will explore regression analysis in R. Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. We will cover the following topics:
1. Introduction to Regression
2. Loading and Preparing the Data
3. Simple Linear Regression
4. Multiple Linear Regression
5. Model Evaluation
### 1. Introduction to Regression
Regression analysis helps us understand how the value of a dependent variable changes when one or more independent variables change. It is commonly used for prediction and understanding relationships between variables.
### 2. Loading and Preparing the Data
#### Loading Data
Let's start by loading a dataset. For this tutorial, we'll use the built-in `mtcars` dataset, which contains information about car attributes.
```{r}
# Load the mtcars dataset
data(mtcars)
```
```{r}
head(mtcars)
```
### 3. Simple Linear Regression
Simple linear regression is used when there is a linear relationship between the dependent variable and a single independent variable. Let's perform simple linear regression.
```{r}
# Fit a simple linear regression model
model_simple <- lm(mpg ~ hp, data = mtcars)
# Print the regression summary
summary(model_simple)
```
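Once a model is fitted, its coefficients can be extracted and used for prediction. A minimal sketch, refitting the same formula under the illustrative name `fit_hp` and predicting for three made-up horsepower values:

```{r}
fit_hp <- lm(mpg ~ hp, data = mtcars)

# Intercept and slope of the fitted line
coef(fit_hp)

# Predicted mpg for new cars; the column name must match the predictor
new_cars <- data.frame(hp = c(100, 150, 200))
predict(fit_hp, newdata = new_cars)

# The same predictions with 95% confidence intervals for the mean response
predict(fit_hp, newdata = new_cars, interval = "confidence")
```

The slope is negative for this data, so predicted fuel efficiency falls as horsepower rises.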
### 4. Multiple Linear Regression
Multiple linear regression extends simple linear regression to include multiple independent variables. Let's perform multiple linear regression.
```{r}
# Fit a multiple linear regression model
model_multiple <- lm(mpg ~ hp + wt + qsec, data = mtcars)
# Print the regression summary
summary(model_multiple)
```
### 5. Model Evaluation
To evaluate the regression models, we can use various metrics such as R-squared, adjusted R-squared, and residual analysis.
```{r}
# Evaluate the simple linear regression model
rsq_simple <- summary(model_simple)$r.squared
adj_rsq_simple <- summary(model_simple)$adj.r.squared
# Evaluate the multiple linear regression model
rsq_multiple <- summary(model_multiple)$r.squared
adj_rsq_multiple <- summary(model_multiple)$adj.r.squared
```
```{r}
rsq_simple
adj_rsq_simple
```
```{r}
rsq_multiple
adj_rsq_multiple
```
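Residual analysis can be sketched in a few lines. The example below refits both models under illustrative names (`fit_small`, `fit_large`). The helper `rmse()` is hypothetical, and RMSE is not among the metrics named above; it is added here as one common residual-based summary, alongside an F-test for nested models.

```{r}
fit_small <- lm(mpg ~ hp, data = mtcars)
fit_large <- lm(mpg ~ hp + wt + qsec, data = mtcars)

# Root mean squared error of the in-sample residuals (hypothetical helper)
rmse <- function(model) sqrt(mean(residuals(model)^2))
rmse(fit_small)
rmse(fit_large)

# F-test: do the extra predictors improve on the simpler nested model?
anova(fit_small, fit_large)

# Residuals vs. fitted values; a visible pattern suggests a poor fit
plot(fitted(fit_small), residuals(fit_small),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted (simple model)")
abline(h = 0, lty = 2)
```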
```{r}
saveRDS(model_simple, "simple_regression_model.RDS")
```
```{r}
saveRDS(model_multiple, "final_regression_model.RDS")
```
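A model saved with `saveRDS()` can be restored later with `readRDS()`. As a sketch, the round trip below writes to a temporary file rather than the file names used above; `fit` and `restored` are illustrative names.

```{r}
fit <- lm(mpg ~ hp, data = mtcars)

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".RDS")
saveRDS(fit, path)

# Read the model back and confirm that it behaves like the original
restored <- readRDS(path)
all.equal(coef(fit), coef(restored))  # TRUE
predict(restored, newdata = data.frame(hp = 120))
```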
# R Shiny
## Embedded Application
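It's also possible to embed an entire Shiny application within an R Markdown document using the `shinyAppDir` function. The chunks below point at example applications that ship with the shiny package. The calls are commented out here; running them requires an interactive document, with `runtime: shiny` set in the YAML header.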
### Tabsets
```{r, echo=FALSE}
# shinyAppDir(
#   system.file("examples/06_tabsets", package = "shiny"),
#   options = list(
#     width = "100%", height = 550
#   )
# )
```
### Widgets
```{r, echo=FALSE}
# shinyAppDir(
#   system.file("examples/07_widgets", package = "shiny"),
#   options = list(
#     width = "100%", height = 550
#   )
# )
```