-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathST558_Project_3.Rmd
680 lines (469 loc) · 32.1 KB
/
ST558_Project_3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
---
title: "Article Shares Prediction"
author: "Shyam Gadhwala & Kamlesh Pandey"
output: github_document
params:
data_channel: "data_channel_is_lifestyle"
---
# Library
```{r lib_import, warning=FALSE, message=FALSE}
library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(caret)
library(glmnet)
```
```{r, include=FALSE, eval=TRUE, results='hide'}
nameOFProj = params$data_channel
nameOFProj <- str_split(nameOFProj, "_")[[1]][4]
nameOFProj <- toupper(nameOFProj)
```
# Introdcution
We are living in a digital world where people are more concerned about the digital footprint and people often consider the like and shares they get on a post that they publish online as an important metric. We have several social media websites and print media to share our articles. Online print media platform like [Mashable](www.mashable.com) publishes thousand of online media everyday and it is important for them get a more user engagement from the the article they post online. In this data set we are trying to predict the shares using that data that would have several potential influencing factors, in other words, predictors, such as text sentiment polarity, rate of negative words, rate of positive words, to name some.
Why this analysis is important ?
From this predictive model, Mashable can potentially predict the number of shares they can receive based on the type of article they are publishing online.
# Data
The data set summarizes a heterogeneous set of features about articles published by in a period of two years.The data set has 39644 entries and 61 feature columns. The project is aimed to subset the original data set based on type of data channel (one of the six type) and then to predict the number of shares.
For this part of the project we are using **`r nameOFProj`** data channel for training and building a predictive modeling and extending same models to other five data channels.
```{r dataset, warning=FALSE, message=FALSE}
newspopData <- read_csv('OnlineNewsPopularity.csv')
# select specified data channel from params and drop other data channel columns
newspopData <- newspopData %>% filter(newspopData[params$data_channel] == 1) %>%
select( -starts_with('data_channel_is_'), -url)
newspopData
```
# Exploratory Data Analysis
```{r}
summary(newspopData$shares)
```
Dividing the popularity of the shares based on quantiles to classify them as to how popular they are. The 1st quarter percentile would be popularity index 4, next quarter be index 3, so on and so forth. Higher the number of shares of that article, higher the popularity index.
```{r}
pShares <- newspopData$shares
pShares <- sort(pShares)
quantileValues <- quantile(pShares, c(.25, .5, .75))
newspopData <- newspopData %>% mutate(article_popularity_index = if_else(shares<quantileValues[1], "4",
if_else(shares<quantileValues[2], "3",
if_else(shares<quantileValues[3],"2", "1"))))
newspopData$article_popularity_index <- as_factor(newspopData$article_popularity_index)
```
Starting EDA with the visualization of the target variable shares as a function of the most obvious factor, number of words in the article:
```{r eda, warning=FALSE, message=FALSE}
newspopData$scaledShare <- scale(newspopData$shares, center = T, scale = T)
#calculating correlation index
corrIndex <- round(cor(newspopData$scaledShare, newspopData$n_tokens_content),2)
# plotting the shares as a function of words in an article
ggplot(newspopData, aes(x= n_tokens_content, y = scaledShare)) +
geom_point()+
labs(subtitle = 'Word Count v/s Number of Shares',
y = 'Number of Share (Scaled)',
x = 'Number of Words in article',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = 'Source: News Popularity Dataset') +
geom_text(x = max(newspopData$n_tokens_content)/2,
y = max(newspopData$scaledShare)/2, size = 4,
label = paste0("Correlation coefficient = ", corrIndex), color = 'red')
```
We can inspect trend of Number of shares as a function of Number of words in the article. If the points show an upward trend, then the article with high number of words are shared more. However, if we see a negative trend then we can estimate that with increasing number of words in the article, number of shares decreases. This trend can be investigated further with the correlation coefficientm which in this case is `r corrIndex`.
Now, creating a new variable here that help eliminates the dummy variable for each day of the week. This new variable will take the value of each day of the week that corresponds to the dummy variable's value:
```{r, warning=FALSE}
newspopData <- newspopData %>% select(-scaledShare)
newspopData <- newspopData %>% mutate(day = if_else(weekday_is_monday== 1, "Monday",
if_else(weekday_is_tuesday== 1, "Tuesday",
if_else(weekday_is_wednesday==1, "Wednesday",
if_else(weekday_is_thursday== 1, "Thursday",
if_else(weekday_is_friday== 1, "Friday",
if_else(weekday_is_saturday== 1,"Saturday",
if_else(weekday_is_sunday== 1, "Sunday",
"-"))))))))
newspopData$day <- as_factor(newspopData$day)
```
Some statistics based on the variables is as shown:
```{r, message=FALSE, warning=FALSE}
knitr::kable(newspopData %>% group_by(day) %>% summarize(Count_of_Articles = n()))
```
```{r, message=FALSE, warning=FALSE}
knitr::kable(newspopData %>% group_by(day, article_popularity_index) %>% summarize(Count_of_Articles = n()))
```
This table shows the shares of articles on each day of the week. For tech based articles, we can expect a rise in shares in mid week or Fridays or after any tech has been launched. For entertainment and lifestyle articles, post-weekend period would be the most active period of sharing. While world, social media and business would not have a definitive trend.
```{r, message=FALSE, warning=FALSE}
ggplot(newspopData, aes(x = day, y = shares/1000000)) +
geom_bar(stat = 'identity', width = 0.5, fill = 'tomato3')+
labs(subtitle = 'Number of Shares (Million) Vs Day of Week',
caption = 'Source : News Popularity Dataset',
y = 'Total Share count in Million',
x = 'Day of the Week',
title = toupper(str_replace_all(params$data_channel, "_", " ")),) +
theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
```
From this bar chart we can visualize the shares trend across the week. The users engagement with the type of data Chanel (tech, entertainment, politics) will be different across the week. Users may be more inclined toward sharing lifestyle and entertainment news during weekend and prefer less to share technology/science related news during same time.
```{r, message=FALSE, warning=FALSE}
ggplot(newspopData, aes(x = global_rate_positive_words, y = global_sentiment_polarity)) +
geom_point(aes(col = day, shape = article_popularity_index)) +
geom_smooth(aes(col = day), method = 'lm', se = F) +
scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
x = 'Global Positive Word Rate',
y = 'Global Sentiment Polarity',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
color = 'Days',
size = 'Shares per Million') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
```
```{r, message=FALSE, warning=FALSE}
ggplot(newspopData, aes(x = global_rate_negative_words, y = global_sentiment_polarity)) +
geom_point(aes (col = day, shape = article_popularity_index)) +
geom_smooth(aes(col = day), method = 'lm', se = F) +
scale_shape_discrete(name="Article Popularity Index", labels = c("4", "3", "2", "1"))+
labs(subtitle = 'Positive Rate VS Sentiment Polarity Plot',
x = 'Global Negative Word Rate',
y = 'Global Sentiment Polarity',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
color = 'Days',
size = 'Shares per Million') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
```
The above plots estimate the general sentiments of users based on positive and negative words in the content. Ideal expectations would be: an article with more positive words has a positive sentiment index and more shares, and vice versa. Exceptions may be found in peculiar cases.
Plotting the number of shares based on the number of images and videos that an article has, based on what day of the week it is:
```{r, warning=FALSE, fig.width=20, fig.height=8, message=FALSE}
ggplot(newspopData, aes(x=day, y = shares/1000000)) +
geom_bar(aes(fill = as_factor(num_imgs)), stat="identity", position="dodge") +
scale_fill_discrete(name = 'Number of Images \n in Article') +
labs(subtitle = 'Number of Images in article v/s number of shares',
x = 'Number of Images',
y = 'Number of shares (in millions)',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset") +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
```
This plot shows the number of shares (in millions) based on number of images included in each articles for each day of the week. The general trend might see a increase in sharing of articles over Fridays and weekends when there is more leisure time. Moreover, tech based articles can also see a rise in share if any new technology is released in middle of the week. Entertainment and Lifestyles articles can see a rise in share if any events happened over the weekend which is the case most of the times. World and social media Articles should not have a trend, as events from all over the world keep on happening throughout the week and updates are posted on social media non stop. Business articles would be more trending during working days.
in addition to that, not all articles would be oriented towards images. For example, tech and business based articles readers would not care about the number of images included, but only the content itself, while for entertainment, world, lifestyle articles, images are one of the major factors in capturing user's attention, and thus, shares.
```{r, message=FALSE, warning=FALSE}
ggplot(newspopData, aes(x = num_keywords, y = shares/1000000)) +
geom_bar(aes(fill = day), stat="identity") +
scale_x_discrete(limits=c(2:10)) +
labs(subtitle = 'Shares based on number of keywords',
x = 'Number of Keywords',
y = 'Shares (in millions)',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
fill = 'Days') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
```
This graph represents the shares trend of any article based on the number of keywords. A general speculated trend would be: the higher the number of keywords, the higher it is descriptive about its content, and hence the number of shares would also be higher. However. this might not hold true for business based articles where short and crisp key words can also do the trick. Otherwise, the trend of shares should be positive with respect to number of keywords.
Now, the title sentiment polarity is one of the first things that users read while making a decision about reading an article let alone sharing it. Hence it is important to see how is the title sentiment polarity, is it negative or positive. Here, we are dividing the polarities into 4 sections; Highly positive (Sentiment Polarity between 0.5 and 1), positive (Sentiment Polarity between 0 and 0.5), negative ((Sentiment Polarity between -0.5 and 0)) and highly negative (Sentiment Polarity between -1 and -0.5).
Creating the new variable to classify the sentiments into above category:
```{r, message=FALSE, warning=FALSE}
pols = c("Extremely Positive", "Positive", "Negative", "Extremely Negative")
newspopData <- newspopData %>% mutate(title_Sentiment_Class =
if_else(title_sentiment_polarity > 0.5, pols[1],
if_else(title_sentiment_polarity > 0, pols[2],
if_else(title_sentiment_polarity > -0.5, pols[3], pols[4]))))
newspopData$title_Sentiment_Class <- as_factor(newspopData$title_Sentiment_Class)
```
Statistics about the articles that are in each of these categories:
```{r, message=FALSE, warning=FALSE}
knitr::kable(newspopData %>% group_by(title_Sentiment_Class) %>% summarize(Count_of_articles = n()))
```
```{r, message=FALSE, warning=FALSE}
knitr::kable(newspopData %>% group_by(title_Sentiment_Class, article_popularity_index) %>% summarize(Count_of_articles = n()))
```
The tables above represents the number of articles classified into each title sentiment category, and based on their popularity index.
```{r, message=FALSE, warning=FALSE}
knitr::kable(newspopData %>% group_by(title_Sentiment_Class, day) %>% summarize(Count_of_articles = n()))
```
This table further shows the shares of different title sentimental types of articles for each day of the week.
```{r, message=FALSE, warning=FALSE}
newspopData <- newspopData %>% mutate(content_Sentiment_Class =
if_else( global_sentiment_polarity > 0.5, pols[1],
if_else( global_sentiment_polarity > 0, pols[2],
if_else( global_sentiment_polarity > -0.5, pols[3], pols[4]))))
newspopData$content_Sentiment_Class <- as_factor(newspopData$content_Sentiment_Class)
# average article length
knitr::kable(newspopData %>%
group_by(content_Sentiment_Class, day)%>%
select(day, content_Sentiment_Class, n_tokens_content) %>%
summarise(Average_Content_Length = round(sum(n_tokens_content)/n())))
```
We further want to extend our analysis based on the average number of tokens on an article published during the weekdays and weekend.
```{r, warning=FALSE, message=FALSE}
ggplot(newspopData, aes(x=title_Sentiment_Class, y=shares/1000000)) +
geom_bar(aes(fill = day), stat="identity", position="dodge") +
labs(subtitle = 'Shares based on title sentiment and day of the week',
x = 'Title Sentiment Category',
y = 'Shares (in millions)',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset",
fill = 'Days') +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1)) +
geom_text(x = 3, y = 0.6, size = 3,
label = paste("Title Sentiment Score range", "Extremely Positive = 0.5 <--> 1",
'Positive = 0 <--> 0.5 ',
'Negative = -0.5 <--> 0',
'Extremely Negative = -1 <--> -0.5',
sep = '\n' ), color = 'black')
```
The above graph represents the shares of articles (in millions) based on title sentiment category for each day of the week. Some of the trends that can be thought of here is that entertainment and lifestyle articles can have a shares of both extremes ends of classifications, mostly in the starting of the week after any events that might have happened, or any other controversy that their target celebrity would be involved in.
For tech, a high number of positive ends shares can be expected, while social media, business, world can have a mix of all the types of sentimental categorical shares, depending on the latest news.
```{r, fig.width=20, fig.height=8}
ggplot(newspopData, aes(x= title_sentiment_polarity)) +
geom_density(aes(fill=title_Sentiment_Class), alpha = 0.7) +
facet_grid(. ~day)+
scale_fill_discrete(name = "Title Sentiment") +
labs(subtitle = 'Distribution of articles on title sentiment and day of the week',
x = 'Title Sentiment Polarity',
y = 'Density',
title = toupper(str_replace_all(params$data_channel, "_", " ")),
caption = "Source : News Popularity Dataset") +
theme(plot.caption = element_text(size=9, color="red", face="italic", hjust = 1))
```
The plot above shows the density of distribution of the shares of each type of title sentimental category for each day of the week, in other words, how much density is covered by each type of sentimental titled article for each day. Again, the trend would be the same for each type of the data, depending on various situations that drive these articles, and its publications.
# Modelling
Firstly, we are checking for the skewness in the data set for the response variable that is **"shares"**. For that we are plotting the distribution for the data of that variable below:
```{r, message=FALSE, warning=FALSE}
ggplot(newspopData) + geom_density(aes(x = shares))+
labs(title = "Shares Predictor Distribution in Dataset")
```
From the plot above showing distribution of shares, we can see that the distribution is left-skewed. To mitigate this, we are taking a log transform of that variable.
```{r, message=FALSE, warning=FALSE}
ggplot(newspopData) + geom_density(aes(x = log(shares)))+
labs(title = "Logarithmic transformation of shares distribution")
```
As seen from the plot above, after doing the log transformation, shares distribution can be estimated as a normal distribution, which will be better for the models that we would be fitting for this project. Hence, for all the models that are built in this project, the target response variable is the log transform **("logShares")** of the original shares variable.
```{r, message=FALSE, warning=FALSE}
set.seed(42)
# Removing mutated variables from the data set
lmData <- newspopData %>% select(-day, -title_Sentiment_Class, -content_Sentiment_Class, -article_popularity_index)
lmData$logShares <- log(lmData$shares)
# train test split
index <- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)
# Removing the shares variable because we are using logShares as response
trainDf <- lmData[index, ] %>% select(-shares)
testDf <- lmData[-index, ] %>% select(-shares)
```
## Variable Selection
For variable selection we used p-test values to consider significant coefficients from the linear model fit. For a confidence interval of 95%, the significant predictors are used to fit a Linear and a Lasso regression model.
For the significant parameter selection, our NULL and Alternate hypothesis is explained below:
H~0: All coefficients have zero slope
H~a: At-least one predictor variable has a non-zero slope
```{r}
modelLmVar <- lm(logShares~. ,data = trainDf)
# getting p-values
pVal <- data.frame(summary(modelLmVar)$coefficients[,4])
pVal$row_names <- row.names(pVal)
# checking if slope of intercept is significant (i.e. <<< 0.05)
pVal <- pVal %>% filter(pVal[,1] < 0.05)
cols = c()
# remove intercept term and store all significant intercepts in "cols" variable
if (pVal[,2][1] == "(Intercept)"){
cols = pVal[,2][-1]
}else{
cols = pVal[,2]
}
cols
```
## Multiple Linear Regression Model
Linear regression is a regression model that uses a straight line to describe a relationship between two or more predictor and a response variable. Out of the 54 predictor variables in the data set, the predictors that were deemed significant from the step above are used in this model, with a response variable of logShares.
```{r}
trControl <- trainControl(method = 'repeatedcv', number = 10)
# model training
modelLm <- train(logShares ~ .,
data = trainDf %>%
select(all_of(cols), logShares),
method = "lm",
preProcess = c('center', 'scale'),
trControl = trControl)
title = toupper(str_replace_all(params$data_channel, "_", " "))
# variable importance for linear model
plot(varImp(modelLm),
main = paste("Variable importance Plot for ", "\n", title))
lmPredict <- predict(modelLm, newdata = testDf %>% select(all_of(cols), logShares))
lmRes <- postResample(lmPredict, obs = testDf$logShares)
# Adjusted R-square
n <- dim(testDf)[1]
p <- dim(testDf %>% select(all_of(cols)))[2]
adjR2Linear <- 1-(((1-lmRes[2]^2)*(n-1))/(n-p-1))
lmRes
```
## Lasso Regression
LASSO or the Least Absolute Shrinkage and Selection Operator is a special type of linear model that uses penalty based shrinkage. Shrinkage here means converging data values to a central point, or where the absolute magnitude of the coefficients are shrunk to a 0 based on L1 regularization penalty method which adds a penalty factor(lambda) for every misclassified or wrongly predicted values.
LASSO generally works well with models having less variable parameters. From the above process, we have narrowed down a few most influencing and statistically significant variables which we will be using for this model as well.
The caret package is used to train the model here using the **lasso** method. The prediction metrics are then derived using the postResample function. The same predictor and response variable from the model above are used here as well.
```{r, message=FALSE, warning=FALSE}
set.seed(43)
# train
fitControl <- trainControl(method = "repeatedcv",number = 5)
fitLASSO <- train(logShares ~ . ,
data=trainDf,
method = "lasso",
trControl = fitControl,
preProcess = c("center", "scale")
)
predLASSO <- predict(fitLASSO, newdata = testDf)
coeff <- predict(fitLASSO$finalModel, type = "coef", mode = "fraction",
s = fitLASSO$bestTune$fraction, testDf)
lassoRes <- postResample(predLASSO, obs = testDf$logShares)
# Adjusted R-square
n <- dim(testDf)[1]
p <- dim(testDf)[2]
adjR2Lasso <- 1-(((1-lassoRes[2]^2)*(n-1))/(n-p-1))
lassoRes
```
## Random Forest
Before building the random forest model, seeing at the data, the data set is high dimensional, hence, it would be easier to select the "best" subsets of variables for random forest ensemble tree learning method to make the process more efficient. More details about Random Forest is given before the model is formulated.
## Best Variable Selection: Principal Component Analysis
Since the data set includes a large number of variables, it becomes important to select the best subsets of the data that defines the data set and variations in the data set for number of shares. To get these variables, here we are conducting Principal Component Analysis (PCA). PCA standardizes the variables, and represent new variables/PCs that are a function of the original variables from the data set.
Again, to keep the models separate, we are again splitting the data into train and test data set, and removing the categorical variables that were added previously in the data. After that the PCs are formulated.
```{r, message=FALSE, warning=FALSE}
set.seed(200)
newspopData$logShares <- log(newspopData$shares)
trainIndex <- createDataPartition(newspopData$shares, p = 0.7, list = FALSE)
trainData <- newspopData[trainIndex, ]
testData <- newspopData[-trainIndex, ]
```
Creating new PCs, and plotting the cumulative variance graph:
```{r, message=FALSE, warning=FALSE}
# Removing shares, logShares and mutated variables to form PCs
PCs <- prcomp(trainData %>%
select(-shares, -logShares, -day,
-title_Sentiment_Class,
-content_Sentiment_Class,
-article_popularity_index),
center=TRUE, scale=TRUE)
# cumulative variance
var = cumsum(PCs$sdev**2/sum(PCs$sdev**2))
show(plot(var, xlab = "Principal Component",
ylab = "Cum. Prop of Variance Explained", ylim = c(0, 1), type = 'b'))
pt <- min(which(var > 0.80))
points(pt, var[pt], col="red", cex=2, pch=20)
text(8, 0.8, col="red", paste0(round(100*var[pt],2), "% variance \nis explained by ", pt, " \nnumber of PC(s)"))
```
From the above graph, we can see that at least 80% variance is explained by `r pt` PCs, and hence for the subsequent steps, `r pt` PCs will be used.
Now, we are creating the training data for Random Forest Ensemble model.
```{r, message=FALSE, warning=FALSE}
# Creating training data PCs without the logShares variables
treeTrainData <- predict(PCs, newdata = trainData %>% select(-shares, -logShares)) %>%
as_tibble() %>%
select(PC1:paste0("PC", pt))
treeTrainData$logShares <- trainData$logShares
# This the final training data that will be used to train random forest model
treeTrainData
```
### Random Forest Model:
Random Forest model is a tree based ensemble learning method where the training data set is sampled into smaller data sets, and some variables from the data set are included in the smaller sampled data. A tree is fitted to that data. This method is repeated several times, and finally a model is finalized that will be an average of all these sampled tree models. Since Random Forest takes only a subset of variables in any given sampling, it creates really strong model to predict on with unknown variables since the sampled data is able to explain variance much better with fewer variables.
We are using train() function from caret package to train the Random Forest model using the **rf** method. Pre-processing is done by centering and scaling the data, and a 3 fold cross validation is also used. (The value 3 is selected to have some computation ease). Lastly, the model is tried for a number of values for the tuning parameter "mtry", out of which the best mtry value is selected.
Training the model:
```{r, message=FALSE, warning=FALSE}
set.seed(400)
trctrl <- trainControl(method = "repeatedcv", number = 3)
randomForest <- train(logShares ~., data = treeTrainData,
method = "rf",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = expand.grid(mtry = 1:as.integer(pt/3)))
randomForest
```
Testing the model on the test data set to see the metrics of evaluation. The predict() function is used to create both PC based testing data, and to predict the logShares using the Random Forest model. Further, the postResample() function is used to extract metrics such as RMSE, R-Squared, and MAE, represented below.
```{r, message=FALSE, warning=FALSE}
# Creating test data PCs
treeTestData <- predict(PCs, newdata = testData %>% select(-shares, -logShares)) %>%
as_tibble() %>%
select(PC1:paste0("PC", pt))
# Completing the test data
treeTestData$logShares <- testData$logShares
rfPredict <- predict(randomForest, newdata = treeTestData)
randomFRes <- postResample(rfPredict, obs = treeTestData$logShares)
# Adjusted R-square
n <- dim(treeTestData)[1]
p <- dim(treeTestData)[2] - 1
adjR2Rf <- 1-(((1-randomFRes[2]^2)*(n-1))/(n-p-1))
randomFRes
```
## Boosted Tree Model
Boosting is another tree based approach for improving the predictions resulting from a decision tree. Like other tree models, boosting can be applied for classification and regression. Boosting works in **sequentially**: each tree is grown using information from previously grown trees. In this project, we are using **gbm** method within caret train function.
An additional input tunegrid also provided to train function, which will find best hyper parameter combinations of number of trees, and depth.
Here, all the available parameters are used as predictor variables, and logShares is the target response variable.
```{r, message=FALSE, warning=FALSE}
# Training the model
boostedTree <- train(logShares ~., data = trainData %>% select(-shares), method = "gbm",
trControl = trainControl(method = "repeatedcv", number = 3),
preProcess = c("center", "scale"),
tuneGrid = expand.grid(n.trees = c(25, 50, 100, 150, 200),
interaction.depth = 1:4,
shrinkage = 0.1,
n.minobsinnode = 10),
verbose = FALSE)
# Best combination of tuning hyperparameters
print(boostedTree$bestTune)
boostedTPredict <- predict(boostedTree, newdata = testData %>% select(-shares))
boostedTRes <- postResample(boostedTPredict, obs = (testData %>% select(-shares))$logShares)
# plot
plot(boostedTree, xlab = "Max Tree Depth ",
ylab = "Root Mean Square Error (Found through CV)",
main = "All Predictors - Minimizing RMSE")
# Adjusted R-square
n <- dim(testData)[1]
p <- dim(testData)[2] - 1
adjR2Boosted <- 1-(((1-boostedTRes[2]^2)*(n-1))/(n-p-1))
boostedTRes
```
From the plot we can visualize the best combinations of parameters that were used to find minimum RMSE value using Boosted tree algorithm, which can also be seen by using **(boostedTree$bestTune)** function.
## Model Comparision
```{r, message=FALSE, warning=FALSE}
model_types = c("Linear", "Ensemble")
modelComparisonDf = tibble(model = c("Linear Model"), model_type = c(model_types[1]), RMSE = c(round(lmRes[[1]],3)), R_Squared = c(round(lmRes[[2]], 4)), Adj_R2 = c(round(adjR2Linear,4)) )
modelComparisonDf <- rbind(modelComparisonDf, c("LASSO", model_types[1], round(lassoRes[[1]],3), round(lassoRes[[2]],4), round(adjR2Lasso,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Random Forest", model_types[2], round(randomFRes[[1]],3), round(randomFRes[[2]],4), round(adjR2Rf,4)))
modelComparisonDf <- rbind(modelComparisonDf, c("Boosted Trees", model_types[2], round(boostedTRes[[1]],3), round(boostedTRes[[2]],4), round(adjR2Boosted,4)))
knitr::kable(modelComparisonDf)
```
```{r, message=FALSE, warning=FALSE}
bestLinearModel = ""
bestLinearRMSE = ""
bestLinearR2 = ""
bestEnsembleModel = ""
bestEnsembleRMSE = ""
bestEnsembleR2 = ""
bestOverallModel = ""
bestOverallRMSE = ""
bestOverallR2 = ""
lineardf <- modelComparisonDf %>% filter(model_type == "Linear")
ensembleDf <- modelComparisonDf %>% filter(model_type == "Ensemble")
if (length(unique(lineardf$R_Squared)) > 1){
bestLinearModel <- lineardf[which.min(lineardf$RMSE),][[1]]
bestLinearRMSE <- lineardf[which.min(lineardf$RMSE),][[3]]
bestLinearR2 <- lineardf[which.min(lineardf$RMSE),][[4]]
} else{
bestLinearModel <- lineardf[which.max(lineardf$Adj_R2),][[1]]
bestLinearRMSE <- lineardf[which.max(lineardf$Adj_R2),][[3]]
bestLinearR2 <- lineardf[which.max(lineardf$Adj_R2),][[4]]
}
if (length(unique(ensembleDf$R_Squared)) > 1){
bestEnsembleModel <- ensembleDf[which.min(ensembleDf$RMSE),][[1]]
bestEnsembleRMSE <- ensembleDf[which.min(ensembleDf$RMSE),][[3]]
bestEnsembleR2 <- ensembleDf[which.min(ensembleDf$RMSE),][[4]]
} else{
bestEnsembleModel <- ensembleDf[which.max(ensembleDf$Adj_R2),][[1]]
bestEnsembleRMSE <- ensembleDf[which.max(ensembleDf$Adj_R2),][[3]]
bestEnsembleR2 <- ensembleDf[which.max(ensembleDf$Adj_R2),][[4]]
}
if (length(unique(modelComparisonDf$R_Squared)) > 1){
bestOverallModel <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[1]]
bestOverallRMSE <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[3]]
bestOverallR2 <- modelComparisonDf[which.min(modelComparisonDf$RMSE),][[4]]
} else{
bestOverallModel <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[1]]
bestOverallRMSE <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[3]]
bestOverallR2 <- modelComparisonDf[which.max(modelComparisonDf$Adj_R2),][[4]]
}
```
## Conclusion
Selection of best model is based on comparing the lowest prediction RMSE value across all models.
For the data set for **`r toupper(str_split(params$data_channel, "_")[[1]][4])`**:
The best linear model is `r bestLinearModel` with RMSE value of `r bestLinearRMSE`, and R Squared value of `r bestLinearR2`.
The best ensemble method is `r bestEnsembleModel` with RMSE value of `r bestEnsembleRMSE`, and R Squared value of `r bestEnsembleR2`.
The best overall model is `r bestOverallModel` with RMSE value of `r bestOverallRMSE`, and R Squared value of `r bestOverallR2`.