-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathExplore and Sumarize Data.Rmd
687 lines (508 loc) · 23.3 KB
/
Explore and Sumarize Data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
---
title: 'P4: Explore and Sumarize Data'
author: "David Gonzalez Cagigas"
date: "7 Februar 2017"
output: html_document
---
========================================================
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
# Load all of the packages that you end up using
# in your analysis in this code chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk.
# This prevents the code from displaying in the knitted HTML output.
# You should set echo=FALSE for all code chunks in your file.
#
#install.packages("ggplot2", dependencies = T)
#install.packages("knitr", dependencies = T)
#install.packages("dplyr", dependencies = T)
#install.packages('gridExtra')
#install.packages("GGally")
source("https://raw.githubusercontent.com/briatte/ggcorr/master/ggcorr.R")
library(gridExtra)
library(ggplot2)
library(knitr)
library(dplyr)
```
First of all, we will load the data set Red Wine into pf, and see some basic
information.
```{r, Load_the_Data}
pf <- read.csv('wineQualityReds.csv')# Load the Data
pf$X <- NULL
names(pf)
str(pf)
summary(pf)
```
We can appreciate the variable names in the dataset, number of obervations,
number and type of variables and other basic statistic information about the
dataset. I have omitted the first variable as it is just the index.
If we dive deeper, we see that there are 6 qualities (from 3 to 8), the alcohol
degrees is between 8.40 and 14.90 degrees. Density vary from 0.9901 to 1.0037.
And 10 more variables. It will be intersting to see the pH concentration,
alcohol or density depending on the quelity level.
# Univariate Plots Section
> In this section, you should perform some preliminary exploration of
your dataset. Run some summaries of the data and create univariate plots to
understand the structure of the individual variables in your dataset. Don't
forget to add a comment after each plot or closely-related group of plots!
There should be multiple code chunks and text sections; the first one below is
just to help you get started.
We will start showing one of the most important features, which is the quality
of the wine. With this plot we weill have an idea about what is the most common
quality in wines.
```{r}
by(pf$density, pf$quality, median)
by(pf$alcohol, pf$quality, median)
by(pf$pH, pf$quality, median)
qplot(factor(quality), data=pf, geom="bar")
```
As you can see, density and alcohol do not have a dependency on que quality of
the wines, however quality seems to improve when Ph mean decrease. It is useful
to know the number of observation for each quality.
Histograms
```{r}
grid.arrange( ggplot(aes(x=alcohol),data=pf) +
geom_histogram( binwidth=0.5) +
ggtitle("Histogram of Alcohol (%)"), ggplot(aes(x=1, y=alcohol),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$alcohol)
qplot(x = alcohol, data = pf, binwidth = 0.5, geom = "freqpoly")
```
Alcohol distribution is a right skewed, concentrated in 7.5. We can see a few
outliers. It seems to be more common the low alcohol wines than the high alcohol
wine. As you can see the most common wines have unde 10 grades.
```{r}
grid.arrange( ggplot(aes(x=total.sulfur.dioxide),data=pf) +
geom_histogram( binwidth=10) +
ggtitle("Histogram of total sulfur dioxide (mg/dm^3)"),
ggplot(aes(x=1, y=total.sulfur.dioxide),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$total.sulfur.dioxide)
qplot(x = total.sulfur.dioxide, data = pf, binwidth = 10, geom = "freqpoly")
```
Sulfur Dioxide distribution is a right skewed, concentrated in 20. We can see
multiples outliers, specially 2. With the Sulfur Dioxide happens the same as
with alcohol. Most of wines have a low quantity of sulfur dioxide, the peak is
at 40g/dm3.
```{r}
grid.arrange( ggplot(aes(x=fixed.acidity),data=pf) +
geom_histogram( binwidth=1) +
ggtitle("Histogram of fixed acidity (g/dm^3)"),
ggplot(aes(x=1, y=fixed.acidity),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$fixed.acidity)
qplot(x = fixed.acidity, data = pf, binwidth = 1, geom = "freqpoly")
```
We can see the fixed acidity is a right skewed graphic concentred in 7. We can
see multiple outliers. Fixed acidity has a similar distribution as the other
features with its peak at 7g/dm3.
```{r}
grid.arrange( ggplot(aes(x=volatile.acidity),data=pf) +
geom_histogram( binwidth=0.05) +
ggtitle("Histogram of volatile acidity (g/dm^3)"),
ggplot(aes(x=1, y=volatile.acidity),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$volatile.acidity)
qplot(x = volatile.acidity, data = pf, binwidth = 0.05, geom = "freqpoly")
```
The distribution of volatile acidity is a bimodal right skewed, concentred in
0.35 and 0.6. There are a few outliers. Volatile acidity has a similar
distribution as the other features, however volatile acidity has a broad peak
between 0.4 and 0.7 g/dm3.
```{r}
grid.arrange( ggplot(aes(x=citric.acid),data=pf) +
geom_histogram( binwidth=0.05) +
ggtitle("Histogram of citric acid (g/dm^3)"),
ggplot(aes(x=1, y=citric.acid),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$citric.acid)
qplot(x = citric.acid, data = pf, binwidth = 0.05, geom = "freqpoly")
```
Citrid Acid is a not normal distribution. Only an outlier. It doesn't seem to be
a common citrid acid for the wines in the data set.
```{r}
grid.arrange( ggplot(aes(x=residual.sugar),data=pf) +
coord_cartesian(xlim = c(1,5)) +
geom_histogram( binwidth=0.5) +
ggtitle("Histogram of residual sugar (g/dm^3)"),
ggplot(aes(x=1, y=residual.sugar),data=pf) +
geom_boxplot( ) , nrow =1)
summary(pf$residual.sugar)
qplot(x = residual.sugar, data = pf, binwidth = 0.5, geom = "freqpoly") +
coord_cartesian(xlim = c(1,5))
```
Several outliers. As most of the features, it is a right skewed distribution
concentrated in 2 g/dm3.
```{r}
grid.arrange( ggplot(aes(x=chlorides),data=pf) +
coord_cartesian(xlim = c(0,0.2)) +
geom_histogram( binwidth=0.01) +
ggtitle("Histogram of chlorides (g/dm^3)"),
ggplot(aes(x=1, y=chlorides),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$chlorides)
qplot(x = chlorides, data = pf, binwidth = 0.01, geom = "freqpoly") +
coord_cartesian(xlim = c(0,0.2))
```
Chloricles is right skewed distribution as well, concentrated in 0.08. Several
outliers. Chloricles has a similar distribution as the other
features with its peak at 0.08 g/dm3.
```{r}
grid.arrange( ggplot(aes(x=free.sulfur.dioxide),data=pf) +
geom_histogram( binwidth=5) +
ggtitle("Histogram of free sulfur dioxide (mg/dm^3)"),
ggplot(aes(x=1, y=free.sulfur.dioxide),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$free.sulfur.dioxide)
qplot(x = free.sulfur.dioxide, data = pf, binwidth = 5, geom = "freqpoly")
```
Free Sulfur Dioxide is a right skewed distribution concentrated in 6. We can see
a few outliers. Free Sulfur Dioxide has a similar distribution as the other
features with its peak at 10 mg/dm3.
```{r}
grid.arrange( ggplot(aes(x=density),data=pf) +
geom_histogram( binwidth=0.001) +
ggtitle("Histogram of density (g/cm^3)"),
ggplot(aes(x=1, y=density),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$density)
qplot(x = density, data = pf, binwidth = 0.001, geom = "freqpoly")
```
We can see the density distribution is normal and concentrated in 0.997. We can
see a few outliers.
```{r}
grid.arrange( ggplot(aes(x=sulphates),data=pf) +
coord_cartesian(xlim = c(0,1.5)) +
geom_histogram( binwidth=0.1) +
ggtitle("Histogram of sulphates (g/dm^3)"),
ggplot(aes(x=1, y=sulphates),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$sulphates)
qplot(x = sulphates, data = pf, binwidth = 0.1, geom = "freqpoly") +
coord_cartesian(xlim = c(0,1.5))
```
Sulphates distribution is right skewed, concentrated in 0.6. Several outliers.
Free Sulfur Dioxide has a similar distribution as the other features with its
peak at 0.6 g/dm3.
```{r}
grid.arrange( ggplot(aes(x=pH),data=pf) +
geom_histogram( binwidth=0.1) +
ggtitle("Histogram of pH"),
ggplot(aes(x=1, y=pH),data=pf) +
geom_boxplot( ), nrow =1)
summary(pf$pH)
qplot(x = pH, data = pf, binwidth = 0.1, geom = "freqpoly")
```
pH distribution is normal concentrated in 3.3. We can appreciate a few outliers.
We can see above a simple histogram for each feature included in the analysis.
Theses histograms help you to have an idea about the dataset.
```{r}
qplot(x = density, data = pf, binwidth = 0.0001)+ ##A line geometry
facet_wrap(~quality)+
geom_histogram()
qplot(x = alcohol, data = pf, binwidth = 0.01)+ ##A line geometry
facet_wrap(~quality)+
geom_histogram()
qplot(x = pH, data = pf, binwidth = 0.01)+ ##A line geometry
facet_wrap(~quality)+
geom_histogram()
```
We can appreciate that pH tend to have similar values between 3 and 3.5,
specially in quality 5, 6 and 7 the other qualities can not be appriciated due
to lack of data. Density seems to have the same tendency, however alcohol
have a flater graphs, where you can see a tendency in quality 5 (about 11
degrees), meanwhile quality 7 is between 9 and 13 or quality 6 is between
8 and 13.
# Univariate Analysis
### What is the structure of your dataset?
There are 1599 Red Wines analysed in 12 different characteristics
(fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides,
free.sulfur.dioxide, total.sulfur.dioxidedensity, pH, sulphates, alcohol and
quality)
### What is/are the main feature(s) of interest in your dataset?
The main feature is the quality. It is interesting to caompare any
characteristic of the wine depending on the final quality.
### What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?
I use all the features of the wine to compare theis variation of them with the
quality or bewteen them.
### Did you create any new variables from existing variables in the dataset?
### Of the features you investigated, were there any unusual distributions? Did
you perform any operations on the data to tidy, adjust, or change the form of
the data? If so, why did you do this?
Since the data is cleaned, I didn't perfomr any cleaning process. I realized
most of graphs are right skewed. And graphs with vlaues of the highest quality
do not have a defined shape. It might be because of the lack of samples.
# Bivariate Plots Section
> It might be interesting a lot of correlations. As we have 12
variables we will try with several combinations.
We will start the bivariate section with a correlation analysis that can help
us to decide which features are more likely to have a relation.
```{r}
ggcorr(pf, nbreaks = 4, palette = "RdGy",
label = TRUE, label_size = 3, label_color = "white")
```
As we can see on the above plot, the feature with most correlation are free.
sulfur.dioxide-total.sulfur.dioxide, fixed.acidity-citrid.acid,
fixed.acidity-pH and density-dixed.acidity.
Along this analysis we will try to focus on the features, but we will study
other features as well. We weill start with a scatter plot comparing the
features with the highest grade of correlation.
```{r, Bivariate_Plots}
ggplot(aes(x= total.sulfur.dioxide, y = free.sulfur.dioxide) , data = pf) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality'))
ggplot(aes(x= fixed.acidity, y = density) , data = pf) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality'))
ggplot(aes(x= fixed.acidity, y = citric.acid) , data = pf) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality'))
ggplot(aes(x= pH, y = fixed.acidity) , data = pf) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality'))
```
I displayed four graphs showing the relation between 2 variables. We can see
easily a correlation between all the features compared.
Total Sulfur Dioxide - Free Sulfur Dioxide
The bigger Total Sulfur Dioxide is, the bigger Free Sulfur Dioxide is. It seems
to be a proportional relation between these features.
Density - Fixed Acidity
The bigger Density is, the bigger Fixed Acidity is. It seems to be a
proportional relation between these features.
Citrid Acid - Fixed Acidity
The bigger Density is, the bigger Fixed Acidity is. It seems to be a
proportional relation between these features.
pH - Fixed Acidity
The bigger Density is, the smaller Fixed Acidity is. It seems to be a
proportional inverse relation between these features.
The result show what we have see with the correlation analysis, Fixed Acidity
and pH has a inverse proportional correlation meanwhile the other analysis has
a poportional correlation.
```{r}
quality_groups <- group_by(pf, quality) # first groups data by age
pf.by_quality <- summarise(quality_groups, # then summarizes
volatile.acidity_mean = mean(volatile.acidity),
fixed.acidity_mean = mean(fixed.acidity),
residual.sugar_mean = mean(residual.sugar),
chlorides_mean = mean(chlorides),
free.sulfur.dioxide_mean = mean(free.sulfur.dioxide),
total.sulfur.dioxide_mean = mean(total.sulfur.dioxide),
sulphates_mean = mean(sulphates),
density_mean = mean(density),
alcohol_mean = mean(alcohol),
pH_mean = mean(pH),
n = n() # number of items in each group
)
pf.by_quality <- arrange(pf.by_quality, quality) # and sort by age
pf.by_quality
```
We can see a mean table of each variable separated by quality level. We will see
easily correlation in some of them like chlorides_mean which decrease when the
quality decrease, or pH with a similar behaviour. We will display this graphs
below.
```{r}
ggplot(aes(factor(quality), alcohol), data = pf) +
geom_jitter( alpha = .3) +
geom_boxplot( alpha = .5,color = 'blue') +
stat_summary(fun.y = "mean", geom = "point",
color = "red", shape = 8, size = 4)
ggplot(aes(factor(quality), volatile.acidity), data = pf) +
geom_jitter( alpha = .3) +
geom_boxplot( alpha = .5,color = 'blue') +
stat_summary(fun.y = "mean", geom = "point",
color = "red", shape = 8, size = 4)
ggplot(aes(factor(quality), free.sulfur.dioxide), data = pf) +
geom_jitter( alpha = .3) +
geom_boxplot( alpha = .5,color = 'blue') +
stat_summary(fun.y = "mean", geom = "point",
color = "red", shape = 8, size = 4)
ggplot(aes(factor(quality), sulphates), data = pf) +
geom_jitter( alpha = .3) +
geom_boxplot( alpha = .5,color = 'blue') +
stat_summary(fun.y = "mean", geom = "point",
color = "red", shape = 8, size = 4)
ggplot(aes(factor(quality), pH), data = pf) +
geom_jitter( alpha = .3) +
geom_boxplot( alpha = .5,color = 'blue') +
stat_summary(fun.y = "mean", geom = "point",
color = "red", shape = 8, size = 4)
ggplot(aes(factor(quality), density), data = pf) +
geom_jitter( alpha = .3) +
geom_boxplot( alpha = .5,color = 'blue') +
stat_summary(fun.y = "mean", geom = "point",
color = "red", shape = 8, size = 4)
```
One of the most interesting variables is the quality. In the plot above, I
compare this features with the other. We can see thanks to the box plot the
distribution of every feature divided by quality.
```{r}
cor(pf$alcohol, pf$density)
cor(pf$residual.sugar, pf$density)
cor(pf$fixed.acidity, pf$density)
cor(pf$pH, pf$density)
cor(pf$alcohol, pf$volatile.acidity)
cor(pf$residual.sugar, pf$volatile.acidity)
```
There is barely a correlation between residual sugar and volatile acidity, as we
expected. However we can find a correlation of 0.668 between density and acidity
which means that the bigger density the wine will be more acid. Alcohol and
density has a inverse correlation, it means the more alcohol the less dense the
wine is.
It happen the same with alcohol and acidity to lesser extent.
```{r}
q1 <- ggplot(aes(x = volatile.acidity_mean, y = pH_mean), data = pf.by_quality) +
geom_line() +
geom_smooth()
q2 <- ggplot(aes(x = volatile.acidity_mean, y = free.sulfur.dioxide_mean),
data = pf.by_quality) + geom_line() + geom_smooth()
q3 <- ggplot(aes(x = density_mean, y = pH_mean), data = pf.by_quality) +
geom_line() + geom_smooth()
q4 <- ggplot(aes(x = density_mean, y = fixed.acidity_mean),
data = pf.by_quality) + geom_line() + geom_smooth()
library(gridExtra)
grid.arrange(q1,q2,q3,q4,ncol = 1)
```
This three last graphs show the relation between pHacidity,sulfur
dioxide-acidity and pH-density. We see relation betwen pH and acidity, it is
a left skewed graph while the second sample is a right skewed graph.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?
As we have said before, there are a lot of correlations between variables. One
example is acidity and density. We got a correlation of 0.67. It seems that the
more dense is the wine the taste get more acid.
### Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?
We can see many correlations between most of variables. For example residual
sugar and density, which seems logical as the more sugar more dense. Or pH and
density.
### What was the strongest relationship you found?
Density and acididity was the strongest one.
# Multivariate Plots Section
```{r, Multivariate_Plots}
ggplot(aes(x = alcohol, y = citric.acid, color = factor(quality)), data = pf) +
geom_point(alpha = 0.8, size = 2) +
geom_smooth(method = "lm", se = FALSE,size=1) +
theme_dark() +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality'))
ggplot(pf, aes(x = sulphates, y = alcohol, colour=factor(quality))) +
geom_density2d(bins=2) +
scale_color_brewer() +
theme_dark() +
geom_point(alpha=0.1)
ggplot(aes(x = volatile.acidity, y = residual.sugar, color = factor(quality)),
data = pf) + geom_point(alpha = 0.8, size = 2) +
geom_smooth(method = "lm", se = FALSE,size=1) +
theme_dark() +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality'))
```
First graph we can appreciate a relation between quality and alcohol, as most of
white points are wines with low alcohol. However it is not easy to see a
relation between density and pH. Density distribution does not depends on pH.
In the second graph seems clear, good quality winehas the biggest concentration
of alcohol and sulphate.
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
As we can see in last three graphs, alcohol, pH, sulphates and residual.sugar
are important for wine quality however ther is not a perfect value for each
feature. The quality level depends on the sum up of every feature.
We can see high contained in alcohol and low residual.sugar with high quality
and oposite.
### Were there any interesting or surprising interactions between features?
Personally, I expected that all features tend to a perfect value. But as I said
in last question, wine quelity is a summ of features.
------
# Final Plots and Summary
You've done a lot of exploration and have built up an understanding
of the structure of and relationships between the variables in your dataset.
Here, you will select three plots from all of your previous exploration to
present here as a summary of some of your most interesting findings. Make sure
that you have refined your selected plots for good titling, axis labels (with
units), and good aesthetic choices (e.g. color, transparency). After each plot,
make sure you justify why you chose each plot by describing what it shows.
### Plot One
```{r, Plot_One}
qplot(factor(quality), data=pf, geom="bar", main = "Wines distribution")+
xlab("Quality") +
ylab("Number of wines")
```
### Description One
This first graph represent the number of wines analyzed in each quality. We can
see most of wines have a quality of 5 or 6, meanwhile the qualities 3, 4 and 8
have barely wines.
### Plot Two
```{r, Plot_Two}
q1 <- ggplot(aes(x = quality, y = pH), data = pf) +
geom_jitter( alpha = .1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality')) +
xlab("Quality") +
ylab("pH") +
ggtitle("Quality / pH")
q2 <- ggplot(aes(x = quality, y = free.sulfur.dioxide), data = pf) +
geom_jitter( alpha = .1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality')) +
xlab("Quality") +
ylab("Free Sulfur Dioxide (eq/L)") +
ggtitle("Quality / SO2 Mean")
q3 <- ggplot(aes(x = quality, y = alcohol), data = pf) +
geom_jitter( alpha = .1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality')) +
xlab("Quality") +
ylab("Alcohol (%)") +
ggtitle("Quality / Alcohol (%)")
q4 <- ggplot(aes(x = quality, y = fixed.acidity), data = pf) +
geom_jitter( alpha = .1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality')) +
xlab("Quality") +
ylab("Fixed Acidity (g/L)") +
ggtitle("Quality / Fixed Acidity")
library(gridExtra)
grid.arrange(q1,q2,q3,q4,ncol = 2)
```
### Description Two
We can see 4 scatter plot comparing the quality with 4 different features,
pH, Free Sulfur Dioxide, Alcohol and Fixed Acidity. Here we can see the most
likely value for each level.
### Plot Three
```{r, Plot_Three}
ggplot(aes(x=pH, y=alcohol, color=factor(quality)), data = pf) +
geom_point(alpha = 0.8, size = 2) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq', guide=guide_legend(title='Quality')) +
xlab("pH") +
theme_dark() +
ylab("Alcohol Concentration (Degree)") +
labs(color = "Quality") +
ggtitle("pH, Alcohol, Quality and Density Graph Relation")
```
### Description Three
Here we can appreciate many relations. We can clearly see whiter colors are
below while darker colors are above, it means quality and alcohol has a
proportional relation. Ph does not seem to be related to any other parameter and
density seems to be lower with higher alcohol concentration.
------
# Reflection
The wines data set contains information on 1599 wines across twelve variables
from around 2009. Being an example dataset I expected clear relations between
its features, however barely you can find some relations. First thing surprised
me was that most of wines measured had a quality of 5 or 6 while qualities as 3,
4 or 8 have very few samples. I guess it means that most of wines has a normal
quality and just a few have a high or low quality.
The most cleare relations I saw was the alcohol concentration, it is related
with quality and with density.
Another thing is every feature has a perfect value indicidually, as you can see
in the firsts graphs where I compare quality with every feature. However, when
you have a best quality wine, these values change. It is like a lower value in
alcohol can compensate a higher value in sugar.