-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathqPCR.Rmd
884 lines (679 loc) · 31.7 KB
/
qPCR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
---
title: "E03 - qPCR (Analysis of Gene Expression @ UCT Prague)"
author:
- Jiri Novotny jiri.novotny@img.cas.cz
- Studuj bioinformatiku! http://studuj.bioinformatiku.cz
institute: "Laboratory of Genomics and Bioinformatics @ Institute of Molecular Genetics of the ASCR"
output:
rmdformats::readthedown:
highlight: "kate"
lightbox: true
thumbnails: true
gallery: true
toc_depth: 4
self_contained: true
number_sections: false
toc_collapsed: false
df_print: "paged"
date: "`r Sys.Date()`"
---
```{r, child = here::here("_assets/custom.Rmd"), eval = TRUE}
```
```{js, echo = FALSE}
$(document).ready(function() {
$("#div_lfc_table table").addClass("table table-condensed");
});
```
***
Copy, please, these files and directories to your personal directory:
```{bash, eval = FALSE}
cp -r ~/shared/AGE_current/Exercises/{E03-qPCR,age_library.R} ~/AGE/Exercises
```
**Assignment**
The assignment for this exercise is to implement (mainly visualization) functions, which you will use later on different types of gene expression data.
You will implement these functions in `age_library.R` where are empty function definitions.
You don't have to stick to all predefined parameters and returned variables, just keep in mind that your implementations should make outputs similar to outputs in this document.
Also, in several cases there are alternative functions (such as `plot_pca()` and `plot_pca_ggpairs()`), but you don't have to implement them all, if not said.
Sections containing functions to be implemented are marked with `r emo::ji("hammer_and_wrench")`
Anyway, here is the minimum set of functions you need to implement (arguments are up to you, it's about the output):
- `plot_hc(m, color_by)`:
Calculates hiearchical clustering of matrix `m` columns and plot a dendrogram, whose labels are colored according to `color_by` vector.
- `plot_pca(m, sample_data, color_by)`:
Calculates PCA of matrix `m` and makes a scatterplot of first two principal compoments with points colored by `color_by` column in `sample_data`.
- `plot_heatmap(m, column_annotation, z_score = FALSE)`:
Makes a heatmap of matrix `m` with column annotation specified in `column_annotation` dataframe.
Optionally, converts values in `m` to z-score (row-wise).
- `plot_boxplots(plot_data, x, y)`:
Using long `plot_data`, plots boxplots of values in `y` column, splitted to groups defined in `x` column.
- `compute_m(gene, cp)`:
Computes the M value of $C_P$ values (geNorm algorithm).
Pseudocode can be found in documentation for this function in `age_library.R`.
- `test_gene(gene, gene_data, gene_col, value_col, group_col)`:
Using long `gene_data`, calculates between group statistical test (e.g. t-test) of `gene`
defined in `gene_col` column with values in `value_col` column, and groups in `group_col` column.
- `test_gene_table(gene_data, gene_col, value_col, group_col)`:
The same as `test_gene()`, but calculates tests for all genes in `gene_col` column and returns a dataframe of results.
**After you implement these functions, modify their calling in this document and render it again.**
Obviously, you need to remove sections with functions you haven't implemented.
Please, return the assignment in this form:
- In the header, put your name and e-mail to `author` parameter (under `params`).
- Copy all finished files to directory named `E03-qPCR_assignment_<your_name>`: modified Rmd, rendered HTML, `age_library.R` with your implementations.
Put this directory to ZIP archive named `E03-qPCR_assignment_<your_name>.zip`.
- In rendered HTML there should be included all code chunks with full output (warnings, messages, etc.).
- Upload the ZIP file to the appropriate assignment in MS Teams.
> Note that your functions will be universal, serving not only for qPCR data.
This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.),
where input (expression matrix) is the same also for microarray and RNA-seq data.
***
# Introduction
[Quantitative real-time PCR](https://en.wikipedia.org/wiki/Real-time_polymerase_chain_reaction) monitors the amplification
of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.
`)
In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules.
This amplification depends on initial RNA concentration and approximates the gene expression.
We select a fluorescence signal threshold $S_T$ and measure crossing points $C_P$ (also called threshold cycle $C_T$) for samples.
To correct for initial RNA concentration and quality, reference genes are used (dashed curves).
These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline.
First, we normalize target genes in each sample to reference gene (or they geometric mean) by calculating $\Delta C_P = C_P^t - C_P^r$.
Obtained $\Delta C_P$ values are then used to calculate the <ins>relative change</ins> of gene expression between samples.
In this case, we are interested how much more or less is gene expressed in $\text{treatment}$ sample ($\text{trt}$), relative to $\text{control}$ ($\text{cnt}$) sample.
This relative change is called $\text{fold-change}$: $FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}$.
We usually report $FC$ in $log_2$ scale: $LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC)$
Important is to realize that the higher $C_P$ is, the **lower** initial mRNA concentration was.
This is why we put minus /$-$/ in front of $\Delta\Delta C_P$.
We usually don't consider the efficiency of PCR and expect the mRNA concentration to double in each cycle.
Otherwise $FC = (1+E)^{-\Delta\Delta C_P}$, where $E$ is the PCR efficiency.
# Our experiment
We will work with data from a pancreatic tumor experiment where
an influence of [spirulina](https://en.wikipedia.org/wiki/Spirulina_(dietary_supplement)) algae was investigated.
In this experiment there are two sample groups:
- **control**: samples from tumor.
- **spirulina**: samples from tumor treated by extract from spirulina.
Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh, a nested design, right...

But we can expand it to:
```
- control group
|
|-- biological replicate #1 (Petri dish #1)
| |-- confluence 50
| | |-- technical replicate #1 -> sample C1_C50_R1
| | |-- technical replicate #2 -> sample C2_C50_R2
| |-- confluence 90
| |-- technical replicate #1 -> sample C1_C90_R1
| |-- technical replicate #2 -> sample C1_C90_R2
|-- biological replicate #2 (Petri dish #2)
|-- confluence 50
| |-- technical replicate #1 -> sample C2_C50_R1
| |-- technical replicate #2 -> sample C2_C50_R2
|-- confluence 90
|-- ...
- spirulina group
|
|-- biological replicate #1 (Petri dish #1)
| |-- confluence 50
| | |-- technical replicate #1 -> sample S1_C50_R1
| | |-- technical replicate #2 -> sample S1_C50_R2
| |-- confluence 90
| |-- technical replicate #1 -> sample S1_C90_R1
| |-- technical replicate #2 -> sample S1_C90_R2
|-- biological replicate #2 (Petri dish #2)
|-- confluence 50
| |-- technical replicate #1 -> sample S2_C50_R1
| |-- technical replicate #2 -> sample S2_C50_R2
|-- confluence 90
|-- ...
```
Also, there are two groups of measured genes:
1. **Target genes** - interesting genes found prior by microarray exploratory analysis.
2. **Housekeeping genes** - used as reference genes.
# Libraries
```{r}
library(here)
library(tidyverse)
library(dplyr)
library(glue)
library(magrittr)
```
Now we source your personal library:
```{r}
source(here("age_library.R"))
```
# Config
I would recommend you to always define constants at the beginning of a script.
It is more transparent, and if you, for example, change some path, you haven't to replace it in whole source code.
Constants are, by convention, named in UPPERCASE.
```{r}
BASE_DIR <- here("E03-qPCR")
CP_DATA <- here(BASE_DIR, "data/qPCR.Rds")
```
# Data preprocessing
## $C_P$ matrix
We have already preprocessed a $C_P$ matrix for you.
But if you really want to start from the floor, you can create the $C_P$ matrix from original files
([here](data/2015-06-08_Strnad.txt) and [here](data/2015-06-08_Strnad.xls)).
Not very pretty, right? `r emo::ji("face_with_hand_over_mouth")`
In $C_P$ matrix, columns are samples, rows are genes, and values are $C_P$ values.
This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-seq.
```{r}
cp <- readRDS(CP_DATA) %>%
as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)
cp[1:5, 1:5]
```
CP values `>= 40` are considered out of qPCR limit and so we set them to `NA`:
```{r}
cp[cp > 39.9] <- NA
```
Some genes have so many `NA` values (respectively, they don't have many values other than `NA`):
```{r}
lq_genes <- rowSums(!is.na(cp))
lq_genes
```
Let's remove genes with number of non-NA values less or equal to 2:
```{r}
lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]
```
Now we split genes to target and reference ones.
The latter have `H<number>` on the end of their names, which we can match with regular expression (regex) `H\d+$`.
```{r}
ref_genes <- str_subset(genes, "H\\d+$")
target_genes <- setdiff(genes, ref_genes)
gene_groups <- if_else(genes %in% ref_genes, "reference", "target")
genes_df <- data.frame(
gene = as.character(genes),
gene_group = factor(gene_groups, levels = c("reference", "target")),
row.names = genes,
stringsAsFactors = FALSE
)
genes_df
```
For plotting purposes, we create a new dataframe with `NA`s replaced by `40`:
```{r}
cp_plot <- replace(cp, is.na(cp), 40)
```
## Phenotypical data (sample sheet)
Along with measured data you usually obtain a sample sheet, but in this case we are going to parse the column names of the $C_P$ matrix.
Everything needed is contained there:
- K and SP: sample groups
- Number behind sample group: Petri dish (= biological replicate)
- Number behind slash ("/"): confluence
- (2RT): second technical replicate
`str_match()` is returning a character matrix of capturing groups (columns) for each string (rows).
The first column is always the full match.
```{r}
(samples <- colnames(cp))
pheno_data <- data.frame(
sample_id = samples,
# Match with a single capturing group (full match): "SP" or "K" on the beginning.
sample_group_code = str_match(samples, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
# Match a sample group on the beginning and then the number behind (second group).
petri_number = str_match(samples, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
# Match a two-digit number behind "/".
confluence = str_match(samples, "\\/(\\d{2})")[, 2] %>% as.factor(),
# Detect a second technical replicate. See how literal match of parentheses ("(" and ")") must be written:
# first you have to escape backslash in R as "\\" to escape parenthese for regex as "\\(", which will match the character "(" literally,
# e.g. not specifying the regex capturing group.
replicate = if_else(str_detect(samples, "\\(2RT\\)"), 2L, 1L) %>% as.factor()
) %>%
dplyr::mutate(
sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
) %>%
dplyr::select(sample_name_rep, sample_name, everything()) %>%
set_rownames(.$sample_name_rep)
colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)
head(pheno_data)
```
If you are not sure what the regexes above mean, refer to the `stringr / Regular expressions` section in E02 - Intro to R.
## Long data
We also create long data from `cp`, `pheno_data` and `genes_df`:
```{r}
cp_long <- as.data.frame(cp) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
dplyr::left_join(genes_df, by = "gene") %>%
dplyr::mutate(
cp_plot = if_else(is.na(cp), 40, cp)
) %>%
dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())
head(cp_long)
```
# Exploratory analysis on raw data
We are using exploratory analysis to have an unbiased first look at data.
It is also a crucial step to assess the biological quality control,
e.g. whether samples from distinct groups create separate clusters -
if not, samples might be swapped etc.
## Hiearchical clustering `r emo::ji("hammer_and_wrench")`
> Function to be implemented within the assignment.
It takes an expression matrix and returns a dendrogram with coloring by a chosen variable.
I recommend you to use the [dendextend](https://cran.r-project.org/web/packages/dendextend/index.html) package
to modify a dendrogram, it is much easier than doing so in base R.
```{r}
coloring_variables <- c("sample_group", "confluence", "petri_number")
for (variable in coloring_variables) {
plot_hc(
cp_plot, color_by = pheno_data[, variable],
method_distance = "euclidean", method_clustering = "complete",
color_by_lab = variable
)
}
```
We can see the main effect - with several exceptions, samples are clustered by sample groups, which is good.
There isn't any effect of other variables, which would be confounding.
## PCA `r emo::ji("hammer_and_wrench")`
> Function to be implemented within the assignment.
It will perform PCA visualization with point coloring by a chosen variable.
You can also output plots of first N principal components and variance explained bar plot.
We can see that the first principal component (PC1) is nicely separating our samples by sample group:
```{r}
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plot
plot_pca(cp_plot, pheno_data, plot_type = "multi", color_by = "sample_group", shape_by = "confluence")$plot
```
```{r, height = 8}
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plot
```
```{r}
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot
```
- You can also use `GGally::ggpairs()` to plot a grid of PCA plots (PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3, etc.):
```{r}
plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plot
```
## Heatmap `r emo::ji("hammer_and_wrench")`
> Function to be implemented within the assignment.
`ComplexHeatmap` package is preferred. For interactive heatmaps you can use
[heatmaply](https://cran.r-project.org/web/packages/heatmaply/vignettes/heatmaply.html) (introduced in E02 - Intro to R).
```{r}
p_heatmap <- plot_heatmap(
cp,
column_annotation = dplyr::select(pheno_data, `Sample Group` = sample_group, `Petri Dish` = petri_number, Confluence = confluence, Replicate = replicate),
row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group),
legend_title = "Cp",
show_column_names = FALSE
)
draw(p_heatmap, merge_legend = TRUE)
```
**Can you see any outlying samples (replicates)? If yes, should they be removed?**
# Selecting the reference genes
We will use housekeeping genes as reference genes.
Housekeeping genes are genes which have similar expression profile regardless the cell type or state.
That is why they can be used for sample normalization (as an internal control).
## Clustering
Housekeeping genes should cluster together. Let's try it using hierachical clustering:
```{r}
plot_hc(t(cp), color_by = gene_groups, color_by_lab = "Gene groups")
```
Also in PCA, we should see them separated from target genes:
```{r}
plot_pca(t(cp_plot), sample_data = genes_df, color_by = "gene_group")$plot
```
Not very sharp clustering, but PC1 separates the gene groups well.
## Correlation
Expression of housekeeping genes is/should be correlated.
```{r}
main_title <- "Raw data correlation"
t(cp) %>%
as.data.frame() %>%
GGally::ggpairs(progress = FALSE) +
theme_bw()
```
Let's look more closely at the correlation coefficients:
```{r}
(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
ggtitle(main_title)
```
You can also use the `heatmaply` package:
```{r}
plot_heatmaply(gene_cor, row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group), main = main_title, legend_title = "Correlation")
```
At the first sight, housekeeping (reference) genes are well correlated, with correlation coefficients over 0.9
Housekeeping genes should cluster together also by their correlation coefficients.
We can try both Euclidean and [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Pearson's_distance)
distance (1 - correlation coefficient) for hiearchical clustering.
```{r}
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")
```
PCA of correlation coefficients is also helpful:
```{r}
plot_pca(gene_cor, sample_data = genes_df, color_by = "gene_group")$plot
```
From both hiearchical clustering and PCA you can see that reference genes make a compact cluster, which is good.
## geNorm M-values `r emo::ji("hammer_and_wrench")`
> Function (`compute_m()`) to be implemented within the assignment.
Pseudocode can be found in documentation for this function in `age_library.R`.
Your function should output values similar to those below (or at least numerically very close to them).
Until now we were doing rather exploratory analysis of reference genes, but it's time to make a more rigorous decision.
M values are telling you how stable is a gene expressed across the samples.
From the original [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC126239/):
> "This measure relies on the principle that the expression ratio of two ideal internal control genes is identical in all samples,
regardless of the experimental condition or cell type. In this way, variation of the expression ratios of two real-life
housekeeping genes reflects the fact that one (or both) of the genes is (are) not constantly expressed,
with increasing variation in ratio corresponding to decreasing expression stability."
For a further explanation see the section "Gene-stability measure and ranking of selected housekeeping genes" in the paper above.
Hints for geNorm implementation:
- Remember that R is vectorized. In this way you can directly write most of equations as you see them on paper.
- You will most likely need the `apply()` function.
Let's compute the M values:
```{r}
purrr::map_dbl(ref_genes, ~ compute_m(., cp_plot[ref_genes, ])) %>%
set_names(ref_genes)
```
All reference genes seem to be OK. But compare them with target genes - they would not be good reference genes:
```{r}
purrr::map_dbl(target_genes, ~ compute_m(., cp_plot[target_genes, ])) %>%
set_names(target_genes)
```
***
# Normalization to reference genes: $\Delta C_P$
For each sample we calculate a geometric mean of reference genes.
Don't be confused, it's just [this](https://en.wikipedia.org/wiki/Geometric_mean#Relationship_with_logarithms)
formula applied to reference genes of each sample.
```{r}
norm <- apply(cp[ref_genes, ], 2, function(x) {
log(x) %>% mean() %>% exp()
})
head(norm)
```
Let's look how these geometric means of reference genes behave.
As you can see, they lie in the centre of reference genes, and that's what we want (and expect).
```{r}
gene_cor_norm <- rbind(cp, norm = norm) %>%
t() %>%
cor(use = "pairwise.complete.obs")
genes_df_norm <- genes_df %>%
dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
rbind(norm = c("norm", "norm"))
ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
ggtitle(main_title)
plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")
plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")
plot_pca(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot
```
Now we calculate $\Delta C_P$ values by subtracting the reference gene means from target genes.
First, we create a matrix of the same size as `cp`, with reference gene mean of each sample in columns.
```{r}
ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE)
ref_gene_cp_matrix[1:5, 1:5]
```
Now we are ready for subtraction, that is, to calculate $\Delta C_P = C_P^t - C_{P,mean}^r$:
```{r}
delta_cp <- cp - ref_gene_cp_matrix
delta_cp[1:5, 1:5]
```
Now we average the technical replicates.
We will use a handy function from `limma` package, which we will use later for microarray data analysis.
`avearrays()` is basically doing this: splits columns to groups, calculates within group average of each gene and concatenates the results to matrix or dataframe.
We split our samples by `sample_name` column in `pheno_data`. You can see that for each unique `sample_name`, there is either `replicate` 1 or 2.
```{r}
head(pheno_data)
delta_cp_avg <- limma::avearrays(delta_cp, pheno_data$sample_name)
delta_cp_avg[1:5, 1:5]
```
Just for control:
```{r}
(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
all() %>%
stopifnot()
```
Let's finalize the $\Delta C_P$ calculation.
We keep only target genes and again, create a separate matrix for plotting purposes.
We also add $\Delta C_P$ to our long data.
```{r}
delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))
```
Remove replicates from `pheno_data` and `cp_long`:
```{r}
pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
dplyr::select(sample_name, everything()) %>%
set_rownames(.$sample_name)
cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
dplyr::select(sample_name, everything())
```
And add computed $\Delta C_P$ to long data:
```{r}
cp_long <- as.data.frame(delta_cp_avg) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
dplyr::right_join(cp_long, by = c("sample_name", "gene"))
cp_long <- as.data.frame(delta_cp_avg_plot) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
dplyr::right_join(cp_long, by = c("sample_name", "gene"))
cp_long_target <- dplyr::filter(cp_long, gene_group == "target")
head(cp_long_target)
```
As you can see, the code above is quite repetitive - better would be to write a function for that `r emo::ji("slightly_smiling_face")`
***
# Exploratory analysis on summarised and normalised data
Now when our data are normalized, we can check for sample outliers, as these could mess up differential expression analysis.
```{r}
plot_pca(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name")$plot
for (variable in coloring_variables) {
plot_hc(
delta_cp_avg_plot,
color_by = pheno_data_avg[, variable],
method_distance = "euclidean",
method_clustering = "complete",
color_by_lab = variable
)
}
plot_heatmap(delta_cp_avg_plot, column_annotation = pheno_data_avg[, coloring_variables], title = "qPCR", legend_title = "deltaCp") %>%
draw(merge_legend = TRUE)
```
We can see that samples `c1_90`, `sp3_50` and `sp1_90` are a little bit suspicious.
Stricter people would consider their removal, but that is the last solution if final results are not satisfactory.
In biology, one has to always keep in mind its natural variation.
***
# Calculating ($log_2$) fold-changes: $-\Delta\Delta C_P$
Now we calculate the **relative** change in gene expression between sample groups.
For the sake of simplicity, we don't take into account other variables, e.g. confluence.
Also from the exploratory analysis you can see that these variables don't have much effect
on the separation of samples - sample group has the largest influence.
First, we calculate the $mean(C_P)$ of control samples:
```{r}
(control_means <- cp_long_target %>%
dplyr::filter(sample_group == "control") %>%
dplyr::group_by(gene) %>%
dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE))
)
```
Then we join this table with our long data and calculate:
- $\Delta \Delta C_P = \Delta C_{P, treatment} - \text{mean}(\Delta C_{P, control})$
- $log_2(\text{fold-change}) = -\Delta \Delta C_P$
- $\text{fold-change} = 2^{LFC}$
```{r}
cp_long_target <- cp_long_target %>%
dplyr::left_join(control_means, by = "gene") %>%
dplyr::mutate(
delta_delta_cp = delta_cp - delta_cp_control_mean,
lfc = -delta_delta_cp,
fc = 2^lfc
) %>%
dplyr::select(sample_name, gene, cp, cp_plot, delta_cp, delta_cp_plot, delta_cp_control_mean, delta_delta_cp, lfc, fc, everything())
head(cp_long_target)
```
$\text{mean}(\Delta \Delta C_{P, control})$ is now $0$, because we subtracted $\text{mean}(\Delta C_P, control)$
also from the control samples. Of course this also applies for $LFC$, which is just $-\Delta \Delta C_P$.
As you will see, this is useful for plotting (boxplots).
# Boxplots `r emo::ji("hammer_and_wrench")`
Boxplots are very informative when you want to see a difference between groups.
> Function to be implemented within the assignment.
It should work for any type of long data, as you will be able to specify
the grouping column to split boxplots on x-axe, and column with values to plot on y-axe.
Tip: you can use [ggpubr::ggboxplot()](https://rpkgs.datanovia.com/ggpubr/reference/ggboxplot.html)
First we will plot the $\Delta C_P$ values:
```{r}
plot_boxplots(
cp_long_target,
x = "sample_group",
y = "delta_cp",
facet_by = "gene",
color_by = "sample_group",
main = "delta Cp values",
y_lab = "delta Cp"
)
```
And this is tricky to interpret at the first sight, because in spirulina group, all genes are actually upregulated!
Also, you cannot directly see the $LFCs$ - in your head, you have to calculate the difference between spirulina and control
and flip the sign.
So better is to plot the $LFCs$. Why not the $\text{fold-changes}$?
Because $LFCs$ are easier to read, as changes (relative to controls) are multiplicative and symmetric:
<div id="div_lfc_table">
---------------------------- --------- --------- --------- ----------- --------- --------- ---------
$\text{fold-change}$ 0.125 0.25 0.5 1 2 4 8
$log_2(\text{fold-change})$ -3 -2 -1 0 1 2 3
You read: Change is... 8x less 4x less 2x less no change 2x more 4x more 8x more
---------------------------- --------- --------- --------- ----------- --------- --------- ---------
</div>
This figure summarises how $LFCs$ were calculated and why it is better for boxplots `r emo::ji("nerd_face")`
`)
Now you can clearly see that spirulina group have upregulated genes:
```{r}
plot_boxplots(
cp_long_target,
x = "sample_group",
y = "lfc",
facet_by = "gene",
color_by = "sample_group",
main = "log2 fold-changes",
y_lab = "log2 fold-change"
)
```
We can also split plots by confluence:
```{r}
plot_boxplots(
cp_long_target,
x = "sample_group",
y = "lfc",
facet_by = "gene",
color_by = "confluence",
main = "log2 fold-changes",
y_lab = "log2 fold-change"
)
```
Alternatively to boxplots, we can make barplots, which are quite popular in papers `r emo::ji("slightly_smiling_face")`
```{r}
ggpubr::ggbarplot(
cp_long_target,
x = "sample_group",
y = "fc",
add = c("mean", "jitter"),
fill = "sample_group",
facet.by = "gene"
) +
scale_y_continuous("Relative concentration", labels = scales::percent) +
labs(x = NULL, fill = "Sample Group")
```
Or we can plot a heatmap of LFC values or their z-scores:
```{r}
lfc_wide <- cp_long_target %>%
dplyr::select(sample_name, gene, lfc) %>%
pivot_wider(names_from = sample_name, values_from = lfc) %>%
column_to_rownames(var = "gene")
lfc_wide[, 1:5]
plot_heatmap(
lfc_wide,
column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
title = "LFC between spirulina and control",
legend_title = "LFC",
cluster_rows = FALSE,
cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)
plot_heatmap(
lfc_wide,
z_score = TRUE,
column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
title = "LFC between spirulina and control",
legend_title = "LFC z-score",
cluster_rows = FALSE,
cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)
```
# t-test `r emo::ji("hammer_and_wrench")`
> Functions to be implemented within the assignment. You should implement both `test_gene()` and `test_gene_table()`.
If you do it smart, you can just call the former from the latter.
Hint: look at the `formula` parameter of `t.test()` function.
t-test is testing whether two means, coming from samples having Student's distribution, significantly differ.
t-test assumes that the input $C_P$ values are normally distributed and the variance between conditions is comparable.
Wilcoxon test can be used when sample size is small and those two last assumptions are hard to achieve.
As we will use two-sided alternative (i.e. gene is up- or downregulated = deregulated),
it doesn't matter which one of $\Delta C_P$, $\Delta \Delta C_P$ or $LFC$ we will use.
In all cases, the absolute difference between group means remains the same:
```{r}
group_means <- dplyr::filter(cp_long_target, gene == "CD24") %>%
dplyr::group_by(sample_group) %>%
dplyr::summarise(
delta_cp_mean = mean(delta_cp),
delta_delta_cp_mean = mean(delta_delta_cp),
lfc_mean = mean(lfc)
) %>%
as.data.frame() %>%
tibble::column_to_rownames("sample_group")
group_means
apply(group_means, MARGIN = 2, FUN = function(x) abs(x[1] - x[2]))
```
```{r, message = TRUE}
for (g in target_genes) {
test_gene(
gene = g,
gene_data = cp_long_target,
gene_col = "gene",
value_col = "lfc",
group_col = "sample_group",
test = t.test,
verbose = TRUE
)
}
```
But better is to return a table for all genes:
```{r}
test_table <- test_gene_table(
gene_data = cp_long_target,
gene_col = "gene",
value_col = "delta_cp",
group_col = "sample_group"
)
test_table
```
***
# Cleanup
```{r, warning = TRUE, message = TRUE}
save(list = ls(all.names = TRUE), file = here(BASE_DIR, "qPCR.RData"), envir = environment())
warnings()
traceback()
sessioninfo::session_info()
```
***
***
# HTML rendering
This chunk is not evaluated (`eval = FALSE`). Otherwise you will probably end up in recursive hell `r emo::ji("exploding_head")`
```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(conflicted)
library(knitr)
library(glue)
library(here)
if (!require(rmdformats)) {
BiocManager::install("rmdformats")
}
# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = TRUE)
rmarkdown::render(
here("E03-qPCR/qPCR.Rmd"),
output_file = here("E03-qPCR/qPCR.html"),
envir = new.env(),
knit_root_dir = here()
)
```