Exercises/E03-qPCR/qPCR.Rmd

---
title: "E03 - qPCR (Analysis of Gene Expression @ UCT Prague)"
author:
  - Jiri Novotny jiri.novotny@img.cas.cz
  - Studuj bioinformatiku! http://studuj.bioinformatiku.cz
institute: "Laboratory of Genomics and Bioinformatics @ Institute of Molecular Genetics of the ASCR"
output:
  rmdformats::readthedown:
    highlight: "kate"
    lightbox: true
    thumbnails: true
    gallery: true
    toc_depth: 4
    self_contained: true
    number_sections: false
    toc_collapsed: false
    df_print: "paged"
date: "`r Sys.Date()`"
---

```{r, child = here::here("_assets/custom.Rmd"), eval = TRUE}
```

```{js, echo = FALSE}
$(document).ready(function() {
  $("#div_lfc_table table").addClass("table table-condensed");
});
```

***

Copy, please, these files and directories to your personal directory:

```{bash, eval = FALSE}
cp -r ~/shared/AGE_current/Exercises/{E03-qPCR,age_library.R} ~/AGE/Exercises
```

**Assignment**

The assignment for this exercise is to implement (mainly visualization) functions, which you will use later on different types of gene expression data.
You will implement these functions in `age_library.R` where are empty function definitions.
You don't have to stick to all predefined parameters and returned variables, just keep in mind that your implementations should make outputs similar to outputs in this document.
Also, in several cases there are alternative functions (such as `plot_pca()` and `plot_pca_ggpairs()`), but you don't have to implement them all, if not said.

Sections containing functions to be implemented are marked with `r emo::ji("hammer_and_wrench")`
Anyway, here is the minimum set of functions you need to implement (arguments are up to you, it's about the output):

- `plot_hc(m, color_by)`:
  Calculates hiearchical clustering of matrix `m` columns and plot a dendrogram, whose labels are colored according to `color_by` vector.
- `plot_pca(m, sample_data, color_by)`:
  Calculates PCA of matrix `m` and makes a scatterplot of first two principal compoments with points colored by `color_by` column in `sample_data`.
- `plot_heatmap(m, column_annotation, z_score = FALSE)`:
  Makes a heatmap of matrix `m` with column annotation specified in `column_annotation` dataframe.
  Optionally, converts values in `m` to z-score (row-wise).
- `plot_boxplots(plot_data, x, y)`:
  Using long `plot_data`, plots boxplots of values in `y` column, splitted to groups defined in `x` column.
- `compute_m(gene, cp)`:
  Computes the M value of $C_P$ values (geNorm algorithm).
  Pseudocode can be found in documentation for this function in `age_library.R`.
- `test_gene(gene, gene_data, gene_col, value_col, group_col)`:
  Using long `gene_data`, calculates between group statistical test (e.g. t-test) of `gene`
  defined in `gene_col` column with values in `value_col` column, and groups in `group_col` column.
- `test_gene_table(gene_data, gene_col, value_col, group_col)`:
  The same as `test_gene()`, but calculates tests for all genes in `gene_col` column and returns a dataframe of results.

**After you implement these functions, modify their calling in this document and render it again.**
Obviously, you need to remove sections with functions you haven't implemented.

Please, return the assignment in this form: 

- In the header, put your name and e-mail to `author` parameter (under `params`).
- Copy all finished files to directory named `E03-qPCR_assignment_<your_name>`: modified Rmd, rendered HTML, `age_library.R` with your implementations.
  Put this directory to ZIP archive named `E03-qPCR_assignment_<your_name>.zip`.
  - In rendered HTML there should be included all code chunks with full output (warnings, messages, etc.).
- Upload the ZIP file to the appropriate assignment in MS Teams.

> Note that your functions will be universal, serving not only for qPCR data.
  This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.),
  where input (expression matrix) is the same also for microarray and RNA-seq data.

***

# Introduction

[Quantitative real-time PCR](https://en.wikipedia.org/wiki/Real-time_polymerase_chain_reaction) monitors the amplification
of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.

![qPCR measurement](`r here("E03-qPCR/_rmd_images/qpcr_curves.png")`)

In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules.
This amplification depends on initial RNA concentration and approximates the gene expression.
We select a fluorescence signal threshold $S_T$ and measure crossing points $C_P$ (also called threshold cycle $C_T$) for samples.
To correct for initial RNA concentration and quality, reference genes are used (dashed curves).
These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline.
First, we normalize target genes in each sample to reference gene (or they geometric mean) by calculating $\Delta C_P = C_P^t - C_P^r$.
Obtained $\Delta C_P$ values are then used to calculate the <ins>relative change</ins> of gene expression between samples.
In this case, we are interested how much more or less is gene expressed in $\text{treatment}$ sample ($\text{trt}$), relative to $\text{control}$ ($\text{cnt}$) sample.
This relative change is called $\text{fold-change}$: $FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}$.
We usually report $FC$ in $log_2$ scale: $LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC)$
Important is to realize that the higher $C_P$ is, the **lower** initial mRNA concentration was.
This is why we put minus /$-$/ in front of $\Delta\Delta C_P$.
We usually don't consider the efficiency of PCR and expect the mRNA concentration to double in each cycle.
Otherwise $FC = (1+E)^{-\Delta\Delta C_P}$, where $E$ is the PCR efficiency.

# Our experiment

We will work with data from a pancreatic tumor experiment where
an influence of [spirulina](https://en.wikipedia.org/wiki/Spirulina_(dietary_supplement)) algae was investigated.
In this experiment there are two sample groups:

- **control**: samples from tumor.
- **spirulina**: samples from tumor treated by extract from spirulina.

Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh, a nested design, right...

![](https://i.giphy.com/media/LnKa2WLkd6eAM/giphy.webp)

But we can expand it to:

```
- control group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample C1_C50_R1
  |   |   |-- technical replicate #2 -> sample C2_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample C1_C90_R1
  |       |-- technical replicate #2 -> sample C1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample C2_C50_R1
      |   |-- technical replicate #2 -> sample C2_C50_R2
      |-- confluence 90
          |-- ...

- spirulina group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample S1_C50_R1
  |   |   |-- technical replicate #2 -> sample S1_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample S1_C90_R1
  |       |-- technical replicate #2 -> sample S1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample S2_C50_R1
      |   |-- technical replicate #2 -> sample S2_C50_R2
      |-- confluence 90
          |-- ...
```

Also, there are two groups of measured genes:

1. **Target genes** - interesting genes found prior by microarray exploratory analysis.
2. **Housekeeping genes** - used as reference genes.

# Libraries

```{r}
library(here)
library(tidyverse)
library(dplyr)
library(glue)
library(magrittr)
```

Now we source your personal library:

```{r}
source(here("age_library.R"))
```

# Config

I would recommend you to always define constants at the beginning of a script.
It is more transparent, and if you, for example, change some path, you haven't to replace it in whole source code.
Constants are, by convention, named in UPPERCASE.

```{r}
BASE_DIR <- here("E03-qPCR")
CP_DATA <- here(BASE_DIR, "data/qPCR.Rds")
```

# Data preprocessing

## $C_P$ matrix

We have already preprocessed a $C_P$ matrix for you.
But if you really want to start from the floor, you can create the $C_P$ matrix from original files
([here](data/2015-06-08_Strnad.txt) and [here](data/2015-06-08_Strnad.xls)).
Not very pretty, right? `r emo::ji("face_with_hand_over_mouth")`

In $C_P$ matrix, columns are samples, rows are genes, and values are $C_P$ values.
This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-seq.

```{r}
cp <- readRDS(CP_DATA) %>%
  as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)

cp[1:5, 1:5]
```

CP values `>= 40` are considered out of qPCR limit and so we set them to `NA`:

```{r}
cp[cp > 39.9] <- NA
```

Some genes have so many `NA` values (respectively, they don't have many values other than `NA`):

```{r}
lq_genes <- rowSums(!is.na(cp))
lq_genes
```

Let's remove genes with number of non-NA values less or equal to 2:

```{r}
lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]
```

Now we split genes to target and reference ones.
The latter have `H<number>` on the end of their names, which we can match with regular expression (regex) `H\d+$`.

```{r}
ref_genes <- str_subset(genes, "H\\d+$")
target_genes <- setdiff(genes, ref_genes)
gene_groups <- if_else(genes %in% ref_genes, "reference", "target")
genes_df <- data.frame(
  gene = as.character(genes),
  gene_group = factor(gene_groups, levels = c("reference", "target")),
  
  row.names = genes,
  stringsAsFactors = FALSE
)
genes_df
```

For plotting purposes, we create a new dataframe with `NA`s replaced by `40`:

```{r}
cp_plot <- replace(cp, is.na(cp), 40)
```

## Phenotypical data (sample sheet)

Along with measured data you usually obtain a sample sheet, but in this case we are going to parse the column names of the $C_P$ matrix.
Everything needed is contained there:

- K and SP: sample groups
- Number behind sample group: Petri dish (= biological replicate)
- Number behind slash ("/"): confluence
- (2RT): second technical replicate

`str_match()` is returning a character matrix of capturing groups (columns) for each string (rows).
The first column is always the full match.

```{r}
(samples <- colnames(cp))

pheno_data <- data.frame(
  sample_id = samples,
  # Match with a single capturing group (full match): "SP" or "K" on the beginning.
  sample_group_code = str_match(samples, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
  # Match a sample group on the beginning and then the number behind (second group).
  petri_number = str_match(samples, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
  # Match a two-digit number behind "/".
  confluence = str_match(samples, "\\/(\\d{2})")[, 2] %>% as.factor(),
  # Detect a second technical replicate. See how literal match of parentheses ("(" and ")") must be written:
  # first you have to escape backslash in R as "\\" to escape parenthese for regex as "\\(", which will match the character "(" literally,
  # e.g. not specifying the regex capturing group.
  replicate = if_else(str_detect(samples, "\\(2RT\\)"), 2L, 1L) %>% as.factor()
) %>%
  dplyr::mutate(
    sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
    sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
    sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
  ) %>%
  dplyr::select(sample_name_rep, sample_name, everything()) %>%
  set_rownames(.$sample_name_rep)

colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)

head(pheno_data)
```

If you are not sure what the regexes above mean, refer to the `stringr / Regular expressions` section in E02 - Intro to R.

## Long data

We also create long data from `cp`, `pheno_data` and `genes_df`:

```{r}
cp_long <- as.data.frame(cp) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
  dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
  dplyr::left_join(genes_df, by = "gene") %>%
  dplyr::mutate(
    cp_plot = if_else(is.na(cp), 40, cp)
  ) %>%
  dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())

head(cp_long)
```

# Exploratory analysis on raw data

We are using exploratory analysis to have an unbiased first look at data.
It is also a crucial step to assess the biological quality control,
e.g. whether samples from distinct groups create separate clusters -
if not, samples might be swapped etc.

## Hiearchical clustering `r emo::ji("hammer_and_wrench")`

> Function to be implemented within the assignment.
  It takes an expression matrix and returns a dendrogram with coloring by a chosen variable.
  I recommend you to use the [dendextend](https://cran.r-project.org/web/packages/dendextend/index.html) package
  to modify a dendrogram, it is much easier than doing so in base R.

```{r}
coloring_variables <- c("sample_group", "confluence", "petri_number")

for (variable in coloring_variables) {
  plot_hc(
    cp_plot, color_by = pheno_data[, variable],
    method_distance = "euclidean", method_clustering = "complete",
    color_by_lab = variable
  )
}
```

We can see the main effect - with several exceptions, samples are clustered by sample groups, which is good.
There isn't any effect of other variables, which would be confounding.

## PCA `r emo::ji("hammer_and_wrench")`

> Function to be implemented within the assignment.
  It will perform PCA visualization with point coloring by a chosen variable.
  You can also output plots of first N principal components and variance explained bar plot.

We can see that the first principal component (PC1) is nicely separating our samples by sample group:

```{r}
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plot
plot_pca(cp_plot, pheno_data, plot_type = "multi", color_by = "sample_group", shape_by = "confluence")$plot
```

```{r, height = 8}
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plot
```

```{r}
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot
```

- You can also use `GGally::ggpairs()` to plot a grid of PCA plots (PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3, etc.):

```{r}
plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plot
```

## Heatmap `r emo::ji("hammer_and_wrench")`

> Function to be implemented within the assignment.
  `ComplexHeatmap` package is preferred. For interactive heatmaps you can use
  [heatmaply](https://cran.r-project.org/web/packages/heatmaply/vignettes/heatmaply.html) (introduced in E02 - Intro to R).

```{r}
p_heatmap <- plot_heatmap(
  cp,
  column_annotation = dplyr::select(pheno_data, `Sample Group` = sample_group, `Petri Dish` = petri_number, Confluence = confluence, Replicate = replicate),
  row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group),
  legend_title = "Cp",
  show_column_names = FALSE
)
draw(p_heatmap, merge_legend = TRUE)
```

**Can you see any outlying samples (replicates)? If yes, should they be removed?**

# Selecting the reference genes

We will use housekeeping genes as reference genes.
Housekeeping genes are genes which have similar expression profile regardless the cell type or state.
That is why they can be used for sample normalization (as an internal control).

## Clustering

Housekeeping genes should cluster together. Let's try it using hierachical clustering:

```{r}
plot_hc(t(cp), color_by = gene_groups, color_by_lab = "Gene groups")
```

Also in PCA, we should see them separated from target genes:

```{r}
plot_pca(t(cp_plot), sample_data = genes_df, color_by = "gene_group")$plot
```

Not very sharp clustering, but PC1 separates the gene groups well.

## Correlation

Expression of housekeeping genes is/should be correlated.

```{r}
main_title <- "Raw data correlation"

t(cp) %>%
  as.data.frame() %>%
  GGally::ggpairs(progress = FALSE) +
  theme_bw()
```

Let's look more closely at the correlation coefficients:

```{r}
(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
  ggtitle(main_title)
```

You can also use the `heatmaply` package:

```{r}
plot_heatmaply(gene_cor, row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group), main = main_title, legend_title = "Correlation")
```

At the first sight, housekeeping (reference) genes are well correlated, with correlation coefficients over 0.9

Housekeeping genes should cluster together also by their correlation coefficients.
We can try both Euclidean and [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Pearson's_distance)
distance (1 - correlation coefficient) for hiearchical clustering.

```{r}
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")
```

PCA of correlation coefficients is also helpful:

```{r}
plot_pca(gene_cor, sample_data = genes_df, color_by = "gene_group")$plot
```

From both hiearchical clustering and PCA you can see that reference genes make a compact cluster, which is good.

## geNorm M-values `r emo::ji("hammer_and_wrench")`

> Function (`compute_m()`) to be implemented within the assignment.
  Pseudocode can be found in documentation for this function in `age_library.R`.
  Your function should output values similar to those below (or at least numerically very close to them).

Until now we were doing rather exploratory analysis of reference genes, but it's time to make a more rigorous decision.
M values are telling you how stable is a gene expressed across the samples.
From the original [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC126239/):

> "This measure relies on the principle that the expression ratio of two ideal internal control genes is identical in all samples,
  regardless of the experimental condition or cell type. In this way, variation of the expression ratios of two real-life
  housekeeping genes reflects the fact that one (or both) of the genes is (are) not constantly expressed,
  with increasing variation in ratio corresponding to decreasing expression stability."

For a further explanation see the section "Gene-stability measure and ranking of selected housekeeping genes" in the paper above.

Hints for geNorm implementation:

- Remember that R is vectorized. In this way you can directly write most of equations as you see them on paper.
- You will most likely need the `apply()` function.

Let's compute the M values:

```{r}
purrr::map_dbl(ref_genes, ~ compute_m(., cp_plot[ref_genes, ])) %>%
  set_names(ref_genes)
```

All reference genes seem to be OK. But compare them with target genes - they would not be good reference genes:

```{r}
purrr::map_dbl(target_genes, ~ compute_m(., cp_plot[target_genes, ])) %>%
  set_names(target_genes)
```

***

# Normalization to reference genes: $\Delta C_P$

For each sample we calculate a geometric mean of reference genes.
Don't be confused, it's just [this](https://en.wikipedia.org/wiki/Geometric_mean#Relationship_with_logarithms)
formula applied to reference genes of each sample.

```{r}
norm <- apply(cp[ref_genes, ], 2, function(x) {
  log(x) %>% mean() %>% exp()
})

head(norm)
```

Let's look how these geometric means of reference genes behave.
As you can see, they lie in the centre of reference genes, and that's what we want (and expect).

```{r}
gene_cor_norm <- rbind(cp, norm = norm) %>%
  t() %>%
  cor(use = "pairwise.complete.obs")

genes_df_norm <- genes_df %>%
  dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
  rbind(norm = c("norm", "norm"))

ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
  ggtitle(main_title)

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")
plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")

plot_pca(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot
```

Now we calculate $\Delta C_P$ values by subtracting the reference gene means from target genes.
First, we create a matrix of the same size as `cp`, with reference gene mean of each sample in columns.

```{r}
ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE)
ref_gene_cp_matrix[1:5, 1:5]
```

Now we are ready for subtraction, that is, to calculate $\Delta C_P = C_P^t - C_{P,mean}^r$:

```{r}
delta_cp <- cp - ref_gene_cp_matrix
delta_cp[1:5, 1:5]
```

Now we average the technical replicates.
We will use a handy function from `limma` package, which we will use later for microarray data analysis.
`avearrays()` is basically doing this: splits columns to groups, calculates within group average of each gene and concatenates the results to matrix or dataframe.
We split our samples by `sample_name` column in `pheno_data`. You can see that for each unique `sample_name`, there is either `replicate` 1 or 2.

```{r}
head(pheno_data)
delta_cp_avg <- limma::avearrays(delta_cp, pheno_data$sample_name)
delta_cp_avg[1:5, 1:5]
```

Just for control:

```{r}
(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
  all() %>%
  stopifnot()
```

Let's finalize the $\Delta C_P$ calculation.
We keep only target genes and again, create a separate matrix for plotting purposes.
We also add $\Delta C_P$ to our long data.

```{r}
delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))
```

Remove replicates from `pheno_data` and `cp_long`:

```{r}
pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything()) %>%
  set_rownames(.$sample_name)

cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything())
```

And add computed $\Delta C_P$ to long data:

```{r}
cp_long <- as.data.frame(delta_cp_avg) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

cp_long <- as.data.frame(delta_cp_avg_plot) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

cp_long_target <- dplyr::filter(cp_long, gene_group == "target")

head(cp_long_target)
```

As you can see, the code above is quite repetitive - better would be to write a function for that `r emo::ji("slightly_smiling_face")`

***

# Exploratory analysis on summarised and normalised data

Now when our data are normalized, we can check for sample outliers, as these could mess up differential expression analysis.

```{r}
plot_pca(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name")$plot

for (variable in coloring_variables) {
  plot_hc(
    delta_cp_avg_plot,
    color_by = pheno_data_avg[, variable],
    method_distance = "euclidean",
    method_clustering = "complete",
    color_by_lab = variable
  )
}

plot_heatmap(delta_cp_avg_plot, column_annotation = pheno_data_avg[, coloring_variables], title = "qPCR", legend_title = "deltaCp") %>%
  draw(merge_legend = TRUE)
```

We can see that samples `c1_90`, `sp3_50` and `sp1_90` are a little bit suspicious.
Stricter people would consider their removal, but that is the last solution if final results are not satisfactory.
In biology, one has to always keep in mind its natural variation.

***

# Calculating ($log_2$) fold-changes: $-\Delta\Delta C_P$

Now we calculate the **relative** change in gene expression between sample groups.
For the sake of simplicity, we don't take into account other variables, e.g. confluence.
Also from the exploratory analysis you can see that these variables don't have much effect
on the separation of samples - sample group has the largest influence.

First, we calculate the $mean(C_P)$ of control samples:

```{r}
(control_means <- cp_long_target %>%
  dplyr::filter(sample_group == "control") %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE))
)
```

Then we join this table with our long data and calculate:

- $\Delta \Delta C_P = \Delta C_{P, treatment} - \text{mean}(\Delta C_{P, control})$
- $log_2(\text{fold-change}) = -\Delta \Delta C_P$
- $\text{fold-change} = 2^{LFC}$

```{r}
cp_long_target <- cp_long_target %>%
  dplyr::left_join(control_means, by = "gene") %>%
  dplyr::mutate(
    delta_delta_cp = delta_cp - delta_cp_control_mean,
    lfc = -delta_delta_cp,
    fc = 2^lfc
  ) %>%
  dplyr::select(sample_name, gene, cp, cp_plot, delta_cp, delta_cp_plot, delta_cp_control_mean, delta_delta_cp, lfc, fc, everything())

head(cp_long_target)
```

$\text{mean}(\Delta \Delta C_{P, control})$ is now $0$, because we subtracted $\text{mean}(\Delta C_P, control)$
also from the control samples. Of course this also applies for $LFC$, which is just $-\Delta \Delta C_P$.
As you will see, this is useful for plotting (boxplots).

# Boxplots `r emo::ji("hammer_and_wrench")`

Boxplots are very informative when you want to see a difference between groups.

> Function to be implemented within the assignment.
  It should work for any type of long data, as you will be able to specify
  the grouping column to split boxplots on x-axe, and column with values to plot on y-axe.
  Tip: you can use [ggpubr::ggboxplot()](https://rpkgs.datanovia.com/ggpubr/reference/ggboxplot.html)

First we will plot the $\Delta C_P$ values:

```{r}
plot_boxplots(
    cp_long_target,
    x = "sample_group",
    y = "delta_cp",
    facet_by = "gene",
    color_by = "sample_group",
    main = "delta Cp values",
    y_lab = "delta Cp"
)
```

And this is tricky to interpret at the first sight, because in spirulina group, all genes are actually upregulated!
Also, you cannot directly see the $LFCs$ - in your head, you have to calculate the difference between spirulina and control
and flip the sign.

So better is to plot the $LFCs$. Why not the $\text{fold-changes}$?
Because $LFCs$ are easier to read, as changes (relative to controls) are multiplicative and symmetric:

<div id="div_lfc_table">

----------------------------  ---------  ---------  ---------  -----------  ---------  ---------  ---------
$\text{fold-change}$           0.125      0.25       0.5        1            2          4          8
$log_2(\text{fold-change})$    -3         -2         -1         0            1          2          3
You read: Change is...         8x less    4x less    2x less    no change    2x more    4x more    8x more  
----------------------------  ---------  ---------  ---------  -----------  ---------  ---------  ---------

</div>

This figure summarises how $LFCs$ were calculated and why it is better for boxplots `r emo::ji("nerd_face")`

![](`r here("E03-qPCR/_rmd_images/boxplot_lfc.png")`)

Now you can clearly see that spirulina group have upregulated genes:

```{r}
plot_boxplots(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    facet_by = "gene",
    color_by = "sample_group",
    main = "log2 fold-changes",
    y_lab = "log2 fold-change"
)
```

We can also split plots by confluence:

```{r}
plot_boxplots(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    facet_by = "gene",
    color_by = "confluence",
    main = "log2 fold-changes",
    y_lab = "log2 fold-change"
)
```

Alternatively to boxplots, we can make barplots, which are quite popular in papers `r emo::ji("slightly_smiling_face")`

```{r}
ggpubr::ggbarplot(
  cp_long_target,
  x = "sample_group",
  y = "fc",
  add = c("mean", "jitter"),
  fill = "sample_group",
  facet.by = "gene"
) +
  scale_y_continuous("Relative concentration", labels = scales::percent) +
  labs(x = NULL, fill = "Sample Group")
```

Or we can plot a heatmap of LFC values or their z-scores:

```{r}
lfc_wide <- cp_long_target %>%
  dplyr::select(sample_name, gene, lfc) %>%
  pivot_wider(names_from = sample_name, values_from = lfc) %>%
  column_to_rownames(var = "gene")
lfc_wide[, 1:5]

plot_heatmap(
  lfc_wide,
  column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
  title = "LFC between spirulina and control",
  legend_title = "LFC",
  cluster_rows = FALSE,
  cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)

plot_heatmap(
  lfc_wide,
  z_score = TRUE,
  column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
  title = "LFC between spirulina and control",
  legend_title = "LFC z-score",
  cluster_rows = FALSE,
  cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)
```

# t-test `r emo::ji("hammer_and_wrench")`

> Functions to be implemented within the assignment. You should implement both `test_gene()` and `test_gene_table()`.
  If you do it smart, you can just call the former from the latter.
  Hint: look at the `formula` parameter of `t.test()` function.

t-test is testing whether two means, coming from samples having Student's distribution, significantly differ.
t-test assumes that the input $C_P$ values are normally distributed and the variance between conditions is comparable.
Wilcoxon test can be used when sample size is small and those two last assumptions are hard to achieve.

As we will use two-sided alternative (i.e. gene is up- or downregulated = deregulated),
it doesn't matter which one of $\Delta C_P$, $\Delta \Delta C_P$ or $LFC$ we will use.
In all cases, the absolute difference between group means remains the same:

```{r}
group_means <- dplyr::filter(cp_long_target, gene == "CD24") %>%
  dplyr::group_by(sample_group) %>%
  dplyr::summarise(
    delta_cp_mean = mean(delta_cp),
    delta_delta_cp_mean = mean(delta_delta_cp),
    lfc_mean = mean(lfc)
  ) %>%
  as.data.frame() %>%
  tibble::column_to_rownames("sample_group")

group_means
apply(group_means, MARGIN = 2, FUN = function(x) abs(x[1] - x[2]))
```

```{r, message = TRUE}
for (g in target_genes) {
  test_gene(
    gene = g,
    gene_data = cp_long_target,
    gene_col = "gene",
    value_col = "lfc",
    group_col = "sample_group",
    test = t.test,
    verbose = TRUE
  )
}
```

But better is to return a table for all genes:

```{r}
test_table <- test_gene_table(
  gene_data = cp_long_target,
  gene_col = "gene",
  value_col = "delta_cp",
  group_col = "sample_group"
)
test_table
```

***

# Cleanup

```{r, warning = TRUE, message = TRUE}
save(list = ls(all.names = TRUE), file = here(BASE_DIR, "qPCR.RData"), envir = environment())

warnings()
traceback()
sessioninfo::session_info()
```

***
***

# HTML rendering

This chunk is not evaluated (`eval = FALSE`). Otherwise you will probably end up in recursive hell `r emo::ji("exploding_head")`

```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(conflicted)
library(knitr)
library(glue)
library(here)

if (!require(rmdformats)) {
  BiocManager::install("rmdformats")
}

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = TRUE)
rmarkdown::render(
  here("E03-qPCR/qPCR.Rmd"),
  output_file = here("E03-qPCR/qPCR.html"),
  envir = new.env(),
  knit_root_dir = here()
)
```