Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate main package vignette #2

Merged
merged 4 commits into from
Apr 19, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
172 changes: 170 additions & 2 deletions vignettes/MotifStats.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -7,13 +7,181 @@ vignette: >
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
```{r include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

```{r setup}

## Introduction

`MotifStats` is a simple R package to calculate the the metrics to quantify the
relationship between peaks and motifs. It is based on [Analysis of Motif
Enrichment (AME)](https://meme-suite.org/meme/doc/ame.html) and [Find Individual
Motif Occurrences (FIMO)](https://meme-suite.org/meme/doc/fimo.html) from the
[MEME suite](https://meme-suite.org/meme/index.html).

<br>
It has two distinct functions:

1. Calculate motif enrichment motif enrichment relative to a set of background
sequences using AME.
2. Calculate the distance between each motif and its nearest peak summit, where
each motif is identified using FIMO.

## Data
The `MotifStats` package comes with motif and peak data for transcription
factors CTCF and CREB1. Details of the files are as follows:

- **CREB1 TIP-seq peaks (narrowPeaks)**[^f1]
-- *Data not publicly available*

- **CTCF TIP-seq peaks (narrowPeaks)**[^f1]
-- Retrieved from NCBI GEO under
accession
[GSE188512](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE188512)

- **CREB motif**
-- Retrieved from JASPAR2024 under matrix ID
[MA0018.5](http://jaspar.genereg.net/matrix/MA0018.5/)

- **CTCF motif**
-- Retrieved from JASPAR2024 under matrix ID
[MA1930.2](http://jaspar.genereg.net/matrix/MA1930.2/)


## Installation
`MotifMarker` relies on [MEME suite](https://meme-suite.org/meme/index.html) as
a system dependency. Directions for installation can be found [here](https://www.bioconductor.org/packages/release/bioc/vignettes/memes/inst/doc/install_guide.html).
<br>
To install the package, use the following command:
```{r eval = FALSE}
if(!require("remotes")) install.packages("remotes")
remotes::install_github("neurogenomics/MotifStats")
```


## Usage
In this example analysis, we will compare the enrichment of the CREB motif in
CTCF TIP-seq peaks relative to the background. We will also calculate the
distance between the centre of each CTCF motif and its nearest peak summit.

### Load packages
Load the installed package.
```{r include = FALSE}
library(MotifStats)
```

For this example, we will also load the `BSgenome.Hsapiens.UCSC.hg38` package to
provide the genome build our peaks have been derived from.
```{r include = FALSE}
library(BSgenome.Hsapiens.UCSC.hg38)
```

### Prepare input data
First, we need to load the motif and peak data. The motif data obtained from
JASPAR can be loaded as follows:
```{r eval = FALSE}
name_motif <- read_motif_file(
"/path/to/motfif/file.jaspar",
motif_id = "id_of_motif",
file_format = "jaspar"
)
```

Next, we load the narrowPeak files from MACS2/3.
```{r eval = FALSE}
name_peaks <- read_peak_file("/path/to/peak/file.narrowPeak")
```

For this example, we will load the built-in CREB1 TIP-seq peaks (as `GRanges`
object) and CREB motif (as `PWMatrix` object) data.
```{r include = TRUE}
data("ctcf_motif")
data("ctcf_peaks")
```

### Calculate motif enrichment
To calculate the motif relative to a set of background sequences, we use
`peak_proportion()`.

- An additional `out_dir` argument can be used to specify the
directory to save the AME output files[^f2].

```{r include = TRUE}
ctcf_read_prop <- peak_proportion(
peak_input = ctcf_peaks,
motif = ctcf_motif,
genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
)
```

`ctcf_read_prob` is a list with two components:
1. `$tp` (True positives) with the number of true positive motif occurrences in
the given sequence followed by its relative percentage.
1. `$fp` (False positives) with the number of false positive motif occurrences
in the given sequence followed by its relative percentage.


### Find motif-summit distances
To calculate the distance between each motif and its nearest peak summit, we use
`summit_to_motif()`.

- `fp_rate` argument specifies the desired false-positive rate for FIMO. A
p-value is calculated using the formula:
$$p = \frac{\text{fp_rate}}{2 * \text{promoter_length}}$$
- An additional `out_dir` argument can be used to specify the
directory to save the 0-order background file.

```{r include = TRUE}
ctcf_read_sum_motif <- summit_to_motif(
peak_input = ctcf_peaks,
motif = ctcf_motif,
fp_rate = 0.05,
genome_build = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
)
```

`ctcf_read_sum_motif` is a list with two objects:
1. `peak_set` with peak information.
2. `distance_to_summit` with distances between the centre of each motif and its
nearest peak summit.


#### Visualize results
We can optionally visualize the distribution of distances by using `density_plot()`.
```{r include = TRUE, fig.width = 7, fig.height = 4}
density_plot(
ctcf_read_sum_motif$distance_to_summit,
plot_title = "CTCF motif distance to summit",
x_label = "Distance to summit (bp)",
y_label = "Density"
)
```


> **NOTE:** Since AME and FIMO accept different parameters and are calculated
independently, it is not possible to obtain directly comparable results.


## Future Enchancements
- Calculate metrics for more than one motif at a time.


## Session Info

<details>

```{r echo = FALSE}
sessionInfo()
```

</details>

<!-- Footnotes -->
[^f1]: The peak file is a subset of chromosome 19 to reduce the size of the
built-in data.
[^f2]: For more information on the output files, refer to the
[AME documentation](https://meme-suite.org/meme/doc/ame.html).
Loading