This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

Modules

Gabriele Girelli edited this page Jan 10, 2018 · 4 revisions

1. Quality Control

The Quality Control (QC) module runs fastqc on the input fastq files. A quality report (in html format) is generated in the aux output subfolder. More details on how to interpret the report are available here.

2. File Generation

The file generation module uncompresses the input fastq files and generates both fastq and fasta files, each in the normal format and in a format with one fragment per line.
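
As a rough illustration of the conversions described above, here is a minimal Python sketch. It is not the module's actual code: the function names are hypothetical, and the tab-separated layout used for the "one fragment per line" output is an assumption, since the page does not specify it.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (name, sequence, quality)


def parse_fastq(lines: List[str]) -> Iterator[Record]:
    """Yield (name, sequence, quality) from standard 4-line fastq records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (l.rstrip("\n") for l in lines[i:i + 4])
        yield header[1:], seq, qual  # drop the leading '@'


def to_fasta(records: Iterable[Record]) -> str:
    """Normal fasta: a '>' header line followed by the sequence line."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in records)


def to_one_per_line(records: Iterable[Record]) -> str:
    """One fragment per line (assumed layout: name, sequence, quality)."""
    return "".join(f"{name}\t{seq}\t{qual}\n" for name, seq, qual in records)
```

For compressed input, the lines could be read with `gzip.open(path, "rt")` from the Python standard library before being passed to `parse_fastq`.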

3. Pattern Filter

The pattern filtering module runs scan_for_matches to identify those reads in the library that contain the pattern specified in the patterns.tsv file.

Following the scan_for_matches syntax, it is fairly simple to specify portions of the fragment of known or unknown length and sequence, optionally allowing for mismatches, insertions and/or deletions. More details on the syntax, with examples, are available here.

The module also updates the main summary output with information on the filter results and used patterns.
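
To make the idea of a pattern with portions of known and unknown sequence more concrete, the sketch below mimics it in plain Python. It does not use scan_for_matches or its syntax. The read layout (a UMI of fixed length and unknown sequence, followed by a known barcode and a known cutsite), the function names and the default mismatch allowance are all assumptions made for illustration, and only mismatches are handled, not insertions or deletions.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_prefix(read: str, umi_len: int, barcode: str, cutsite: str,
                 max_mismatches: int = 1) -> Optional[Tuple[str, str]]:
    """Return (umi, remainder) if the read starts with UMI + barcode + cutsite.

    The UMI is any sequence of length umi_len, the barcode may carry up to
    max_mismatches mismatches, and the cutsite must match exactly.
    Returns None when the read does not fit this (hypothetical) layout.
    """
    bc_start = umi_len
    cs_start = bc_start + len(barcode)
    end = cs_start + len(cutsite)
    if len(read) < end:
        return None
    if hamming(read[bc_start:cs_start], barcode) > max_mismatches:
        return None
    if read[cs_start:end] != cutsite:
        return None
    return read[:umi_len], read[end:]
```

For example, `match_prefix("AACCGGTT" + "CATCACGC" + "AAGCTT" + "GATTACA", 8, "CATCACGC", "AAGCTT")` returns the UMI together with the sequence downstream of the cutsite.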

4. Alignment

The alignment module starts by trimming the non-genomic part of the reads, and then performs single-end or paired-end alignment with either bwa or bowtie2 (depending on user selection). Then, it uses samtools to generate bam and bai files and to update the main summary output with some general and preliminary information on the alignment. It finishes by adding the trimmed linkers back to the sam file generated by the alignment.
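
The trimming step and the later restoration of the linkers can be pictured with the short Python sketch below. It assumes a non-genomic prefix of fixed, known length and keeps the trimmed part in a lookup table keyed by read name. How the module actually writes the linkers back into the sam file is not described on this page, so that part is left out; the names are hypothetical.

```python
from typing import Dict, Tuple


def trim_prefix(seq: str, qual: str, prefix_len: int) -> Tuple[Tuple[str, str], Tuple[str, str]]:
    """Split a read into (prefix_seq, prefix_qual) and (genomic_seq, genomic_qual)."""
    prefix = (seq[:prefix_len], qual[:prefix_len])
    genomic = (seq[prefix_len:], qual[prefix_len:])
    return prefix, genomic


# Hypothetical read: 16 non-genomic bases followed by 7 genomic bases.
name, seq, qual = "r1", "AACCGGTTCATCACGC" + "GATTACA", "I" * 23

linkers: Dict[str, Tuple[str, str]] = {}
linkers[name], (genomic_seq, genomic_qual) = trim_prefix(seq, qual, 16)

# Only genomic_seq / genomic_qual would be passed to the aligner; afterwards
# the stored prefix can be looked up by read name and re-attached.
restored_seq = linkers[name][0] + genomic_seq
```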

5. Alignment filter

The alignment filter module removes secondary alignments, chimeras, R2 reads (in the case of paired-end sequencing), unmapped reads, low-quality alignments, and reads aligned to absent chromosomes (based on user choice). It then resets the position of each alignment to the 5' end of the + strand of the cutsite. Finally, it updates the main summary output with more detailed alignment information.
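
The removal criteria listed above map fairly directly onto the FLAG, RNAME and MAPQ fields of a sam record. The following Python sketch shows one way such a filter could look; it is not the module's implementation. The flag bit values come from the SAM specification, while the MAPQ threshold of 30, the use of the supplementary flag as a stand-in for chimeras, and the function name are assumptions.

```python
from typing import Optional, Set

# FLAG bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_READ2 = 0x80
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800  # used here as a proxy for chimeric alignments


def keep_alignment(sam_line: str, min_mapq: int = 30,
                   allowed_chroms: Optional[Set[str]] = None) -> bool:
    """Return True if a sam record passes all filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
    unwanted = FLAG_UNMAPPED | FLAG_READ2 | FLAG_SECONDARY | FLAG_SUPPLEMENTARY
    if flag & unwanted:
        return False
    if mapq < min_mapq:  # low-quality alignment (assumed threshold)
        return False
    if allowed_chroms is not None and rname not in allowed_chroms:
        return False  # chromosome not in the user-selected set
    return True
```

Passing `allowed_chroms=None` disables the chromosome check, which mirrors the "based on user choice" wording above.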

6. UMI analysis

The UMI analysis module performs four operations:

  • Groups reads that fall exactly on the same genomic coordinate.
  • Assigns UMIs to the closest cutsite within a maximum distance; reads farther away are considered orphans and discarded. If no cutsite list is provided, this step is skipped.
  • Removes UMIs with low read quality and performs a strict de-duplication.
  • Generates bed files with the number of de-duplicated UMIs per genomic location (i.e., cutsite).
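
The cutsite assignment and de-duplication steps can be sketched in Python as follows. This is an illustration under assumptions, not the module's code: reads are given as hypothetical `(chromosome, position, UMI)` tuples, "strict" de-duplication is taken to mean collapsing identical UMI sequences at the same location, and the quality filter is omitted.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def closest_cutsite(pos: int, cutsites: List[int], max_dist: int) -> Optional[int]:
    """Return the nearest cutsite within max_dist of pos, or None (orphan).

    cutsites must be sorted in ascending order.
    """
    i = bisect_left(cutsites, pos)
    candidates = cutsites[max(i - 1, 0):i + 1]  # neighbours on either side
    if not candidates:
        return None
    best = min(candidates, key=lambda c: abs(c - pos))
    return best if abs(best - pos) <= max_dist else None


def count_umis(reads: Iterable[Tuple[str, int, str]],
               cutsites: Dict[str, List[int]],
               max_dist: int) -> Dict[Tuple[str, int], int]:
    """Count unique UMIs per (chromosome, cutsite), discarding orphans."""
    umis = defaultdict(set)
    for chrom, pos, umi in reads:
        site = closest_cutsite(pos, cutsites.get(chrom, []), max_dist)
        if site is None:
            continue  # too far from any cutsite: orphan, discarded
        umis[(chrom, site)].add(umi)  # a set collapses identical UMIs
    return {loc: len(seqs) for loc, seqs in umis.items()}
```

Each `(chromosome, cutsite)` key and its count would then correspond to one line of the bed output.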

7. Library complexity estimation

Library complexity is estimated using the preseq package. Additional information on how to properly set up preseq is available in INSTALL.md.