#20 clean multiqc output to avoid duplicated name in the general stat… #22

Merged: 3 commits (Jan 27, 2025)
85 changes: 85 additions & 0 deletions README.md
@@ -22,6 +22,8 @@ AliNe is a pipeline written in Nextflow that aims to efficiently align reads aga
* [Usage and test](#usage)
* [Parameters](#parameters)
* [Output](#output)
* [Structure](#structure)
* [Statistics](#statistics)
* [Contributing](#contributing)

## Foreword
@@ -353,6 +355,9 @@ On success you should get a message looking like this:

## Output


### Structure

Here is a description of the typical output you will get from AliNe:

```
@@ -404,7 +409,87 @@ Here is a description of the typical output you will get from AliNe:
└── MultiQC                           # MultiQC folder that aggregates results across many samples into a single report
    ├── multiqc_report.html           # Report with interactive plots for statistics across many samples.
    └── multiqc_report_data           # Plots and data used by the multiqc_report.html
```

### Statistics

In order to compare the output of several aligners you should activate the `--fastqc` parameter. AliNe will then run FastQC on each output file, and all FastQC reports will be aggregated into a single HTML file using MultiQC. The resulting file, called `multiqc_report.html`, can be found in `<outdir>/MultiQC` (by default `<outdir>` is `alignment_results`; it can be changed with the `--outdir` AliNe parameter).
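
For example, an invocation along these lines produces the report (a sketch only: the input and aligner options are placeholders — see the [Parameters](#parameters) section for the actual flag names; only `--fastqc` and `--outdir` come from this section):

```bash
# Sketch: replace '<your usual AliNe input/aligner options>' with real flags from
# the Parameters section. --fastqc triggers FastQC on every output file plus the
# MultiQC aggregation; --outdir controls where MultiQC/multiqc_report.html lands.
nextflow run Juke34/AliNe <your usual AliNe input/aligner options> --fastqc true --outdir alignment_results
```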

FastQC collects the following information:
* Sequence Counts
* Sequence Quality
* Per Sequence Quality Scores
* Per Base Sequence Content
* Per Sequence GC Content
* Per Base N Content
* Sequence Length Distribution
* Sequence Duplication Levels
* Overrepresented sequences
* Adapter Content

#### General Statistics

Sequence Duplication Levels, Per Sequence GC Content and Sequence Counts are reported at the top of the `multiqc_report.html` file, in a table called `General Statistics`, as % Dups, % GC and M Seqs respectively (see below).

<img src="img/multiqc.png" />

To make this `General Statistics` table easier to read, you can export it as TSV using the `Export as CSV...` button and run the following piece of R code on the downloaded `general_stats_table.tsv` file:

```R
# install packages
install.packages("dplyr")
install.packages("stringr")
install.packages("tidyr")

# Load necessary libraries
library(dplyr)
library(stringr)
library(tidyr)

# Read the TSV file
file_path <- "general_stats_table.tsv"
df <- read.delim(file_path, check.names = FALSE)

# sample name as row name
rownames(df) <- df$Sample

# reshape to long format (dropping the Sample column) and clean up the metric names
tmp <- cbind(ID = rownames(df), stack(df[-1])) |>
transform(ind = as.character(ind) |> stringr::str_remove_all("\\.\\d+"))

# remove NA values
tmp <- tmp[!is.na(tmp$values),]
# turn placeholder "." values into empty strings
tmp$values <- tmp$values |> stringr::str_remove_all("^\\.$")

# pivot data
tmp <- tmp |> pivot_wider(id_cols = ID , names_from = ind, values_from = values,
values_fn = \(x) paste(unique(x), collapse = ""))

# print only the first 4 columns
tmp[1:4]
```

You will get a table similar to this one:

```
ID Dups GC Seqs
<chr> <chr> <chr> <chr>
1 yeast_R1 73.01 43 0.01
2 yeast_R1_seqkit_trim 69.21661409043114 44 0.007607999999999999
3 yeast_R1_seqkit_trim_STAR_sorted 69.54090258811289 44 0.007689
4 yeast_R1_seqkit_trim_bbmap_sorted 69.21661409043114 44 0.007607999999999999
5 yeast_R1_seqkit_trim_bowtie2_sorted 69.21661409043114 44 0.007607999999999999
6 yeast_R1_seqkit_trim_bwaaln_sorted 69.21661409043114 44 0.007607999999999999
7 yeast_R1_seqkit_trim_bwamem_sorted 69.18933123111286 44 0.007611
8 yeast_R1_seqkit_trim_bwasw_sorted 69.23279033105622 44 0.007612
9 yeast_R1_seqkit_trim_graphmap2_sorted 69.21661409043114 44 0.007607999999999999
10 yeast_R1_seqkit_trim_hisat2_sorted 69.21661409043114 44 0.007607999999999999
11 yeast_R1_seqkit_trim_kallisto_sorted 69.21661409043114 44 0.007607999999999999
12 yeast_R1_seqkit_trim_minimap2_sorted 69.63058976020739 44 0.007715
13 yeast_R1_seqkit_trim_nucmer.fixed_sorted 64.70588235294117 42 0.00016999999999999999
14 yeast_R1_seqkit_trim_sublong 53.23343848580442 44 0.007607999999999999
15 yeast_R1_seqkit_trim_subread 69.21661409043114 44 0.007607999999999999
```
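
If you want to rank the aligners afterwards, a small follow-up can be appended to the R code above (a sketch that assumes the column names shown in the example table, e.g. `Dups`; adjust them to match your own export):

```R
# Follow-up sketch: the "Dups" column name is taken from the example above and may
# differ in your export; empty cells become NA during the numeric conversion.
library(dplyr)

ranked <- tmp |>
  mutate(across(-ID, as.numeric)) |>  # metric columns are character after the pivot
  arrange(desc(Dups))                 # sort aligners by duplication level

ranked
```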


3 changes: 2 additions & 1 deletion aline.nf
@@ -513,7 +513,7 @@ workflow align {
// ------------------- QC -----------------
if(params.fastqc){
fastqc_raw(raw_reads, "fastqc/raw", "raw")
logs.mix(fastqc_raw.out)
logs.concat(fastqc_raw.out).set{logs} // save log
}

// ------------------- Standardize Score to Be Phred+33 ----------------
@@ -1092,4 +1092,5 @@ In aline.nf
Think to convert sam to bam if necessary
Think to sort bam output if necessary
Add info in README.md
Add tool in multiqc config file (at least for fastqc)
*/
16 changes: 13 additions & 3 deletions config/multiqc_conf.yml
@@ -1,4 +1,4 @@
title: "Nextflow Alignment pipeline"
title: "AliNe - Nextflow Alignment pipeline (https://github.com/Juke34/AliNe)"

run_modules:
- fastqc
@@ -45,6 +45,10 @@ module_order:
name: "FastQC (graphmap2)"
path_filters:
- "*graphmap2_logs*"
- fastqc:
name: "FastQC (hisat)"
path_filters:
- "*hisat_logs*"
- hisat2
- fastqc:
name: "FastQC (hisat2)"
Expand Down Expand Up @@ -73,6 +77,12 @@ module_order:
- "*sublong_logs*"
- star
- fastqc:
name: "FastQC (star)"
name: "FastQC (STAR)"
path_filters:
- "*star_logs*"
- "*STAR_logs*"
- fastqc:
name: "FastQC (STARlong)"
path_filters:
- "*star_logs*"
- "*starlong_logs*"
- "*STARlong_logs*"
3 changes: 2 additions & 1 deletion config/softwares.config
@@ -33,7 +33,8 @@ process {
container = 'quay.io/biocontainers/minimap2:2.26--he4a0461_1'
}
withName: 'multiqc' {
container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0'
//container = 'quay.io/biocontainers/multiqc:1.14--pyhdfd78af_0'
container = 'quay.io/biocontainers/multiqc:1.27--pyhdfd78af_0'
}
withLabel: 'mummer4' {
container = 'quay.io/biocontainers/mummer4:4.0.0rc1--pl5321hdbdd923_6'
Binary file added img/multiqc.png
2 changes: 1 addition & 1 deletion modules/bbmap.nf
@@ -83,7 +83,7 @@ process bbmap {
}

// set fileName
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_bbmap"
"""
${tool} \\
ref=${genome_index} \\
13 changes: 8 additions & 5 deletions modules/bowtie2.nf
@@ -30,7 +30,7 @@ process bowtie2 {

output:
tuple val(sample), path ("*.sam"), emit: tuple_sample_sam
path "*bowtie2.log", emit: bowtie2_summary
path "*.log", emit: bowtie2_summary

script:

@@ -48,22 +48,25 @@ process bowtie2 {
read_orientation = "--ff"
}
}

// catch filename
filename = reads[0].baseName.replace('.fastq','') + "_bowtie2"

if (params.read_type == "short_paired"){
"""
bowtie2 ${params.bowtie2_options} ${read_orientation}\\
-p ${task.cpus} \\
-x ${genome.baseName} \\
-S ${reads[0].baseName.replace('.fastq','')}_bowtie2.sam \\
-1 ${reads[0]} -2 ${reads[1]} 2> ${reads[0].baseName.replace('.fastq','')}_bowtie2.log
-S ${filename}.sam \\
-1 ${reads[0]} -2 ${reads[1]} 2> ${filename}_sorted.log
"""
} else {
"""
bowtie2 ${params.bowtie2_options} ${read_orientation}\\
-p ${task.cpus} \\
-x ${genome.baseName} \\
-S ${reads.baseName.replace('.fastq','')}_bowtie2.sam \\
-U ${reads} 2> ${reads.baseName}_bowtie2.log
-S ${filename}.sam \\
-U ${reads} 2> ${filename}_sorted.log
"""
}

14 changes: 7 additions & 7 deletions modules/hisat2.nf
@@ -34,16 +34,16 @@ process hisat2 {

output:
tuple val(sample), path ("*.sam"), emit: tuple_sample_sam
path "${sample}_splicesite.txt"
path "*hisat2-summary.txt", emit: hisat2_summary
path "*_splicesite.txt"
path "*_summary.txt", emit: hisat2_summary

script:

// catch index basename
index_basename = hisat2_index_files[0].toString() - ~/.\d.ht2l?/

// catch filename
filename = reads[0].baseName.replace('.fastq','')
filename = reads[0].baseName.replace('.fastq','') + "_hisat2"

// deal with library type - default is unstranded.
def lib_strand=""
@@ -82,14 +82,14 @@

if (params.read_type == "short_paired"){
"""
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${sample}_splicesite.txt \\
--new-summary --summary-file ${sample}.hisat2-summary.txt \\
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${filename}_splicesite.txt \\
--new-summary --summary-file ${filename}_sorted_summary.txt \\
-p ${task.cpus} -x $index_basename -1 ${reads[0]} -2 ${reads[1]} > ${filename}.sam
"""
} else {
"""
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${sample}_splicesite.txt \\
--new-summary --summary-file ${sample}.hisat2-summary.txt \\
hisat2 ${lib_strand} ${read_orientation} ${params.hisat2_options} --novel-splicesite-outfile ${filename}_splicesite.txt \\
--new-summary --summary-file ${filename}_sorted_summary.txt \\
-p ${task.cpus} -x $index_basename -U $reads > ${filename}.sam
"""
}
21 changes: 14 additions & 7 deletions modules/kallisto.nf
@@ -26,21 +26,21 @@ process kallisto_index {
process kallisto {
label 'kallisto'
tag "$sample"
publishDir "${params.outdir}/${outpath}", pattern: "*.bam", mode: 'copy'
publishDir "${params.outdir}/${outpath}", pattern: "${filename}/*.bam", mode: 'copy'

input:
tuple val(sample), path(reads), val(library), val(read_length)
path kallisto_index
val outpath

output:
tuple val(sample), path ("*.bam"), emit: tuple_sample_bam
path "*kallisto.log", emit: kallisto_summary
tuple val(sample), path ("${filename}/*.bam"), emit: tuple_sample_bam
path "*.log", emit: kallisto_summary

script:

// catch filename
filename = reads[0].baseName.replace('.fastq','')
filename = reads[0].baseName.replace('.fastq','') + "_kallisto_sorted"

// deal with library type
def read_orientation=""
@@ -61,7 +61,11 @@ process kallisto {
-t ${task.cpus} \
--pseudobam \
-i ${kallisto_index} \
${reads[0]} ${reads[1]} -o ${filename}.bam 2> ${filename}_kallisto.log
${reads[0]} ${reads[1]} -o ${filename} 2> ${filename}.log

mv ${filename}/pseudoalignments.bam ${filename}/${filename}.bam
# ensure the log file refers to the renamed output fastq file name (used by MultiQC)
sed -i 's/${reads[0]}/${filename}.fastq.gz/' ${filename}.log
"""
} else {

Expand All @@ -83,8 +87,11 @@ process kallisto {
--pseudobam \
-i ${kallisto_index} \
--single \
${reads} -o ${filename}.bam 2> ${filename}_kallisto.log
${reads} -o ${filename} 2> ${filename}.log

mv ${filename}/pseudoalignments.bam ${filename}/${filename}.bam
# ensure the log file refers to the renamed output fastq file name (used by MultiQC)
sed -i 's/${reads[0]}/${filename}.fastq.gz/' ${filename}.log
"""
}

}
8 changes: 4 additions & 4 deletions modules/star.nf
@@ -107,7 +107,7 @@ process star {
--readFilesIn pipedRead1 pipedRead2 \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')} \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_${star_tool}_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate
"""
@@ -121,7 +121,7 @@
--readFilesIn pipedRead \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')} \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_${star_tool}_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate
"""
@@ -172,7 +172,7 @@ process star2pass{
--readFilesIn pipedRead1 pipedRead2 \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_2pass \\
--outFileNamePrefix ${reads.baseName.replace('.fastq','')}_${star_tool}_2pass_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate \\
--sjdbFileChrStartEnd *SJ.out.tab
@@ -195,7 +195,7 @@
--readFilesIn pipedRead \\
--runThreadN ${task.cpus} \\
--runMode alignReads \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_2pass \\
--outFileNamePrefix ${reads[0].baseName.replace('.fastq','')}_${star_tool}_2pass_sorted \\
--outSAMunmapped Within \\
--outSAMtype BAM SortedByCoordinate \\
--sjdbFileChrStartEnd *SJ.out.tab
4 changes: 2 additions & 2 deletions modules/subread.nf
@@ -53,7 +53,7 @@ process subread {
}

// remove fastq.gz
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_subread"

// prepare index name
def index_prefix = genome.baseName + "_index"
@@ -120,7 +120,7 @@ process sublong {
script:

// remove fastq.gz
def fileName = fastq[0].baseName.replace('.fastq','')
def fileName = fastq[0].baseName.replace('.fastq','') + "_sublong"

// prepare index name
def index_prefix = genome.baseName + "_index"