- A crude pipeline for mitogenome recovery from low-coverage WGS using reference nuclear+mitogenome or just mitogenome
- This pipeline uses Geneious prime
- You need trimmed reads for running this pipeline
Checklist of softwares required
While recovering mitogenome from whole genome resequencing data we need to address potential NUMTs in our dataset. To do this we ideally require both nuclear and mitochondrial reference genomes for species of our interest or the next closest sister taxa. However, in most cases reference genomes for non-model species are not available (although this is changing rapidly). In such a case one can recover mitogenome using only mitochondrial reference of the target species or its closest sister, but this would lead to inclusion of NUMTs in the mitogenome assembly and any results of downstream analyses should be interpreted carefully. More or less the pipeline is similar for both the cases, and has been expanded upon below.
Although I have used geneious to generate a consensus and annotate, one can use any open source software to do the same.
Please note that Step 5 is common for both the case scenarios.
From NCBI or other online sources, download the nuclear and mitochondrial reference genome of your species of interest. In case no reference is available we can use a close sister species to our taxa. Merge the nuclear reference and mitochondrial reference. This will allow us to reduce the proportion of NUMTs in our mitochondrial assembly downstream.
cat nuclear_ref.fna mito_ref.fasta > final_ref.fna
Now we index the reference using bwa
bwa index ~/path/to/final_ref.fna
Step 2 - Map the trimmed reads to the indexed reference (BWA and samtools) and extract only reads mapping to the mitogenome region
Now we will map the trimmed reads of our data to the indexed reference we generated in the previous step. To do that we will use bwa.
bwa mem -t 20 -M -R "@RG\tID:\tSM:\tLB:IlluminaWGS\tPL:ILLUMINA" \ #here I have used -t flag to specify cores, -R to add read-group to bam, do not forget to add details to the bam-header
~/path/to/indexed/final_ref.fna \
~/path/to/R1.fq.gz \
~/path/to/R2.fq.gz | \
samtools view -bh - | \
samtools sort -@20 -T tmp -o output_mapped_sorted.bam
We will now extract the reads mapping to only the mitochondrial region. To do that we will first check the header of our mitochondrial reference fasta to know what to look for in the generated bam file.
head /path/to/mito_ref.fasta
>NC_042191.1 Culicicapa ceylonensis mitochondrion, complete genome
GTCCCTGTAGCTTACAAAAAGCATGACACTGAAGATGTCAAGACGGCTGCCATATGCACCCAAGGACAAA
AGACTTAGTCCTAACCTTACTGTTGGTTTTTGCTAGACATATACATGCAAGTATCCGCGCGCCAGTGTAG
The head command will print out the first few lines of the fasta file. Note down the accession number after > (in this case NC_042191.1). We are now going to use samtools to extract the reads mapping to this region.
samtools view output_mapped_sorted.bam "NC_042191.1" > output_mitomapped_sorted.bam
We now have the bam file consisting of only reads mapped to the mitogenome region. Do note that the file size reduces considerably as < 98% of the reads will map to the mitogenome.
Using qualimap bamqc we will assess the mapping coverage and quality
qualimap bamqc -bam output_mitomapped_sorted.bam -outfile results.pdf
Inspect the output to understand the coverage and quality of the reads mapped to the mitogenome region. We can now move to Step 5 to generate a consensus mitogenome and annotate.
From NCBI or other online sources, download the mitochondrial reference genome of your species of interest. In case no reference is available we can use a close sister species to our taxa.
Now we index the reference using bwa
bwa index ~/path/to/mito_ref.fasta
Now we will map the trimmed reads of our data to the indexed reference we generated in the previous step. To do that we will use bwa.
bwa mem -t 20 -M -R "@RG\tID:\tSM:\tLB:IlluminaWGS\tPL:ILLUMINA" \ #here I have used -t flag to specify cores, -R to add read-group to bam, do not forget to add details to the bam-header
~/path/to/indexed/mito_ref.fasta/ \
~/path/to/R1.fq.gz \
~/path/to/R2.fq.gz | \
samtools view -bh - | \
samtools sort -@20 -T tmp -o output_mapped_sorted.bam
Using qualimap bamqc we will assess the mapping coverage and quality
qualimap bamqc -bam output_mapped_sorted.bam -outfile results.pdf
Inspect the output to understand the coverage and quality of the reads mapped to the reference mitogenome. Note that < 98% of our reads have mapped.
Now we will extract only the reads mapped to the reference mitogenome. To do this we will use samtools
samtools view -b -F 4 -o output_mitomapped_sorted.bam output_mapped_sorted.bam
We are now ready to proceed to the Step 5 to generate consensus mitogenome and annotate.
Import the final bam file into geneious using File > Import > Files
Download the same reference mitogenome (only mitogenome reference) that was used in generating the bam file using NCBI database in the geneious software.
You can now visualize where all the reads map on the reference mitogenome.
Once you get a good idea, you can generate a consensus sequence. To do this select Tools > Generate Consensus Seqeunce
After generating the consensus sequence, transfer the annotations from the reference mitogenome sequence by checking the checkbox under Annotate and Predict tab.
Make sure to use a similarity of 40-50% to ensure that annotations are transfered from your reference to the consensus sequence.
In case we are interested in circularizing the mitogenome we can do it using Sequence > Circularize Sequence
This will give us the final circularized annotated mitogenome which has beene recovered from the whole-genome dataset