Basic Assembly Tutorial

Start a droplet with at least 1 GB of memory on DigitalOcean and SSH into it. Alternatively (and preferably), install and use boot2docker on your local machine. (See the previous tutorial on creating and securing your own DigitalOcean droplet.)
Pull the Docker image bwawrik/bioinformatics:latest
```
docker pull bwawrik/bioinformatics:latest
```
Make a data directory
```
mkdir -p /data
```
Start the Docker container and mount the /data directory. (See the previous tutorial on Docker.)
```
docker run -t -i -v /data:/data bwawrik/bioinformatics:latest
```
Change your directory to /data
```
cd /data
```

Download the sample genome data set

wget https://github.com/bwawrik/MBIO5810/raw/master/sequence_data/232_R1_40k.fastq.gz
wget https://github.com/bwawrik/MBIO5810/raw/master/sequence_data/232_R2_40k.fastq.gz

Decompress the data files
```
gunzip *.gz
```
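
Paired-end assemblers expect the forward and reverse files to hold the same number of reads. The sketch below is one way to confirm that after decompression. The function names `count_reads` and `check_pairs` are illustrative and not part of the tutorial's tooling, and the count assumes the usual four-line FASTQ record layout.

```shell
# count_reads FASTQ: number of records, assuming 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# check_pairs R1 R2: succeed only if both files hold the same number of reads.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" = "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $1 has $r1 reads, $2 has $r2 reads" >&2
        return 1
    fi
}
```

For this data set the call would be `check_pairs 232_R1_40k.fastq 232_R2_40k.fastq`.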

Note: These two files contain the forward and reverse reads of a MiSeq genome sequencing run. They are partial files so that the assembly completes in a reasonable amount of time. Together the files contain about 5*10^6 bp of sequence, which is about 1x coverage of the genome of SPR.
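
The coverage figure can be checked directly from the reads: coverage is the total number of sequenced bases divided by the genome size. The following is a minimal sketch, assuming four-line FASTQ records. The function names are illustrative, and the genome size has to be supplied by you, since the tutorial does not state it.

```shell
# total_bases FASTQ...: sum of sequence lengths across all given files.
# Line 2 of every 4-line record is the sequence.
total_bases() {
    awk 'FNR % 4 == 2 { total += length($0) } END { print total + 0 }' "$@"
}

# estimate_coverage GENOME_SIZE_BP FASTQ...: total bases / genome size.
# GENOME_SIZE_BP is an assumption provided by the caller.
estimate_coverage() {
    genome_size=$1
    shift
    total_bases "$@" | awk -v g="$genome_size" '{ printf "%.2f\n", $1 / g }'
}
```

For example, `estimate_coverage 5000000 232_R1_40k.fastq 232_R2_40k.fastq` would report the coverage for a hypothetical 5 Mb genome. Substitute the real genome size if you know it.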

Ray Assembly

Brief description of Ray:

Ray is parallel software that computes de novo genome assemblies from next-generation sequencing data. Ray is written in C++ and can run in parallel on many interconnected computers using the Message Passing Interface (MPI) standard.

Run a Ray assembly with a k-mer setting of 31 as follows

Ray -k 31 -p 232_R1_40k.fastq 232_R2_40k.fastq -o ray_31/

To run the same assembly on multiple cores (e.g. six), use mpiexec. Note that this does not work yet: it requires Open MPI to work inside the Docker container, which it currently does not.
```
mpiexec -n 6 Ray -k 31 -p 232_R1_40k.fastq 232_R2_40k.fastq -o ray_31/
```

Velvet Assembly

Brief description of Velvet:

Velvet is a de novo genomic assembler specially designed for short-read sequencing technologies, such as Solexa or 454. It was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom. Velvet takes in short read sequences, removes errors, and then produces high-quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.

Let's try a Velvet assembly.

velveth velvet/ 31 -shortPaired -fastq -separate 232_R1_40k.fastq 232_R2_40k.fastq
velvetg velvet/

Download the N50 Perl script

wget https://github.com/bwawrik/MBIO5810/raw/master/perl_scripts/N50.pl

Then assess the N50 stats on both assemblies.

perl N50.pl velvet/contigs.fa
perl N50.pl ray_31/Contigs.fasta
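
N50 is the contig length L at which contigs of length L or longer together contain at least half of all assembled bases. As a cross-check on the script's output, the statistic can be sketched with standard tools. The function names below are illustrative, and `N50.pl` may apply conventions (such as a minimum contig length) that this sketch does not.

```shell
# contig_lengths FASTA: print one contig length per line.
# Sequence lines are summed per record, so wrapped FASTA is handled.
contig_lengths() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }
        { len += length($0) }
        END { if (seen) print len }
    ' "$1"
}

# n50 FASTA: walk contigs from longest to shortest and print the length
# at which the running total first reaches half of the assembly size.
n50() {
    contig_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }
    '
}
```

Running `n50 velvet/contigs.fa` and `n50 ray_31/Contigs.fasta` gives one number per assembly to compare against the Perl script.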

Predict protein-coding genes from both assemblies

prodigal -d temp.orfs.ray.fna -a temp.orfs.ray.faa -i ray_31/Contigs.fasta -m -o Ray_temp.txt -p meta -q
prodigal -d temp.orfs.velvet.fna -a temp.orfs.velvet.faa -i velvet/contigs.fa -m -o velvet_temp.txt -p meta -q
cut -f1 -d " " temp.orfs.ray.faa > orfs.ray.faa
cut -f1 -d " " temp.orfs.velvet.faa > orfs.velvet.faa
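
The `cut` step keeps only the first space-delimited field of each line. That shortens Prodigal's long FASTA headers to the bare sequence identifier and leaves the sequence lines, which contain no spaces, untouched. The sketch below wraps that step and adds a simple ORF count for comparing the two assemblies. The function names are illustrative.

```shell
# trim_headers FASTA: keep only the text before the first space on each line.
# Lines without a space (the sequences) pass through unchanged.
trim_headers() {
    cut -f1 -d " " "$1"
}

# count_orfs FASTA: number of records, i.e. lines starting with ">".
count_orfs() {
    grep -c '^>' "$1"
}
```

For example, `count_orfs orfs.ray.faa` and `count_orfs orfs.velvet.faa` report how many genes were predicted from each assembly.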

Self-Examination

Which assembly is faster? Which assembly is better? Why?
