Basic Assembly Tutorial

  1. Start a Droplet with at least 1 GB of memory on Digital Ocean and SSH into it. Alternatively (and preferably), install and use boot2docker on your local machine. (See the previous tutorial on creating and securing your own Digital Ocean droplet.)

  2. Download the Docker image bwawrik/bioinformatics:latest

    docker pull bwawrik/bioinformatics:latest
    
  3. Make a data directory

    mkdir /data
  4. Start the Docker container and mount the /data directory. (See the previous tutorial on Docker.)

    docker run -t -i -v /data:/data bwawrik/bioinformatics:latest
  5. Change your directory to /data

    cd /data
  6. Download the sample genome data set

    wget https://github.com/bwawrik/MBIO5810/raw/master/sequence_data/232_R1_40k.fastq.gz
    wget https://github.com/bwawrik/MBIO5810/raw/master/sequence_data/232_R2_40k.fastq.gz
  7. Unzip the data files

    gunzip *.gz

Note: These two files contain the forward and reverse reads of a MiSeq genome sequencing run. They are partial files so that the assembly completes in a reasonable amount of time. Together the files contain about 5*10^6 bp of sequence, which is about 1x coverage of the genome of SPR.
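The sequence total and coverage quoted in the note can be checked directly from the unzipped files. The following is a minimal sketch and not part of the original tutorial: `fastq_stats` and `estimate_coverage` are hypothetical helper names, and both assume plain four-line FASTQ records (the sequence is the second line of every group of four). The genome size is a value you supply yourself.

```shell
# Count reads and bases in an uncompressed FASTQ file.
# Assumes the standard 4-line-per-record layout (no wrapped sequences).
fastq_stats() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "%d reads, %d bases\n", reads, bases }' "$1"
}

# Estimate fold coverage: total bases in all given files / genome size.
# The first argument is the genome size in bp (an assumed value supplied
# by the caller); the remaining arguments are FASTQ files.
estimate_coverage() {
    genome_size=$1
    shift
    awk -v g="$genome_size" 'FNR % 4 == 2 { bases += length($0) }
         END { printf "%.2fx\n", bases / g }' "$@"
}
```

For example, `fastq_stats 232_R1_40k.fastq` prints the read and base counts for the forward file, and `estimate_coverage 5000000 232_R1_40k.fastq 232_R2_40k.fastq` divides the combined base count by a genome size of 5 Mbp. That number is only a placeholder; substitute the real genome size of your organism.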

Ray Assembly

Brief description of Ray:

Ray is parallel software that computes de novo genome assemblies from next-generation sequencing data. Ray is written in C++ and can run in parallel on numerous interconnected computers using the message-passing interface (MPI) standard.

  1. Run a Ray assembly with a k-mer length of 31 as follows

    Ray -k 31 -p 232_R1_40k.fastq 232_R2_40k.fastq -o ray_31/
  2. To run the assembly on multiple cores (e.g. six), launch Ray through mpiexec. (This does not work yet: it requires Open MPI, which is not yet working in the Docker image.)

    mpiexec -n 6 Ray -k 31 -p 232_R1_40k.fastq 232_R2_40k.fastq -o ray_31/

Velvet Assembly

Brief description of Velvet:

Velvet is a de novo genomic assembler designed for short-read sequencing technologies, such as Solexa or 454. It was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom. Velvet takes in short read sequences, removes errors, and then produces high-quality unique contigs. It then uses paired-end read and long-read information, when available, to retrieve the repeated areas between contigs.

  1. Let's try a Velvet assembly.

    velveth velvet/ 31 -shortPaired -fastq -separate 232_R1_40k.fastq 232_R2_40k.fastq
    velvetg velvet/
  2. Download the N50 Perl script

    wget https://github.com/bwawrik/MBIO5810/raw/master/perl_scripts/N50.pl
  3. Then assess the N50 stats on both assemblies.

    perl N50.pl velvet/contigs.fa
    perl N50.pl ray_31/Contigs.fasta
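As a cross-check on what the N50 figure means, the statistic can also be computed with standard tools. This is a sketch under the usual definition (sort contigs from longest to shortest and report the length at which the running total first reaches half of all assembled bases); `fasta_n50` is a hypothetical helper name, and it is independent of N50.pl, whose exact output format is not shown here.

```shell
# Print the N50 of a FASTA file: the contig length L such that contigs of
# length >= L together contain at least half of all assembled bases.
fasta_n50() {
    # Stage 1: emit one length per record (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Stage 2: longest contigs first.
    sort -rn |
    # Stage 3: walk down the sorted lengths until half the total is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

Running `fasta_n50 velvet/contigs.fa` and `fasta_n50 ray_31/Contigs.fasta` should give one number per assembly. A larger N50 means more of the assembly sits in long contigs, but it says nothing about whether those contigs are correct.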

Predict protein-coding genes from both assemblies

    prodigal -d temp.orfs.ray.fna -a temp.orfs.ray.faa -i ray_31/Contigs.fasta -m -o Ray_temp.txt -p meta -q
    prodigal -d temp.orfs.velvet.fna -a temp.orfs.velvet.faa -i velvet/contigs.fa -m -o velvet_temp.txt -p meta -q
    cut -f1 -d " " temp.orfs.ray.faa > orfs.ray.faa
    cut -f1 -d " " temp.orfs.velvet.faa > orfs.velvet.faa
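The `cut` step keeps only the text before the first space on each line, which shortens each FASTA header to the bare ORF identifier. The sketch below wraps that step and adds a record count so the two gene predictions can be compared. The function names are hypothetical, and the header layout mentioned in the comments (identifier followed by ` # `-separated coordinates) is an assumption about Prodigal's output that you should confirm against your own files.

```shell
# Keep only the first space-delimited field of every line. On a header such
# as ">contig-1_1 # 2 # 181 # 1 # ID=1_1;..." (assumed layout) this leaves
# ">contig-1_1"; sequence lines contain no spaces and pass through unchanged.
strip_header_comments() {
    cut -f1 -d " " "$1"
}

# Count the records in a FASTA file (one header line per record).
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For instance, `count_fasta_records orfs.ray.faa` and `count_fasta_records orfs.velvet.faa` report how many protein sequences were predicted from each assembly.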

Self-Examination

Which assembly is faster? Which assembly is better? Why?
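One way to answer the speed question is to time each assembly command. The wrapper below is a minimal sketch; `time_cmd` is a hypothetical name, and it measures wall-clock time in whole seconds only, so very short runs will all report 0.

```shell
# Run a command, report its wall-clock time in whole seconds on stderr,
# and pass the command's exit status back to the caller.
time_cmd() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start)) s" >&2
    return "$status"
}
```

Prefix each assembly command with the wrapper, for example `time_cmd velvetg velvet/`. For a fair comparison with Ray, add the time taken by `velveth` to that of `velvetg`, since Velvet needs both steps.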