-
Notifications
You must be signed in to change notification settings - Fork 2
SAMtools workflow
the linux image does not come with wgsim, the following code snippet will use git to clone the wgsim repository, and then compile the code (instructions are in the file README), and install the wgsim binary executible using sudo;
cd ~/Downloads/ git clone https://github.com/lh3/wgsim.git cd wgsim gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm #compile echo $PATH ##check your path ## Password is visg_user sudo cp wgsim /usr/local/sbin
Make sure it works...
## Sanity check; wgsim
- Workshop code
- Samtools tutorial
- Piping with samtools
- SAMtool bitwise flag meaning explained
- explain flags
- BWA documentation
Typing samtools in the shell terminal invokes basic help, samtools expects a verb <command></command> with [options] to do various tasks, typing in a verb also invokes basic help e.g. samtools view
Go to the samtools workflow directory;
cd $HOME/VISG-course-2012/supplementary_samtools_usage
Executing the following bash script simulates a paired end illumina dataset using wgsim (available in SAMtools), and runs the bwa aligner to create a paired end sam file.
$ ./init.sh
- Sanity check;
$ ls -l | grep \\.sam$ -rw-r--r-- 1 visg_user visg 1248911 18 Oct 16:55 aln.sam
The unix command file describes a filetype;
echo "[Type]" >&2 file aln.sam
Storing sequence alignment maps as binary bam files is efficient in terms of compression. Most samtools <commands></commands> operate on bam files;
qual=20 echo "[sam]" >&2 samtools view -q $qual -bS -o aln.bam aln.sam
the filetype of this file should be a binary format;
echo "[Type]" >&2 file aln.bam
The optional argument samtool view -H outputs the header information of a bam file;
echo "[Printing]" >&2 samtools view -H aln.bam
Using samtools view -S to view a sam file, we are piping the output perl and awk , in this example to extract only the second column;
echo "[Examining]" >&2 samtools view -S aln.sam | perl -lane 'print @F[1]' | head -n 10 echo "[Examining]" >&2 samtools view -S aln.sam | awk '{print $2}' | head -n 10
Repeating above to extract a specific column from a bam file;
echo "[Examining]" >&2 samtools view aln.bam | perl -lane 'print @F[1]' | head -n 10 echo "[Examining]" >&2 samtools view aln.bam | awk '{print $2}' | head -n 10
Using flow control, a loop is iterating the variable $i from 1 to 11, and print the command required to extract column $i. Indenting is used inside the loop for code readability;
for i in {1..11}; do echo "[Command]" >&2 echo "samtools view -S aln.sam | awk '{print \$$i}' | head -n 10" >&2 done
Executing samtools sort sorts the reads by geographic location based on alignment position (column 4 pos). Many downstream visualization programs only operate on sorted bam files, also required for indexing.
echo "[sorting]" >&2 samtools sort aln.bam aln_sorted
Current there should be no index file in existence which has a .bai extension. An index file is required to index alignment reads to all references which allows post processing filtering on specific references by region of interest.
echo "[list]" >&2 ls -l | grep "\.bai$"
Create an index using samtools index;
echo "[create]" >&2 samtools index aln_sorted.bam
Any sorted bam file that is indexed should now contain the extension .bam.bai;
echo "[list]" >&2 ls -l | grep "\.bai$"
The statistics from samtools idxstats output alignment index statistics;
echo "[bam]" >&2 echo "refname length mapped" >&2 samtools idxstats aln_sorted.bam
These are of the form;
echo "[simple]" >&2 samtools flagstat aln_sorted.bam
echo "[Count]" >&2 samtools view -c aln_sorted.bam
This is a text based viewer, useful for a quick snapshot view of sorted bam files.
echo "[Running]" >&2 sleep 5 samtools tview aln_sorted.bam ../05.reference/pool1.fasta
Some third party tools only work with sam files, samtools view -o [file.sam] writes output to a sam file;
samtools view -o aln_sorted.sam aln_sorted.bam
This is an example merging multiple bam files derived from single end mapping of the roche 454 dataset into a sinlge bam file.
samtools merge pooled1.bam ../11.Samtools_Filter_Pool1/Pool1_Pop[1-8].filtered.bam
In order to filter, we need know what reference sequence and sequence range to filter on, we can retrieve useful information from the header of the bam file;
echo "[Printing]" >&2 samtools view -H aln.bam
Providing the optional argument [region1] = CO_Pool1_contig00004:500-1000, samtools will filter to STDOUT only those sequences on a sorted indexed bam file
echo "[Filter]" >&2 samtools view aln_sorted.bam CO_Pool1_contig00004:500-1000 | head -n 10
Piping the output towc -l will count of sequences in the region, or just use samtools view -c;
echo "[How]" >&2 samtools view aln_sorted.bam CO_Pool1_contig00004:500-1000 | wc -l echo "[using]" >&2 samtools view -c aln_sorted.bam CO_Pool1_contig00004:500-1000
Piping multiple samtools commands saves on disk space by streaming the output of a samtools command as the input of another in a sequential chain, however if an error occurs it more is difficult to trace where exactly the error is;
echo "[SAM]" >&2 samtools view -bS aln.sam | samtools sort - aln_sorted
The - symbol in samtools it to tell the samtools program to take the input from pipe