Skip to content
mdavy86 edited this page Dec 6, 2012 · 22 revisions

Table of Contents

Prerequisites

the linux image does not come with wgsim, the following code snippet will use git to clone the wgsim repository, and then compile the code (instructions are in the file README), and install the wgsim binary executible using sudo;

cd ~/Downloads/
git clone https://github.com/lh3/wgsim.git
cd wgsim

gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm

#compile
echo $PATH ##check your path

## Password is visg_user
sudo cp wgsim /usr/local/sbin 

Make sure it works...

## Sanity check;
wgsim  

Resources

Documentation

BWA documentation

Command line usage

Typing samtools in the shell terminal invokes basic help, samtools expects a verb <command></command> with [options] to do various tasks, typing in a verb also invokes basic help e.g. samtools view

SAMtools workflow

Go to the samtools workflow directory;

 cd $HOME/VISG-course-2012/supplementary_samtools_usage

0. Construct a sam file

Executing the following bash script simulates a paired end illumina dataset using wgsim (available in SAMtools), and runs the bwa aligner to create a paired end sam file.

 $ ./init.sh
Sanity check;
 $ ls -l | grep \\.sam$
 -rw-r--r--   1 visg_user visg  1248911 18 Oct 16:55 aln.sam

1. Testing aln.sam file type

The unix command file describes a filetype;

 echo "[Type]" >&2 
 file aln.sam

2. Converting sam to bam files

Storing sequence alignment maps as binary bam files is efficient in terms of compression. Most samtools <commands></commands> operate on bam files;

 qual=20
 echo "[sam]" >&2 
 samtools view -q $qual -bS -o aln.bam aln.sam

3. Testing aln.bam file type

the filetype of this file should be a binary format;

 echo "[Type]" >&2 
 file aln.bam

4. Printing header information

The optional argument samtool view -H outputs the header information of a bam file;

 echo "[Printing]" >&2
 samtools view -H aln.bam

5. Extracting specific column of interest from body of sam file

Using samtools view -S to view a sam file, we are piping the output perl and awk , in this example to extract only the second column;

 echo "[Examining]" >&2 
 samtools view -S aln.sam  | perl -lane 'print @F[1]' | head -n 10
 echo "[Examining]" >&2 
 samtools view -S aln.sam  | awk '{print $2}' | head -n 10 

6. Extracting specific column of interest from body of bam file

Repeating above to extract a specific column from a bam file;

 echo "[Examining]" >&2 
 samtools view aln.bam  | perl -lane 'print @F[1]' | head -n 10
 echo "[Examining]" >&2 
 samtools view aln.bam  | awk '{print $2}' | head -n 10

7. Construct commands to explore other columns

Using flow control, a loop is iterating the variable $i from 1 to 11, and print the command required to extract column $i. Indenting is used inside the loop for code readability;

 for i in {1..11}; 
 do
  echo "[Command]" >&2 
  echo "samtools view -S aln.sam  | awk '{print \$$i}' | head -n 10" >&2 
 done

8. Sorting the bam file

Executing samtools sort sorts the reads by geographic location based on alignment position (column 4 pos). Many downstream visualization programs only operate on sorted bam files, also required for indexing.

 echo "[sorting]" >&2 
 samtools sort aln.bam aln_sorted

9. Testing for an index file

Current there should be no index file in existence which has a .bai extension. An index file is required to index alignment reads to all references which allows post processing filtering on specific references by region of interest.

 echo "[list]" >&2 
 ls -l | grep "\.bai$"

10. Create an index

Create an index using samtools index;

 echo "[create]" >&2
 samtools index aln_sorted.bam

11. Checking that the index file has been created

Any sorted bam file that is indexed should now contain the extension .bam.bai;

 echo "[list]" >&2
 ls -l | grep "\.bai$"

12. alignment statistics (idxstats)

The statistics from samtools idxstats output alignment index statistics;

 echo "[bam]" >&2
 echo "refname                 length  mapped" >&2
 samtools idxstats aln_sorted.bam

These are of the form;




13. alignment statistics (flagstat)

 echo "[simple]" >&2
 samtools flagstat aln_sorted.bam

14. Counting matches in a bam file

 echo "[Count]" >&2
 samtools view -c aln_sorted.bam

15. samtools text alignment viewer (requires sorted indexed bam)

This is a text based viewer, useful for a quick snapshot view of sorted bam files.

 echo "[Running]" >&2
 sleep 5
 samtools tview aln_sorted.bam ../05.reference/pool1.fasta

16. Backtransforming from bam to sam (some tools require sam format)

Some third party tools only work with sam files, samtools view -o [file.sam] writes output to a sam file;

 samtools view -o aln_sorted.sam aln_sorted.bam

17. merging several bam files (454 example)

This is an example merging multiple bam files derived from single end mapping of the roche 454 dataset into a sinlge bam file.

 samtools merge pooled1.bam ../11.Samtools_Filter_Pool1/Pool1_Pop[1-8].filtered.bam 

18. Filter on specific reference region

In order to filter, we need know what reference sequence and sequence range to filter on, we can retrieve useful information from the header of the bam file;

 echo "[Printing]" >&2
 samtools view -H aln.bam

19. Filter on reference CO_Pool1_contig00004, region 500-1000

Providing the optional argument [region1] = CO_Pool1_contig00004:500-1000, samtools will filter to STDOUT only those sequences on a sorted indexed bam file

 echo "[Filter]" >&2
 samtools view aln_sorted.bam CO_Pool1_contig00004:500-1000 | head -n 10

20. How many records are in the selected CO_Pool1_contig00004 region 500-1000?

Piping the output towc -l will count of sequences in the region, or just use samtools view -c;

 echo "[How]" >&2
 samtools view aln_sorted.bam CO_Pool1_contig00004:500-1000 | wc -l
 echo "[using]" >&2
 samtools view -c aln_sorted.bam CO_Pool1_contig00004:500-1000

21. Piping samtools commands (doesnt store intermediate files)

Piping multiple samtools commands saves on disk space by streaming the output of a samtools command as the input of another in a sequential chain, however if an error occurs it more is difficult to trace where exactly the error is;

 echo "[SAM]" >&2
 samtools view -bS aln.sam | samtools sort - aln_sorted

The - symbol in samtools it to tell the samtools program to take the input from pipe