Skip to content

Quality control workflow

mdavy86 edited this page Oct 25, 2012 · 18 revisions

Table of Contents

Resources

Documentation

In R we can also the documentation using tkWidgets, in order to do this we must install the R package;



Now to interact with vignette documentation and code chunks

 library(tkWidgets)
 vExplorer()

Select the package Shortead => Overview.Rnw, and click on view PDF. Code chunks can be executed in sequential order while reading the documentation.

Quality control workflow

The help manual pages can be invoked for any core/contributed function run within this workflow e.g.

 library(ShortRead)
 ## Display the help page for the qa() function
 help(qa)

The ShortRead package contains a function called qa() which creates a quality report from several input formats, (usually a Fastq file).

 type=c("SolexaExport", "SolexaRealign", "Bowtie", "MAQMap", "MAQMapShort", "fastq", "BAM")

This function produces similar outputs to FastqQC but executes slightly faster. Functionality has been written to perform quality control either;

  1. on a ShortReadQ structure of NGS reads loading into R using readFastq()
  2. Specify the directory to run qa() directly from the NGS files.

0. Setup

Go to the QC workflow directory;

 cd $HOME/VISG-course-2012/supplementary_QC

The first command clears out any existing R objects from your current workspace

 # Clear working directory objects
 rm(list=ls())

Any contributed libraries need to be installed once, but loaded each time they are used with either require(), or library();

require(ShortRead)

Loading help documentation vignette;

 # Getting vignette help
 if(interactive()) vignette("Overview") ## Quality assessment, page 8 

The next commands source some global variables in the file ~/VISG-course-2012/supplementary_QC/globals.R

# Globals
cat("[ Code to source() ]\n")
cat(readLines("globals.R"), sep="\n")

# source globals code
source("globals.R")

cat("[ Directory and Fastq filenames ]\n")
print(indir)
print(fileNames)

1. Path to first example Fastq file

Define the path fo the first Fastq file in fileNames object to the R object 'file1;

 file1 <- file.path(indir, fileNames[1])
 print(file)

2. Read in Fastq example

Read in the first Fastq file into R using ShortReads readFastq() which creates an instance of the ShorReadQ class. This structure contains the sequence data, quality information, and header information;

 rfq       <- readFastq(file1, qualityType="Auto")

3. Show method

Using print() or show() displays auxillary information about the ShortReadQ structure, in this Roche 454 example we can tell that there is a variable number of cycles from the ragged array length: 4936 reads; width: 71..1200 cycles information;

 print(rfq)

4. Class of rfq is an instance of ShortReadQ S4 class

Checking the class;

 class(rfq)

5. Print all S4 class slotNames

Checking the slot names for the S4 class, usually there are accessor methods of the same name as the slots, in this case there are function methods sread(), quality(), and id(), to access slot components of the ShortReadQ object;

 slotNames(rfq)

6. sread() accessor method

Accessor method for the underlying reads, note only a truncated summary of the variable width reads are printed not the entire set of NGS reads;

 sread(rfq)
  A DNAStringSet instance of length 4936
       width seq
   [1]   263 GACTACACGTAGTATGGCTCGCTGTTTCCAAC...GTCTCTCAAGGCACACAGGGGAGTAGGNGNNN
   [2]   161 GACTACACGTAGTATGGCTCGCTGGTTGGCGA...TCAAGGCACACAGGGGATAGGNNNNNNNNNNN
   [3]   196 GACTACACGTAGTATGGGCTCGCTGGAAACCT...GTCGGCGTCTCTCAAGGCACACAGGGGATAGG
   [4]   318 GACTACACGTAGTATGGCTCGCTGCCCGTTCG...CAAGGCACACAGGAGGTAGGAGNNNNNNNNNN
   [5]   405 GACTACACGTAGTATGGGCTCGCTGTAAAGTT...NNNNNNNNNNNNNNTNNTNNNNNNNNNNNNNN
   [6]   231 GACTACACGTAGTATGGGCTCGCTGTCCTACT...GTCGGCGTCTCTCAAGGCACACAGGGGATAGG
   [7]   236 GACTACACGTAGTATGGGCTCGCTGGTAATTC...GTCGGCGTCTCTCAAGGCACACAGGGGATAGG
   [8]   509 GACTACACGTAGTATGGGCTCGCTGGGCCACA...CTCAAGGCACACAGGAGTGAGNNNNNNNNNNN
   [9]   516 GACTACACGTAGTATGGGCTCGCTGCATGGCA...GTACGACGAGNGNGNGNGNGNGNNNNNNNNNN
   ...   ... ...
 [4928]   637 GACTACACGTAGTATGGGCTCGCTGCATATAA...AAGTCGTAAAGTAGTAAAGGTTCGTNTAACGG
 [4929]   516 GACTACACGTAGTATGGGCTCGCTGCCAATTG...GATGGATGCCGAAAAGGGCGTAGGCTTTAAGG
 [4930]   525 GACTACACGTAGTATGGGCTCGCTGTAAATGT...TTTAGGTATTATNTTACATGTAATGCTAAAAC
 [4931]   533 GACTACACGTAGTATGGGCTCGCTGATGCACG...AGGTGGTTCGGTCGGTCGTCCTCGTTAACCGG
 [4932]   631 GACTACACGTAGTATGGGCTCGCTGGTTTTTC...CTACTACGTTATTACGTTTAGTCGTTTTAAGG 
 [4933]   777 GACTACACGTAGTATGGGCTCGCTGTCATTTT...GTACGTTCCGTTTCGTAACGGTCTTTACGGGG
 [4934]   449 GACTACACGTAGTATGGGCTCGCTGCTGAACT...GGTCTCTCAAGGCACACAAGGGGCTAGGNNNN
 [4935]   557 GACTACACGTAGTATGGGCTCGCTGCCCTGGG...AACGAAACTTATAGAAAATTACTNAACTTAAG
 [4936]   621 GACTACACGTAGTATGGGGCTCGCTGATTATT...TTAACGGTAACCGTAGAACCTTTTTTAAAAGG

7. id() accessor method

Accessor method for the underlying header ids, note only a truncated summary of the reads are printed not the entire set of NGS reads; id(rfq)

  A BStringSet instance of length 4936
       width seq
   [1]    14 GYSSWG201ANAAU
   [2]    14 GYSSWG201AKKAN
   [3]    14 GYSSWG201BC6S3
   [4]    14 GYSSWG201BK297
   [5]    14 GYSSWG201BF59S
   [6]    14 GYSSWG201APQ25
   [7]    14 GYSSWG201ARCVR
   [8]    14 GYSSWG201B2QSX
   [9]    14 GYSSWG201BUQXE
   ...   ... ...
 [4928]    14 GYSSWG201AO4TW
 [4929]    14 GYSSWG201BYFJ6
 [4930]    14 GYSSWG201AH034
 [4931]    14 GYSSWG201BKBW3
 [4932]    14 GYSSWG201A3DQB
 [4933]    14 GYSSWG201A6O0R
 [4934]    14 GYSSWG201BUNO4
 [4935]    14 GYSSWG201BTU4B
 [4936]    14 GYSSWG201ALL3I

8. quality() accessor method

Accessor method for the underlying quality scores, note only a truncated summary of the variable width scores are printed not the entire set of quality scores;

quality(rfq) class: FastqQuality quality:

  A BStringSet instance of length 4936
       width seq
   [1]   263 IIIIIIIIIIIIIIFC?DGIIIE;334<<III...?@??<75444457755557533-21..!/!!!
   [2]   161 IIIIIIIIIIIIIIIHHIIIIII;;99;;I?3...IIHHHHHIIIIIEEBBIHHHH!!!!!!!!!!!
   [3]   196 IIIIIIIIIIIIIIF777EIIIIIBB256EEI...IIIIIIIIIIIIIIIIIIIIIIIFEEFIIIBB
   [4]   318 IIIIIIIIIIIIIF@???EIIIIIIIIIIIII...111556631111..--..--.-!!!!!!!!!!
   [5]   405 IIIIEEEEGIEIFB;0//8>:8680,,-0300...!!!!!!!!!!!!!!,!!,!!!!!!!!!!!!!!
   [6]   231 IIIIIIIIIIIIIIH777IIIIIIIIIIIIII...IIIIIIIIIIIIIIIIIIFFFII>@==>DFFF
   [7]   236 IIIIEEEEIIIIIEA4449BIIIFDCBGGIII...7=>===<?@@@@@@@;;;>;;;<::::;;==@
   [8]   509 IIIIEEEIIIIIIED444?GFFBB333GGIII...11111---++++---0000-+!!!!!!!!!!!
   [9]   516 IIIHBBBBEEBBB@@111>DAD@DDDDECCEC...++++++++++!+!+!+!+!+!+!!!!!!!!!!
   ...   ... ...
 [4928]   637 IIIIEEBBBFB:::===CEB@D?FAAAB8......11111-++++++--0000-----0+!++++++
 [4929]   516 IIIIEEFFEEF==86310?@>>>>>344555B...--010---+++++1130-++++--0001--++
 [4930]   525 IIIIIIIIIIIIIII444IIIIEF989<<GEB...0-++++++++++!+++-001133555555779
 [4931]   533 IIIIEEEEIIEEEBAAADBGEIEIIIHFFGEG...+++++++++++++++-++++++++++++++++
 [4932]   631 IIIIEEEEIIEEEBB300<@>@:7---.5001...0-----0-------01-....-./33448899
 [4933]   777 IIIGEFDC>>=A7777889:89877577----...+++++--++++++++-+++++++++++++-00
 [4934]   449 IIIIEEEEIIIIEEEGGGIIIIIIIHDGGGC<...,,-,,,--2222.....265544,,,,,!!!!
 [4935]   557 IIIIIIEBEEDDD=90..8:69897,,,----...--+++++++++---+++++++++!+++++--1
 [4936]   621 IIIIIIEEIIIIIIIBBBBG>==?@>88333/...---+++++++--1111++,,......111355

9. Run Quality control script

Running qa() on the ShortReadQ structure of NGS reads in the R object rfq;

 qaSummary <- qa(rfq, lane="Roche 454")

10. Create html report

This generic function summarizes results from evaluation of qa() into a report. Available report formats vary depending on the data analysed;

 fname <- report(qaSummary, dest="/tmp")
 print(fname)

11. Open html report

Load a given URL into a WWW browser, in this case an html file called index.html on the filesystem;

 if(interactive()) browseURL(fname)

We can execute any published code snippet for any QC component in the html report. The code example is opied from the report but we had to replace 'qa' with the name of the R object created in (9) 'qaSummary';

 df <- qaSummary[["readQualityScore"]]
 ShortRead:::.plotReadQuality(df[df$type=="read",])

12. Snippet to Investigate qaSummary structure

This code snippet traverses through all the qaSummary object elements and prints a the header information of their underlying structure.

cat("[ qaSummary structure ]\n")
for(i in names(qaSummary)) {
  if(i == "perTile") next;
  if(i == "perCycle") {
    for(j in names(qaSummary[[i]])) {
      cat("[", i, "][", j, "]\n")
      print(head(qaSummary[[i]]nowiki>[[j]]))
      cat("...\n\n")
    }
  } else {
    cat("[", i, "]\n")
    print(head(qaSummary<tt>[</tt>[i]]))
    cat("...\n\n")
  }
}

The R function str() is a formal way to compactly display the structure of an arbitrary R object

13. Generate qa reports for all Fastq files

For a set of Fastq fileNames, create an output directory, and run qa() on each file, and generate an html report;

for(f in fileNames) {
  # Truncate the ".fastq$" off names
  fdir      <- gsub("\\..+?$", "", f)
  # Generate a qc object
  qaSummary <- qa(indir, pattern=f, lane=fdir, type="fastq")
  # Generate an html report
  fname     <- report(qaSummary, dest=file.path(path, fdir))
  # Be verbose
  cat("[", fdir , "=>", fname, "]\n")
}