-
Notifications
You must be signed in to change notification settings - Fork 2
Quality control workflow
In R we can also the documentation using tkWidgets, in order to do this we must install the R package;
Now to interact with vignette documentation and code chunks
library(tkWidgets) vExplorer()
Select the package Shortead => Overview.Rnw, and click on view PDF. Code chunks can be executed in sequential order while reading the documentation.
The help manual pages can be invoked for any core/contributed function run within this workflow e.g.
library(ShortRead) ## Display the help page for the qa() function help(qa)
The ShortRead package contains a function called qa() which creates a quality report from several input formats, (usually a Fastq file).
type=c("SolexaExport", "SolexaRealign", "Bowtie", "MAQMap", "MAQMapShort", "fastq", "BAM")
This function produces similar outputs to FastqQC but executes slightly faster. Functionality has been written to perform quality control either;
- on a ShortReadQ structure of NGS reads loading into R using readFastq()
- Specify the directory to run qa() directly from the NGS files.
Go to the QC workflow directory;
cd $HOME/VISG-course-2012/supplementary_QC
The first command clears out any existing R objects from your current workspace
# Clear working directory objects rm(list=ls())
Any contributed libraries need to be installed once, but loaded each time they are used with either require(), or library();
require(ShortRead)
Loading help documentation vignette;
# Getting vignette help if(interactive()) vignette("Overview") ## Quality assessment, page 8
The next commands source some global variables in the file ~/VISG-course-2012/supplementary_QC/globals.R
# Globals cat("[ Code to source() ]\n") cat(readLines("globals.R"), sep="\n") # source globals code source("globals.R") cat("[ Directory and Fastq filenames ]\n") print(indir) print(fileNames)
Define the path fo the first Fastq file in fileNames object to the R object 'file1;
file1 <- file.path(indir, fileNames[1]) print(file)
Read in the first Fastq file into R using ShortReads readFastq() which creates an instance of the ShorReadQ class. This structure contains the sequence data, quality information, and header information;
rfq <- readFastq(file1, qualityType="Auto")
Using print() or show() displays auxillary information about the ShortReadQ structure, in this Roche 454 example we can tell that there is a variable number of cycles from the ragged array length: 4936 reads; width: 71..1200 cycles
information;
print(rfq)
Checking the class;
class(rfq)
Checking the slot names for the S4 class, usually there are accessor methods of the same name as the slots, in this case there are function methods sread(), quality(), and id(), to access slot components of the ShortReadQ object;
slotNames(rfq)
Accessor method for the underlying reads, note only a truncated summary of the variable width reads are printed not the entire set of NGS reads;
sread(rfq) A DNAStringSet instance of length 4936 width seq [1] 263 GACTACACGTAGTATGGCTCGCTGTTTCCAAC...GTCTCTCAAGGCACACAGGGGAGTAGGNGNNN [2] 161 GACTACACGTAGTATGGCTCGCTGGTTGGCGA...TCAAGGCACACAGGGGATAGGNNNNNNNNNNN [3] 196 GACTACACGTAGTATGGGCTCGCTGGAAACCT...GTCGGCGTCTCTCAAGGCACACAGGGGATAGG [4] 318 GACTACACGTAGTATGGCTCGCTGCCCGTTCG...CAAGGCACACAGGAGGTAGGAGNNNNNNNNNN [5] 405 GACTACACGTAGTATGGGCTCGCTGTAAAGTT...NNNNNNNNNNNNNNTNNTNNNNNNNNNNNNNN [6] 231 GACTACACGTAGTATGGGCTCGCTGTCCTACT...GTCGGCGTCTCTCAAGGCACACAGGGGATAGG [7] 236 GACTACACGTAGTATGGGCTCGCTGGTAATTC...GTCGGCGTCTCTCAAGGCACACAGGGGATAGG [8] 509 GACTACACGTAGTATGGGCTCGCTGGGCCACA...CTCAAGGCACACAGGAGTGAGNNNNNNNNNNN [9] 516 GACTACACGTAGTATGGGCTCGCTGCATGGCA...GTACGACGAGNGNGNGNGNGNGNNNNNNNNNN ... ... ... [4928] 637 GACTACACGTAGTATGGGCTCGCTGCATATAA...AAGTCGTAAAGTAGTAAAGGTTCGTNTAACGG [4929] 516 GACTACACGTAGTATGGGCTCGCTGCCAATTG...GATGGATGCCGAAAAGGGCGTAGGCTTTAAGG [4930] 525 GACTACACGTAGTATGGGCTCGCTGTAAATGT...TTTAGGTATTATNTTACATGTAATGCTAAAAC [4931] 533 GACTACACGTAGTATGGGCTCGCTGATGCACG...AGGTGGTTCGGTCGGTCGTCCTCGTTAACCGG [4932] 631 GACTACACGTAGTATGGGCTCGCTGGTTTTTC...CTACTACGTTATTACGTTTAGTCGTTTTAAGG [4933] 777 GACTACACGTAGTATGGGCTCGCTGTCATTTT...GTACGTTCCGTTTCGTAACGGTCTTTACGGGG [4934] 449 GACTACACGTAGTATGGGCTCGCTGCTGAACT...GGTCTCTCAAGGCACACAAGGGGCTAGGNNNN [4935] 557 GACTACACGTAGTATGGGCTCGCTGCCCTGGG...AACGAAACTTATAGAAAATTACTNAACTTAAG [4936] 621 GACTACACGTAGTATGGGGCTCGCTGATTATT...TTAACGGTAACCGTAGAACCTTTTTTAAAAGG
Accessor method for the underlying header ids, note only a truncated summary of the reads are printed not the entire set of NGS reads; id(rfq)
A BStringSet instance of length 4936 width seq [1] 14 GYSSWG201ANAAU [2] 14 GYSSWG201AKKAN [3] 14 GYSSWG201BC6S3 [4] 14 GYSSWG201BK297 [5] 14 GYSSWG201BF59S [6] 14 GYSSWG201APQ25 [7] 14 GYSSWG201ARCVR [8] 14 GYSSWG201B2QSX [9] 14 GYSSWG201BUQXE ... ... ... [4928] 14 GYSSWG201AO4TW [4929] 14 GYSSWG201BYFJ6 [4930] 14 GYSSWG201AH034 [4931] 14 GYSSWG201BKBW3 [4932] 14 GYSSWG201A3DQB [4933] 14 GYSSWG201A6O0R [4934] 14 GYSSWG201BUNO4 [4935] 14 GYSSWG201BTU4B [4936] 14 GYSSWG201ALL3I
Accessor method for the underlying quality scores, note only a truncated summary of the variable width scores are printed not the entire set of quality scores;
quality(rfq) class: FastqQuality quality:
A BStringSet instance of length 4936 width seq [1] 263 IIIIIIIIIIIIIIFC?DGIIIE;334<<III...?@??<75444457755557533-21..!/!!! [2] 161 IIIIIIIIIIIIIIIHHIIIIII;;99;;I?3...IIHHHHHIIIIIEEBBIHHHH!!!!!!!!!!! [3] 196 IIIIIIIIIIIIIIF777EIIIIIBB256EEI...IIIIIIIIIIIIIIIIIIIIIIIFEEFIIIBB [4] 318 IIIIIIIIIIIIIF@???EIIIIIIIIIIIII...111556631111..--..--.-!!!!!!!!!! [5] 405 IIIIEEEEGIEIFB;0//8>:8680,,-0300...!!!!!!!!!!!!!!,!!,!!!!!!!!!!!!!! [6] 231 IIIIIIIIIIIIIIH777IIIIIIIIIIIIII...IIIIIIIIIIIIIIIIIIFFFII>@==>DFFF [7] 236 IIIIEEEEIIIIIEA4449BIIIFDCBGGIII...7=>===<?@@@@@@@;;;>;;;<::::;;==@ [8] 509 IIIIEEEIIIIIIED444?GFFBB333GGIII...11111---++++---0000-+!!!!!!!!!!! [9] 516 IIIHBBBBEEBBB@@111>DAD@DDDDECCEC...++++++++++!+!+!+!+!+!+!!!!!!!!!! ... ... ... [4928] 637 IIIIEEBBBFB:::===CEB@D?FAAAB8......11111-++++++--0000-----0+!++++++ [4929] 516 IIIIEEFFEEF==86310?@>>>>>344555B...--010---+++++1130-++++--0001--++ [4930] 525 IIIIIIIIIIIIIII444IIIIEF989<<GEB...0-++++++++++!+++-001133555555779 [4931] 533 IIIIEEEEIIEEEBAAADBGEIEIIIHFFGEG...+++++++++++++++-++++++++++++++++ [4932] 631 IIIIEEEEIIEEEBB300<@>@:7---.5001...0-----0-------01-....-./33448899 [4933] 777 IIIGEFDC>>=A7777889:89877577----...+++++--++++++++-+++++++++++++-00 [4934] 449 IIIIEEEEIIIIEEEGGGIIIIIIIHDGGGC<...,,-,,,--2222.....265544,,,,,!!!! [4935] 557 IIIIIIEBEEDDD=90..8:69897,,,----...--+++++++++---+++++++++!+++++--1 [4936] 621 IIIIIIEEIIIIIIIBBBBG>==?@>88333/...---+++++++--1111++,,......111355
Running qa() on the ShortReadQ structure of NGS reads in the R object rfq;
qaSummary <- qa(rfq, lane="Roche 454")
This generic function summarizes results from evaluation of qa() into a report. Available report formats vary depending on the data analysed;
fname <- report(qaSummary, dest="/tmp") print(fname)
Load a given URL into a WWW browser, in this case an html file called index.html on the filesystem;
if(interactive()) browseURL(fname)
We can execute any published code snippet for any QC component in the html report. The code example is opied from the report but we had to replace 'qa' with the name of the R object created in (9) 'qaSummary';
df <- qaSummary[["readQualityScore"]] ShortRead:::.plotReadQuality(df[df$type=="read",])
This code snippet traverses through all the qaSummary object elements and prints a the header information of their underlying structure.
cat("[ qaSummary structure ]\n") for(i in names(qaSummary)) { if(i == "perTile") next; if(i == "perCycle") { for(j in names(qaSummary[[i]])) { cat("[", i, "][", j, "]\n") print(head(qaSummary[[i]]nowiki>[[j]])) cat("...\n\n") } } else { cat("[", i, "]\n") print(head(qaSummary<tt>[</tt>[i]])) cat("...\n\n") } }
The R function str() is a formal way to compactly display the structure of an arbitrary R object
For a set of Fastq fileNames, create an output directory, and run qa() on each file, and generate an html report;
for(f in fileNames) { # Truncate the ".fastq$" off names fdir <- gsub("\\..+?$", "", f) # Generate a qc object qaSummary <- qa(indir, pattern=f, lane=fdir, type="fastq") # Generate an html report fname <- report(qaSummary, dest=file.path(path, fdir)) # Be verbose cat("[", fdir , "=>", fname, "]\n") }