|
| 1 | +# AwesomeAssembler |
| 2 | + |
| 3 | +CI Status: |
| 4 | + |
| 5 | +master: |
| 6 | + |
| 7 | +dev: |
| 8 | + |
| 9 | +## Introduction |
| 10 | + |
| 11 | +AwesomeAssembler is an experimental long read assembler based on an idea called _hinging_. Now AwesomeAssembler is at research prototype stage. |
| 12 | + |
| 13 | +## Pipeline Overview |
| 14 | + |
| 15 | +AwesomeAssembler is an OLC(Overlap-Layout-Consensus) assembler. The idea of the pipeline is shown below. |
| 16 | + |
| 17 | + |
| 18 | + |
| 19 | +At a high level, the algorithm can be thought of a variation of the classical greedy algorithm. |
| 20 | +The main difference with the greedy algorithm is that rather than each read having a single successor, |
| 21 | +and a single predecessor, we allow a small subset of reads to have a higher number of successors/predecessors. |
| 22 | +This subset is identified by a process called _hinging_. This helps us to recover the graph structure |
| 23 | +directly during assembly. |
| 24 | + |
| 25 | +Another significant difference from HGAP or Falcon pipeline is that it does not have a pre-assembly or read correction step. |
| 26 | + |
| 27 | + |
| 28 | + |
| 29 | +## Algorithm Details |
| 30 | + |
| 31 | +### Reads filtering |
| 32 | +Reads filtering filters reads that have long chimer in the middle, and short reads. |
| 33 | +Reads which can have higher number of predecessors/successors are also identified there. |
| 34 | +This is implemented in `filter/filter.cpp` |
| 35 | + |
| 36 | +### Layout |
| 37 | +The layout is implemented in `layout/hinging.cpp`. It is done by a variant of the greedy algorithm. |
| 38 | + |
| 39 | +The graph output by the layout stage is post-processed by running `scripts/pruning_and_clipping.py`. |
| 40 | +One output is a graphml file which is the graph representation of the backbone. |
| 41 | +This removes dead ends and Z-structures from the graph enabling easy condensation. |
| 42 | +It can be analyzed and visualized, etc. |
| 43 | + |
| 44 | + |
| 45 | +## Parameters |
| 46 | + |
| 47 | +In the pipeline described above, most programs not only takes the input file and output file as arguments, but also require a configuration file in ini format. This consists parameters for each step in the pipeline, and their usage and effects are explained below: |
| 48 | + |
| 49 | + |
| 50 | +###[filter] |
| 51 | +- length_threshold = 6500; // Length threshold for reads to be considered in the backbone |
| 52 | +- quality_threshold = 0.23; // Quality threshold for edges to be considered in the backbone |
| 53 | +- n_iter = 2; // iterations of filtering, the filtering needs several iterations, because when filter reads, you got rid of some edges; when filter edges, you got rid of some reads (if the last edge is filtered.) Typically 2-3 iterations will be enough. |
| 54 | +- aln_threshold = 2500; // Length of alignment for edges to be considered in the backbone |
| 55 | +- min_cov = 5; // Minimal coverage for a segment to be considered not chimer/adaptor |
| 56 | +- cut_off = 200; // A parameter for identifying long chimer in the middle of a read |
| 57 | +- theta = 300; // A parameter for tolerance of the overhang length when looking for right extension. |
| 58 | + |
| 59 | + |
| 60 | +###[running] |
| 61 | +- n_proc = 12; // number of CPUs for layout step |
| 62 | + |
| 63 | +###[draft] |
| 64 | +- min_cov = 10; //obsolete |
| 65 | +- trim = 200; //obsolete |
| 66 | +- edge_safe = 100; //obsolete |
| 67 | +- tspace = 900; //space between new "trace points" |
| 68 | + |
| 69 | + |
| 70 | +###[consensus] |
| 71 | +- min_length = 2000; // Minimal length of reads used for final consensus |
| 72 | +- trim_end = 200; // Trim ends for alignments for final consensus |
| 73 | +- best_n = 1; // If one read has multiple alignments with the bacbone assembly, choose the longest n segments for consensus. |
| 74 | +- quality_threshold = 0.23; // alignment quality threshold |
| 75 | + |
| 76 | +# Installation |
| 77 | + |
| 78 | +This software is still at prototype stage so it is not well packaged, however it is designed in a modular flavor so different combinations of methods can be tested. |
| 79 | + |
| 80 | +Installing the software is very easy. |
| 81 | + |
| 82 | +``` |
| 83 | +git clone https://github.com/Eureka22/AwesomeAssembler.git |
| 84 | +git submodule init |
| 85 | +git submodule update |
| 86 | +./build.sh |
| 87 | +``` |
| 88 | + |
| 89 | +# Running |
| 90 | + |
| 91 | +In order to call the programs from anywhere, I suggest one export the directory of binary file to system environment, you can do that by using the script `setup.sh`. |
| 92 | + |
| 93 | +A demo run for assembling the ecoli genome is the following: |
| 94 | + |
| 95 | +``` |
| 96 | +source setup.sh |
| 97 | +mkdir data/ecoli |
| 98 | +cd data/ecoli |
| 99 | +# reads.fasta should be in data/ecoli |
| 100 | +fasta2DB ecoli reads.fasta |
| 101 | +DBsplit -x500 -s100 ecoli |
| 102 | +HPCdaligner -t5 ecoli | csh -v |
| 103 | +# alternatively, you can put output of HPCdaligner to a bash file and edit it to support |
| 104 | +rm ecoli.*.ecoli.* |
| 105 | +LAmerge ecoli.las ecoli.*.las |
| 106 | +rm ecoli.*.las # we only need ecoli.las |
| 107 | +DASqv -c100 ecoli ecoli.las |
| 108 | +
|
| 109 | +# Run filter |
| 110 | +Reads_filter --db ecoli --las ecoli.las -x ecoli --config /utils/nominal.ini |
| 111 | +
|
| 112 | +# Run layout |
| 113 | +hinging --db ecoli --las ecoli.las -x ecoli --config /utils/nominal.ini -o ecoli |
| 114 | +
|
| 115 | +# Run postprocessing |
| 116 | +
|
| 117 | +python pruning_and_clipping.py ecoli.edges.hinges ecoli.hinge.list <identifier-of-run> |
| 118 | +``` |
| 119 | + |
| 120 | +## Debugging |
| 121 | + |
| 122 | +### showing ground truth on graph |
| 123 | +Some programs are for debugging and oberservation. For example, one can get the ground truth by mapping reads to reference and get `ecoli.ecoli.ref.las`. |
| 124 | + |
| 125 | +This `las` file can be parsed to json file for other programs to use. |
| 126 | + |
| 127 | +``` |
| 128 | +run_mapping.py ecoli ecoli.ref ecoli.ecoli.ref.las 1-$ |
| 129 | +``` |
| 130 | + |
| 131 | +In the prune step, if `ecoli.mapping.json` exists, the output `graphml` file will contain the information of ground truth. |
| 132 | + |
| 133 | +### drawing alignment graphs and mapping graphs |
| 134 | +Draw a read, for example 60947, and output figure to `sample` folder (need plus 1 as LAshow counts from 1): |
| 135 | + |
| 136 | +``` |
| 137 | +draw2.py ecoli ecoli.las 60948 sample 100 |
| 138 | +``` |
| 139 | + |
| 140 | +Draw pileup on draft assembly, given a region(start,end): |
| 141 | + |
| 142 | +``` |
| 143 | +draw2_pileup_region.py 3600000 4500000 |
| 144 | +``` |
| 145 | + |
| 146 | +# Results: |
| 147 | + |
| 148 | +For ecoli 160X dataset, after shortening reads to have a mean length of 3500 (with a variance of 1500), the graph is preserved. |
| 149 | + |
| 150 | + |
| 151 | + |
| 152 | +The graph returned by Falcon here is |
| 153 | + |
| 154 | + |
0 commit comments