Skip to content
This repository was archived by the owner on Jul 18, 2023. It is now read-only.

Commit e0cf175

Browse files
Create README.md
1 parent 44f706d commit e0cf175

File tree

1 file changed

+154
-0
lines changed

1 file changed

+154
-0
lines changed

README.md

+154
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,154 @@
1+
# AwesomeAssembler
2+
3+
CI Status:
4+
5+
master:![image](https://travis-ci.com/fxia22/HINGE.svg?token=i41xfGcHb72GYFyZnvtg&branch=master)
6+
7+
dev:![image](https://travis-ci.com/fxia22/HINGE.svg?token=i41xfGcHb72GYFyZnvtg&branch=dev)
8+
9+
## Introduction
10+
11+
AwesomeAssembler is an experimental long read assembler based on an idea called _hinging_. Now AwesomeAssembler is at research prototype stage.
12+
13+
## Pipeline Overview
14+
15+
AwesomeAssembler is an OLC(Overlap-Layout-Consensus) assembler. The idea of the pipeline is shown below.
16+
17+
![image](High_level_overview.png)
18+
19+
At a high level, the algorithm can be thought of a variation of the classical greedy algorithm.
20+
The main difference with the greedy algorithm is that rather than each read having a single successor,
21+
and a single predecessor, we allow a small subset of reads to have a higher number of successors/predecessors.
22+
This subset is identified by a process called _hinging_. This helps us to recover the graph structure
23+
directly during assembly.
24+
25+
Another significant difference from HGAP or Falcon pipeline is that it does not have a pre-assembly or read correction step.
26+
27+
28+
29+
## Algorithm Details
30+
31+
### Reads filtering
32+
Reads filtering filters reads that have long chimer in the middle, and short reads.
33+
Reads which can have higher number of predecessors/successors are also identified there.
34+
This is implemented in `filter/filter.cpp`
35+
36+
### Layout
37+
The layout is implemented in `layout/hinging.cpp`. It is done by a variant of the greedy algorithm.
38+
39+
The graph output by the layout stage is post-processed by running `scripts/pruning_and_clipping.py`.
40+
One output is a graphml file which is the graph representation of the backbone.
41+
This removes dead ends and Z-structures from the graph enabling easy condensation.
42+
It can be analyzed and visualized, etc.
43+
44+
45+
## Parameters
46+
47+
In the pipeline described above, most programs not only takes the input file and output file as arguments, but also require a configuration file in ini format. This consists parameters for each step in the pipeline, and their usage and effects are explained below:
48+
49+
50+
###[filter]
51+
- length_threshold = 6500; // Length threshold for reads to be considered in the backbone
52+
- quality_threshold = 0.23; // Quality threshold for edges to be considered in the backbone
53+
- n_iter = 2; // iterations of filtering, the filtering needs several iterations, because when filter reads, you got rid of some edges; when filter edges, you got rid of some reads (if the last edge is filtered.) Typically 2-3 iterations will be enough.
54+
- aln_threshold = 2500; // Length of alignment for edges to be considered in the backbone
55+
- min_cov = 5; // Minimal coverage for a segment to be considered not chimer/adaptor
56+
- cut_off = 200; // A parameter for identifying long chimer in the middle of a read
57+
- theta = 300; // A parameter for tolerance of the overhang length when looking for right extension.
58+
59+
60+
###[running]
61+
- n_proc = 12; // number of CPUs for layout step
62+
63+
###[draft]
64+
- min_cov = 10; //obsolete
65+
- trim = 200; //obsolete
66+
- edge_safe = 100; //obsolete
67+
- tspace = 900; //space between new "trace points"
68+
69+
70+
###[consensus]
71+
- min_length = 2000; // Minimal length of reads used for final consensus
72+
- trim_end = 200; // Trim ends for alignments for final consensus
73+
- best_n = 1; // If one read has multiple alignments with the bacbone assembly, choose the longest n segments for consensus.
74+
- quality_threshold = 0.23; // alignment quality threshold
75+
76+
# Installation
77+
78+
This software is still at prototype stage so it is not well packaged, however it is designed in a modular flavor so different combinations of methods can be tested.
79+
80+
Installing the software is very easy.
81+
82+
```
83+
git clone https://github.com/Eureka22/AwesomeAssembler.git
84+
git submodule init
85+
git submodule update
86+
./build.sh
87+
```
88+
89+
# Running
90+
91+
In order to call the programs from anywhere, I suggest one export the directory of binary file to system environment, you can do that by using the script `setup.sh`.
92+
93+
A demo run for assembling the ecoli genome is the following:
94+
95+
```
96+
source setup.sh
97+
mkdir data/ecoli
98+
cd data/ecoli
99+
# reads.fasta should be in data/ecoli
100+
fasta2DB ecoli reads.fasta
101+
DBsplit -x500 -s100 ecoli
102+
HPCdaligner -t5 ecoli | csh -v
103+
# alternatively, you can put output of HPCdaligner to a bash file and edit it to support
104+
rm ecoli.*.ecoli.*
105+
LAmerge ecoli.las ecoli.*.las
106+
rm ecoli.*.las # we only need ecoli.las
107+
DASqv -c100 ecoli ecoli.las
108+
109+
# Run filter
110+
Reads_filter --db ecoli --las ecoli.las -x ecoli --config /utils/nominal.ini
111+
112+
# Run layout
113+
hinging --db ecoli --las ecoli.las -x ecoli --config /utils/nominal.ini -o ecoli
114+
115+
# Run postprocessing
116+
117+
python pruning_and_clipping.py ecoli.edges.hinges ecoli.hinge.list <identifier-of-run>
118+
```
119+
120+
## Debugging
121+
122+
### showing ground truth on graph
123+
Some programs are for debugging and oberservation. For example, one can get the ground truth by mapping reads to reference and get `ecoli.ecoli.ref.las`.
124+
125+
This `las` file can be parsed to json file for other programs to use.
126+
127+
```
128+
run_mapping.py ecoli ecoli.ref ecoli.ecoli.ref.las 1-$
129+
```
130+
131+
In the prune step, if `ecoli.mapping.json` exists, the output `graphml` file will contain the information of ground truth.
132+
133+
### drawing alignment graphs and mapping graphs
134+
Draw a read, for example 60947, and output figure to `sample` folder (need plus 1 as LAshow counts from 1):
135+
136+
```
137+
draw2.py ecoli ecoli.las 60948 sample 100
138+
```
139+
140+
Draw pileup on draft assembly, given a region(start,end):
141+
142+
```
143+
draw2_pileup_region.py 3600000 4500000
144+
```
145+
146+
# Results:
147+
148+
For ecoli 160X dataset, after shortening reads to have a mean length of 3500 (with a variance of 1500), the graph is preserved.
149+
150+
![image](ecoli_shortened.png)
151+
152+
The graph returned by Falcon here is
153+
154+
![image](Falcon_ecoli_shortened.png)

0 commit comments

Comments
 (0)