A collection of useful algorithms and scripts written for bioinformatics and genetics research.
- Python 3.7+
Programs that find optimal local or global alignments for nucleotide and amino acid sequences.
Needleman-Wunsch algorithm for aligning two nucleotide sequences.
global_align.py [fasta] -m [match] -s [mis] -d [indel]
Required Arguments
- fasta: FASTA file containing the sequences to align, with each header line starting with ">"
- -m / --match: Alignment score per match
- -s / --mismatch: Penalty per mismatch
- -d / --indel: Penalty per insertion or deletion
Optional Arguments
- -a: Output Alignment
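The scoring scheme above can be illustrated with a minimal Needleman-Wunsch sketch. This is not the code in global_align.py; the function name `needleman_wunsch` is hypothetical, and it assumes the mismatch and indel penalties are passed as positive integers that are subtracted from the score.

```python
def needleman_wunsch(a, b, match=1, mismatch=1, indel=1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment.

    Assumption: mismatch and indel are positive integer penalties.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * indel
    for j in range(1, m + 1):
        score[0][j] = -j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] - indel,    # gap in b
                score[i][j - 1] - indel,    # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else -mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] - indel:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

When several alignments share the optimal score, this traceback prefers diagonal moves, so it reports only one of them.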
Needleman-Wunsch algorithm for aligning two amino acid sequences.
Scored using BLOSUM62 scoring matrix.
global_align_aa.py [fasta] -d [indel]
Required Arguments
- fasta: FASTA file containing the sequences to align, with each header line starting with ">"
- -d / --indel: Penalty per insertion or deletion
Optional Arguments
- -a: Output Alignment
Smith-Waterman algorithm for aligning two nucleotide sequences.
local_align.py [fasta] -m [match] -s [mis] -d [indel]
Required Arguments
- fasta: FASTA file containing the sequences to align, with each header line starting with ">"
- -m / --match: Alignment score per match
- -s / --mismatch: Penalty per mismatch
- -d / --indel: Penalty per insertion or deletion
Optional Arguments
- -a: Output Alignment
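Local alignment differs from global alignment mainly in that a cell score never drops below zero and the best score may occur anywhere in the matrix. A minimal score-only sketch follows; the name `local_align_score` is hypothetical, and penalties are again assumed to be positive numbers that are subtracted.

```python
def local_align_score(a, b, match=1, mismatch=1, indel=1):
    """Return the best Smith-Waterman local alignment score of a and b.

    Assumption: mismatch and indel are positive penalties.
    Only two rows are kept in memory because no traceback is performed.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else -mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # extend diagonally
                prev[j] - indel,     # gap in b
                curr[j - 1] - indel, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_align_score("TTTACGTTT", "GGACGGG"))  # 3 (shared "ACG")
```

Recovering the aligned substrings, as the -a option does, would require keeping the full matrix and tracing back from the highest-scoring cell until a zero is reached.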
Smith-Waterman algorithm for aligning two amino acid sequences.
Scored using PAM250 scoring matrix.
local_align_aa.py [fasta] -d [indel]
Required Arguments
- fasta: FASTA file containing the sequences to align, with each header line starting with ">"
- -d / --indel: Penalty per insertion or deletion
Optional Arguments
- -a: Output Alignment
Programs for parsing sequence data or converting file types
Program for parsing sequence data from FASTA files
parse_fasta.py [fasta]
Required Arguments
- fasta: FASTA file containing sequence data, with each header line starting with ">"
Optional Arguments
- -m / --multi: File contains multiple sequences
- parse_fasta(fasta)
  - fasta (str): String denoting file name
  - return (str): Sequence data
- parse_multiseq(fasta)
  - fasta (str): String denoting file name
  - return (list): List containing sequence data
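For readers unfamiliar with the format, FASTA parsing can be sketched as below. This is an illustration only, not the implementation of parse_fasta or parse_multiseq; the helper name `read_fasta_records` is hypothetical, and it takes an iterable of lines (such as an open file) rather than a file name so it is easy to try out.

```python
def read_fasta_records(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines.

    Sequence lines that are wrapped across several lines are joined.
    Any text before the first ">" header is ignored.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [">seq1 demo\n", "ACGT\n", "TTAA\n", ">seq2\n", "GGCC\n"]
print(read_fasta_records(example))
# [('seq1 demo', 'ACGTTTAA'), ('seq2', 'GGCC')]
```

To read from disk, the same function can be called as `read_fasta_records(open(path))`, where `path` is whatever file name you would pass to the script.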
Program for parsing sequence data from FASTQ files, and converting FASTQ to FASTA
parse_fastq.py [fastq]
Required Arguments
- fastq: FASTQ file storing sequence data and quality scores.
Optional Arguments
- -f / --fasta: Output sequences in FASTA format
parse_fastq(fastq)
- fastq (str): String denoting file name
- return (dict): Dictionary containing sequence data, using read IDs as keys
  - Sequence data stored as a tuple containing (Sequence, Quality Scores)
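The dictionary layout described above, and the FASTQ to FASTA conversion, can be sketched as follows. The names `read_fastq_records` and `to_fasta` are hypothetical, and the sketch assumes the common four-line FASTQ layout (header, sequence, "+", quality) with no wrapped sequence lines and the read ID taken as the first word of the header.

```python
def read_fastq_records(lines):
    """Return {read_id: (sequence, quality)} from four-line FASTQ records."""
    lines = [line.rstrip("\r\n") for line in lines if line.strip()]
    records = {}
    for k in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[k:k + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near line %d" % (k + 1))
        read_id = header[1:].split()[0]  # assumption: ID is the first word
        records[read_id] = (seq, qual)
    return records


def to_fasta(records):
    """Render the parsed records as FASTA text, dropping the quality scores."""
    return "".join(
        ">{}\n{}\n".format(read_id, seq) for read_id, (seq, _qual) in records.items()
    )


example = ["@r1 lane1\n", "ACGT\n", "+\n", "IIII\n"]
print(to_fasta(read_fastq_records(example)), end="")
```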
Hidden Markov Model
Programs for predicting hidden states and probabilities in a sequence using Hidden Markov Models
Program that builds HMMs from a file, predicts optimal hidden paths with the Viterbi algorithm, and computes path and sequence probabilities
hmm.py [hmm_file] [parse_order] -p [path] -s [sequence] [action]
Required Arguments
- hmm_file: HMM file containing states, initial state probabilities, transition probabilities, emitted symbols, and emission probabilities, with sections separated by lines starting with "-"
- parse_order: Five-character string denoting the order in which the sections appear in the file, e.g. "qitse"
- q: States
- i: Initial state probabilities
- t: State transition probabilities
- s: Symbols emitted
- e: Emission probabilities
Input Data
- -p / --path: Hidden path HMM will follow
  - Required For: (-d / --dprob) and (-o / --oprob)
- -s / --seq: Sequence of symbols HMM will emit
  - Required For: (-v / --viterbi), (-e / --eprob) and (-o / --oprob)
Actions
- -v / --viterbi: Viterbi algorithm for finding optimal path given an emitted sequence
- -e / --eprob: Find probability an HMM outputs a given sequence
- -d / --dprob: Find probability an HMM follows a given hidden path
- -o / --oprob: Find probability an HMM outputs a given sequence following a given path
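As a rough illustration of the --viterbi action, the sketch below finds the most probable hidden path for an emitted sequence. It is not the code in hmm.py: the function name `viterbi` and the dictionary layout of the model are assumptions, and it works in log space under the assumption that every probability is non-zero.

```python
import math


def viterbi(seq, states, init, trans, emit):
    """Return (best_path, probability) for the emitted sequence seq.

    Assumed layout: init[state], trans[from_state][to_state],
    emit[state][symbol]; all probabilities must be greater than zero.
    """
    # Log-probability of the best path ending in each state after the first symbol.
    score = {s: math.log(init[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointer = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            new_score[s] = (
                score[prev] + math.log(trans[prev][s]) + math.log(emit[s][symbol])
            )
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)

    # Follow the back pointers from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return "".join(path), math.exp(score[last])


# Values taken from the example HMM file in this README.
states = ["A", "B"]
init = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.641, "B": 0.359}, "B": {"A": 0.729, "B": 0.271}}
emit = {
    "A": {"x": 0.117, "y": 0.691, "z": 0.192},
    "B": {"x": 0.097, "y": 0.42, "z": 0.483},
}
print(viterbi("yz", states, init, trans, emit)[0])  # AB
```

Joining the path with `"".join` assumes single-character state names, as in the example file.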
Example HMM File
Parse Order: "qitse"
A B
--------
A B
0.5 0.5
--------
A B
A 0.641 0.359
B 0.729 0.271
--------
x y z
--------
x y z
A 0.117 0.691 0.192
B 0.097 0.42 0.483
Input Format
- States: Possible Hidden States
  - Format: Single Line Tab-Delimited
- Initial State Probabilities: Probability of starting in each state
  - Format: Double Line Tab-Delimited
    - Top Row: States
    - Bottom Row: Probabilities
- Transition Probabilities: Probability of transitioning between states
  - Format: Tab-Delimited Matrix
    - Row: Current State
    - Column: Transition State
- Symbols: Possible Symbols Emitted
  - Format: Single Line Tab-Delimited
- Emission Probabilities: Probability of a state emitting a given symbol
  - Format: Tab-Delimited Matrix
    - Row: Current State
    - Column: Symbol Emitted
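One way to read a file in this layout is to split it on the separator lines and label each block using the parse order. The sketch below is an assumption about how such a reader could look, not the parser in hmm.py; `split_hmm_sections` and `to_matrix` are hypothetical helpers.

```python
def split_hmm_sections(text, parse_order="qitse"):
    """Split HMM file text into sections keyed by the parse-order letters.

    Each section is a list of rows, and each row is a list of fields.
    Lines starting with "-" are treated as section separators.
    """
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("-"):
            sections.append(current)
            current = []
        elif line.strip():
            current.append(line.split())  # splits on tabs or spaces
    sections.append(current)
    if len(sections) != len(parse_order):
        raise ValueError("expected %d sections, found %d"
                         % (len(parse_order), len(sections)))
    return dict(zip(parse_order, sections))


def to_matrix(rows):
    """Convert a header row plus labelled rows into {row: {column: float}}."""
    columns = rows[0]
    return {row[0]: dict(zip(columns, map(float, row[1:]))) for row in rows[1:]}


text = "A\tB\n----\nA\tB\n0.5\t0.5\n----\n\tA\tB\nA\t0.641\t0.359\nB\t0.729\t0.271\n"
text += "----\nx\ty\tz\n----\n\tx\ty\tz\nA\t0.117\t0.691\t0.192\nB\t0.097\t0.42\t0.483\n"
parts = split_hmm_sections(text, "qitse")
print(to_matrix(parts["t"])["B"]["A"])  # 0.729
```

Because the sections are labelled by position, reordering the blocks in the file only requires passing a different parse order string.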
Programs for querying reads and fragments against sequences and databases
Program that implements an Aho-Corasick Trie to rapidly query a set of reads against every position in a sequence database
aho-corasick.py -d [database] -q [query] -o [output]
Required Arguments
- -d / --database: Single or Multi-Line Text or FASTA file that contains our database sequence
- -q / --query: File containing one read per line to be queried against our database sequence
- -o / --output: Name for output files
- "output".tsv: Tab-separated file containing matched reads, start index, and end index
- "output"_stats.tsv: Tab-separated file containing Expected vs Actual matches in total and per read
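The idea behind the Aho-Corasick approach is to build a trie of all reads with failure links, so the database sequence is scanned once regardless of how many reads are queried. A compact sketch follows; it is not the code in aho-corasick.py, the function names are hypothetical, and the 0-based inclusive start and end indices are an assumption about the reporting convention.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)

    # Breadth-first pass: a node's failure link points to the longest proper
    # suffix of its string that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_matches(text, patterns):
    """Return (pattern, start, end) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            hits.append((pattern, pos - len(pattern) + 1, pos))
    return hits


print(sorted(find_matches("GATTACA", ["ATT", "TA", "A", "CAT"])))
# [('A', 1, 1), ('A', 4, 4), ('A', 6, 6), ('ATT', 1, 3), ('TA', 3, 4)]
```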
(WIP) Tools for RNA Seq Analysis
Programs that use scanpy to perform quality control on scRNA-Seq data
Programs that compute expression metrics such as RPKM and TPM
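Since these tools are still a work in progress, the following is only a sketch of the standard RPKM and TPM definitions, not code from this repository. The function names and the dictionary inputs (read counts and feature lengths in base pairs, keyed by gene) are assumptions, and the total read count is assumed to be non-zero.

```python
def rpkm(counts, lengths):
    """Reads Per Kilobase of transcript per Million mapped reads.

    counts: {gene: mapped read count}, lengths: {gene: length in bp}.
    """
    total_reads = sum(counts.values())
    return {
        gene: count * 1e9 / (lengths[gene] * total_reads)
        for gene, count in counts.items()
    }


def tpm(counts, lengths):
    """Transcripts Per Million: length-normalise first, then scale to 1e6."""
    rates = {gene: count / lengths[gene] for gene, count in counts.items()}
    scale = sum(rates.values())
    return {gene: rate * 1e6 / scale for gene, rate in rates.items()}


counts = {"geneA": 10, "geneB": 30}
lengths = {"geneA": 1000, "geneB": 3000}
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM values.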
Programs that manipulate or extract data from a sequence
Script that returns data or manipulates a nucleotide sequence
dna_map.py [sequence] [actions]
Required Arguments
- sequence: Sequence to run script on
Actions
- -l / --length: Output sequence length
- -n / --nuc: Output nucleotide counts
- -r / --rna: Convert DNA sequence to RNA
- -c / --comp: Output reverse-complementary strand
- -p / --protein: Convert nucleotide sequence to amino acids*
- *Requires the sequence length to be divisible by 3 so it can be split into codons
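The actions listed above map onto a few short string operations. The sketch below is illustrative rather than the code in dna_map.py; the function names are hypothetical, and translation assumes the standard genetic code with "*" marking stop codons.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nucleotide_counts(seq):
    """Count each nucleotide in the sequence."""
    return Counter(seq.upper())


def transcribe(seq):
    """Convert a DNA sequence to RNA by replacing T with U."""
    return seq.upper().replace("T", "U")


def reverse_complement(seq):
    """Return the reverse-complement strand of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate a nucleotide sequence into amino acids, codon by codon."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3:
        raise ValueError("sequence length must be divisible by 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


print(reverse_complement("ATGC"))  # GCAT
print(translate("ATGGCCTAA"))      # MA*
```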
Program that finds Open Reading Frames in a DNA sequence.
orf_finder.py -f [fasta] -m [min_size]
Required Arguments
- -f / --file: FASTA file containing DNA sequence
- -m / --minbp: Minimum base-pair length for ORFs
Optional Arguments
- -n / --nested: Include nested ORFs
Output
- Tab Separated Values containing Start Index, End Index, and Frame
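A simplified view of ORF detection is to scan each reading frame for a start codon followed in frame by a stop codon. The sketch below is not the code in orf_finder.py and makes several assumptions: it scans the forward strand only, uses ATG as the sole start codon, reports 0-based start and exclusive end indices with frames numbered 1 to 3, and interprets "nested" as additional in-frame start codons that share the same stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_bp=0, nested=False):
    """Return a list of (start, end, frame) tuples for forward-strand ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # start codons seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG":
                if nested or not starts:
                    starts.append(i)
            elif codon in STOP_CODONS:
                for start in starts:
                    if i + 3 - start >= min_bp:
                        orfs.append((start, i + 3, frame + 1))
                starts = []
    return orfs


for start, end, frame in find_orfs("ATGATGTAA", nested=True):
    print(start, end, frame, sep="\t")
```

Covering the reverse strand would amount to running the same scan on the reverse complement and mapping the indices back to the original coordinates.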
Program that locates ORFs and then predicts genes in a DNA sequence
gene_finder.py -f [fasta] -m [min_size]
Required Arguments
- -f / --file: FASTA file containing DNA sequence
- -m / --minbp: Minimum base-pair length for ORFs
Optional Arguments
- -n / --nested: Include nested Genes
Output
- Tab Separated Values containing Gene Label, Forward/Reverse Strand, Frame, Start Index, End Index, and Amino Acid Sequence