Skip to content

A collection of useful algorithms and scripts for bioinformatics and genomics research.

Notifications You must be signed in to change notification settings

AaronJeeHo/bioinformatics-tools

Repository files navigation

bioinformatics-tools

A collection of useful algorithms and scripts written for bioinformatics and genetics research.

 

Requirements

  • Python 3.7 +

 

Table of Contents



Alignment

Programs that find optimal local or global alignments for nucleotide and amino acid sequences.

 

global_align.py

Needleman-Wunsch algorithm for aligning two nucleotide sequences.

Usage

global_align.py [fasta] -m [match] -s [mis] -d [indel]

Required Arguments

  • fasta: File containing sequence data for alignment, denoted by ">"
  • -m / --match: Alignment score per match
  • -s / --mismatch: Penalty per mismatch
  • -d / --indel: Penalty per insertion or deletion

Optional Arguments

  • -a: Output Alignment

 

global_align_aa.py

Needleman-Wunsch algorithm for aligning two amino acid sequences.
Scored using BLOSUM62 scoring matrix.

Usage

global_align_aa.py [fasta] -d [indel]

Required Arguments

  • fasta: File containing sequence data for alignment, denoted by ">"
  • -d / --indel: Penalty per insertion or deletion

Optional Arguments

  • -a: Output Alignment

 

local_align.py

Smith-Waterman algorithm for aligning two nucleotide sequences.

Usage

local_align.py [fasta] -m [match] -s [mis] -d [indel]

Required Arguments

  • fasta: File containing sequence data for alignment, denoted by ">"
  • -m / --match: Alignment score per match
  • -s / --mismatch: Penalty per mismatch
  • -d / --indel: Penalty per insertion or deletion

Optional Arguments

  • -a: Output Alignment

 

local_align_aa.py

Smith-Waterman algorithm for aligning two amino acid sequences.
Scored using PAM250 scoring matrix.

Usage

local_align_aa.py [fasta] -d [indel]

Required Arguments

  • fasta: File containing sequence data for alignment, denoted by ">"
  • -m / --match: Alignment score per match
  • -s / --mismatch: Penalty per mismatch
  • -d / --indel: Penalty per insertion or deletion

Optional Arguments

  • -a: Output Alignment



File Tools

Programs for parsing sequence data or converting file types

 

parse_fasta.py

Program for parsing sequence data from FASTA files

Usage

parse_fasta.py [fasta]

Required Arguments

  • fasta: File containing sequence data for alignment, denoted by ">"

Optional Arguments

  • -m / --multi: File contains multiple sequences

Callable Functions

  • parse_fasta(fasta)

    • fasta (str): String denoting file name
    • return (str): Sequence data
  • parse_multiseq(fasta)

    • fasta (str): String denoting file name
    • return (list): List containing sequence data

 

parse_fastq.py

Program for parsing sequence data from FASTQ files, and converting FASTQ to FASTA

Usage

parse_fastq.py [fastq]

Required Arguments

  • fastq: FASTQ file storing sequence data and quality scores.

Optional Arguments

  • -f / --fasta: Output sequences in FASTA format

Callable Functions

  • parse_fastq(fastq)
    • fastq (str): String denoting file name
    • return (dict): Dictionary containing sequence data using id's as keys
      • Sequence data denoted as tuple containing (Sequence, Quality Scores)



Hidden Markov Model

Programs for predicting hidden states and probabilities in a sequence using Hidden Markov Models

 

hmm.py

Program that uses Viterbi Algorithm to create HMM's and predict optimal hidden paths and probabilities

Usage

hmm.py [hmm_file] [parse_order] -p [path] -s [sequence] [action]

Required Arguments

  • hmm_file: HMM file containing states, initial state probabilities, transition probabilities, symbols emitted, and symbol emission probability separated by lines starting with "-"
  • parse_order: Five character string denoting order in which data is presented in file, eg. "qitse"
    • q: States
    • i: Initial state probabilities
    • t: State transition probabilities
    • s: Symbols emitted
    • e: Emission probabilities

Input Data

  • -p / --path: Hidden path HMM will follow

    • Required For: (-d / --dprob) and (-o / --oprob)
  • -s / --seq: Sequence of symbols HMM will emit

    • Required For: (-v / --viterbi), (-e / --eprob) and (-o / --oprob)

Actions

  • -v / --viterbi: Viterbi algorithm for finding optimal path given an emitted sequence
  • -e / --eprob: Find probability an HMM outputs a given sequence
  • -d / --dprob: Find probability an HMM outputs a given hidden path
  • -o / --oprob: Find probability an HMM outputs a given sequence following a given path

HMM File Format

Example HMM File

Parse Order: "qitse"

A   B
--------
A   B
0.5 0.5
--------
    A   B
A   0.641   0.359
B   0.729   0.271
--------
x   y   z
--------
    x   y   z
A   0.117   0.691   0.192   
B   0.097   0.42    0.483

Input Format

  • States: Possible Hidden States

    • Format: Single Line Tab-Delimited
  • Initial State Probabilities: Probability of starting in state

    • Format: Double Line Tab-Delimited.
      • Top Row: States
      • Bottom Row: Probabilities
  • Transition Probabilities: Probability of transitioning between states

    • Format: Tab-Delimited Matrix
      • Row: Current State
      • Column: Transition State
  • Symbols: Possible Symbols Emitted

    • Format: Single Line Tab-Delimited
  • Transition Probabilities: Probability of state emitting a given symbol

    • Format: Tab-Delimited Matrix
      • Row: Current State
      • Column: Symbol Emitted



Pattern Matching

Programs for querying reads and fragments against sequences and databases

 

aho-corasick.py

Program that implements an Aho-Corasick Trie to rapidly query a set of reads against every position in a sequence database

Usage

aho-corasick.py -d [database] -q [query] -o [output]

Required Arguments

  • -d / --database: Single or Multi-Line Text or FASTA file that contains our database sequence
  • -q / --query: File containing one read per line to be queried against our database sequence
  • -o / --output: Name for output files
    • "output".tsv: Tab-separated file containing matched reads, start index, and end index
    • "output"_stats.tsv: Tab-separated file containing Expected vs Actual matches in total and per read



RNA SeqTools

(WIP) Tools for RNA Seq Analysis

 

quality_control.py

Programs that use scanpy to perform various QC control on scRNA-Seq data

 

quantify_exp.py

Programs that perform expression metrics such as RPKM and TPM



Seq Tools

Programs that manipulate or extract data from a sequence

 

dna_map.py

Script that returns data or manipulates a nucleotide sequence

Usage

dna_map.py [sequence] [actions]

Required Arguments

  • sequence: Sequence to run script on

Actions

  • -l / --length: Output sequence length
  • -n / --nuc: Output nucleotide counts
  • -r / --rna: Convert DNA sequence to RNA
  • -c / --comp: Output reverse-complementary strand
  • -p / --protein: Convert nucleotide sequence to amino acids*
    • *Requires sequence to be able split into codons (length divisible by 3)

 

orf_finder.py

Program that finds Open Reading Frames in a DNA sequence.

Usage

orf_finder.py -f [fasta] -m [min_size]

Required Arguments

  • -f / --file: FASTA file containing DNA sequence
  • -m / --minbp: Minimum base-pair length for ORF's

Optional Arguments

  • -n / --nested: Include nested ORF's

Output

  • Tab Separated Values containing Start Index, End Index, and Frame

 

gene_finder.py

Program that locates ORF's then predicts Genes in a DNA sequence

Usage

gene_finder.py -f [fasta] -m [min_size]

Required Arguments

  • -f / --file: FASTA file containing DNA sequence
  • -m / --minbp: Minimum base-pair length for ORF's

Optional Arguments

  • -n / --nested: Include nested Genes

Output

  • Tab Separated Values containing Gene Label, Forward/Reverse Strand, Frame, Start Index, End Index, and Amino Acid Sequence

About

A collection of useful algorithms and scripts for bioinformatics and genomics research.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages