Download `SequenceBenchmark.zip` from our Zenodo and extract it into the top level of this project, such that the contained `Data` and `Results` folders sit on the same level as the `Code` and `Pipes` folders.
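For example (the exact download link can be found on the Zenodo record; depending on how the archive was packed, the extracted folders may need to be moved to the project root afterwards):

```bash
unzip SequenceBenchmark.zip
```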
Recreate our conda environment using the `environment.yml` file:

```bash
conda env create -f environment.yml
```
This creates a `sequencebenchmark` conda environment, which needs to be activated before running the predictions:

```bash
conda activate sequencebenchmark
```
Running a prediction task is as simple as supplying the `target_name` as a command-line argument to `snakemake` (see below for a list of available targets).
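For a quick local test this could look like the following (assuming the machine has sufficient resources; the target name is just one of the options listed further down):

```bash
snakemake --cores 4 segal_promoters_enformer
```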
However, you will most likely have to run this on a cluster, where things get a little more complicated, as each cluster is set up differently.
For reference, here is a small example script which could be used for a cluster running Slurm:
```bash
#!/bin/bash

# make sure the log directory exists before sbatch tries to write to it
mkdir -p logs

# job name used for Slurm and the log files
JOBNAME="sequence_expression_benchmark"

# define cluster command for snakemake; {threads} and {resources.*}
# are filled in per rule by snakemake
SBATCH_CMD="sbatch \
--nodes=1 \
--ntasks={resources.ntasks} \
--cpus-per-task={threads} \
--mem={resources.mem_mb}M \
--gres=gpu:{resources.gpu} \
--parsable \
--requeue \
--output=\"logs/$JOBNAME-%A.out\" \
--job-name=$JOBNAME"

# run snakemake with said cluster command, forwarding all script
# arguments (e.g. the target name) to snakemake
snakemake \
--keep-going \
--default-resources ntasks=1 mem_mb=1000 gpu=0 \
--cluster "${SBATCH_CMD}" \
--cores 64 \
--jobs 64 \
--latency-wait 180 \
"$@"
```
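Assuming the script above is saved as `run_pipeline.sh` (the file name is just an example), a target can then be submitted like this:

```bash
chmod +x run_pipeline.sh
./run_pipeline.sh segal_promoters_enformer
```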
Note that the number of tasks, CPUs/threads, RAM and GPUs are passed to the cluster command via `{resources.ntasks}`, `{threads}`, `{resources.mem_mb}` and `{resources.gpu}`.
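To illustrate, for a rule declaring `threads: 8` and resources `ntasks=1`, `mem_mb=32000`, `gpu=1` (values picked purely for illustration), snakemake would render the cluster command above roughly as:

```bash
sbatch --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=32000M --gres=gpu:1 \
    --parsable --requeue --output="logs/sequence_expression_benchmark-%A.out" \
    --job-name=sequence_expression_benchmark <generated job script>
```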
To find out the appropriate arguments needed to run this in your specific case, please refer to the Snakemake documentation and/or ask your sysadmin.
After the prediction task has finished, a `<dataset_name>-<model_name>-latest_results.tsv` link will be made in the `Results` folder. This file is then used in the analysis notebook.
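For example, to take a quick look at a finished prediction table (the exact file name depends on the dataset and model; this one is hypothetical):

```bash
head Results/segal_promoters-enformer-latest_results.tsv
```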
If you can't run the pipeline yourself but need the prediction file for a certain target, please contact us.
The `Snakefile` currently contains the following targets to generate predictions. Each target corresponds to a different dataset and, if available, a certain model (if no model is specified, Enformer is used); see the submission example after this list:
- `segal_promoters_<model>` for `<model>`: `enformer`, `basenji1` and `basenji2`
- `cohen_tripseq_<model>` for `<model>`: `enformer`, `basenji1` and `basenji2`
- `cohen_patchmpra`
- `findlay_brca`
- `bergmann_exp_<model>` for `<model>`: `enformer` and `basenji2`
- `bergmann_promoteronly`
- `bergmann_enhancercentered`
- `kircher_ingenome_<model>` for `<model>`: `enformer`, `basenji1` and `basenji2`
- `tss_sim_<model>` for `<model>`: `enformer`, `basenji1` and `basenji2`
- `fulco_crispri`
- `avsec_fulltable`
- `avsec_fulltable_fixed`
- `avsec_enhancercentered_<model>` for `<model>`: `enformer` and `basenji2`
- `segal_ism`
- `gtex_eqtl_at_tss_<model>` for `<model>`: `enformer` and `basenji2`
- `ful_gas_localeffects`
- `fulco_in_fulco`
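For example, one or several of these targets can be queued through the Slurm wrapper script shown above:

```bash
./run_pipeline.sh segal_promoters_basenji2 cohen_tripseq_enformer
```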
The repository is organized as follows:

- `Data`: data needed as input for generating our samples etc.
- `environment.yml`: file to reproduce the pipeline conda environment
- `Enformer_experiments.ipynb`: notebook containing all analyses from the main text
- `Enhancer_shift.ipynb`: notebook containing all analyses pertaining to the in-silico enhancer shift
- `GTEX_manual_match.ipynb`: notebook containing the analyses for Additional File 2
- `Track_file_prep.ipynb`: notebook used to generate track files
- `Pipes`: pipeline data
  - `Snakefile`: the file defining all pipeline steps
  - `config`: configuration files; contains only a single YAML file describing paths to genome files, prediction tracks used for a sample generator, and the number of jobs to split a dataset into
  - `scripts`: folder for tiny helper scripts
  - `pickles`: output directory; datasets are split and pickled into job files before the prediction, and those pickles are placed here
  - `predictions`: output directory; predictions generated from the job pickle splits are placed here
  - `result`: output directory; TSVs assembled from prediction splits are placed here
- `Results`: final directory into which results get copied
The pipeline output directories each contain subdirectories named after the corresponding dataset and the used model, e.g. `segal_promoters/basenji2/`.
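For instance, the outputs for one dataset/model combination could be inspected like this (names chosen as an example):

```bash
ls Pipes/predictions/segal_promoters/basenji2/
ls Pipes/result/segal_promoters/basenji2/
```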