This version of CUT&RUNTools has been adapted to run on SGE and has enhanced parallelization.
Please see this google doc for relevant running instructions on the Stein server.
This package contains the pipeline for conducting a CUT&RUN analysis. The pipeline consists of read trimming, alignment, peak calling, motif finding, and finally motif footprinting.
To get started, please see INSTALL.md about how to set up the pipeline.
Once the package is installed, please see USAGE.md for usage details. Basic usage is provided below.
If you run into issues and would like to report them, you can use the "Issues" tab on the left hand side. Alternatively, you can contact me or one of the corresponding authors: zqian{at}jimmy.harvard.edu, stuart_orkin{at}dfci.harvard.edu, or gcyuan{at}jimmy.harvard.edu.
Read our paper. Sign up for the cutruntools Google Group mailing list to ask questions and receive updates.
CUT&RUNTools is released under GPL license version 2.
8/19/19: Added support for SEACR and for analyses involving larger fragments (>120bp). Changes are in Step 2.
When first starting an analysis, CUT&RUNTools requires a JSON configuration file (named config.json) to be written. This file specifies everything that is needed to run an analysis. A sample configuration file is shown below.
{
  "Rscriptbin": "/users/j/r/jrboyd/bin",
  "pythonbin": "/users/j/r/jrboyd/anaconda3/envs/py2_env/bin",
  "perlbin": "/users/j/r/jrboyd/anaconda3/bin",
  "javabin": "/usr/bin",
  "trimmomaticbin": "/users/j/r/jrboyd/tools/cutruntools_dep/Trimmomatic-0.36",
  "trimmomaticjarfile": "trimmomatic-0.36.jar",
  "bowtie2bin": "/users/j/r/jrboyd/tools/cutruntools_dep/bowtie2-2.2.9",
  "samtoolsbin": "/users/j/r/jrboyd/tools/cutruntools_dep/samtools-1.3.1",
  "adapterpath": "/users/j/r/jrboyd/tools/cutruntools/adapters",
  "picardbin": "/users/j/r/jrboyd/tools/cutruntools_dep/picard-2.8.0",
  "picardjarfile": "picard.jar",
  "macs2bin": "/users/j/r/jrboyd/anaconda3/envs/py2_env/bin",
  "macs2pythonlib": "/users/j/r/jrboyd/anaconda3/envs/py2_env/lib/python2.7/site-packages",
  "kseqbin": "/users/j/r/jrboyd/tools/cutruntools",
  "memebin": "/users/j/r/jrboyd/anaconda3/bin",
  "bedopsbin": "/users/j/r/jrboyd/tools/cutruntools_dep/bedops_v2.4.39/bin",
  "bedtoolsbin": "/users/j/r/jrboyd/bin/bedtools2-2.20.1/bin",
  "makecutmatrixbin": "/users/j/r/jrboyd/.local/bin",
  "bt2idx": "/users/j/r/jrboyd/tools/cutruntools/bowtie2_indexes",
  "genome_sequence": "/users/j/r/jrboyd/data/hg38/seq/hg38canon.fa",
  "extratoolsbin": "/users/j/r/jrboyd/tools/cutruntools",
  "extrasettings": "/users/j/r/jrboyd/tools/cutruntools",
  "input/output": {
    "fastq_directory": "/users/j/r/jrboyd/pipelines/cutruntools/input_fastq_bcell_113020",
    "workdir": "/users/j/r/jrboyd/pipelines/cutruntools/output_bcell_113020",
    "fastq_sequence_length": 85,
    "organism_build": "hg38"
  },
  "motif_finding": {
    "num_bp_from_summit": 150,
    "num_peaks": 1000,
    "total_peaks": 5000,
    "motif_scanning_pval": 0.0005,
    "num_motifs": 10
  },
  "cluster": {
    "email": "jrboyd@med.uvm.edu",
    "step_alignment": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    },
    "step_process_bam": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    },
    "step_motif_find": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    },
    "step_footprinting": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    }
  }
}
The first block of entries (Rscriptbin through extrasettings) defines the software paths for the various tools. For each tool, provide the directory containing its binary executable (usually a bin directory such as /usr/bin or /usr/local/bin; the exact location depends on how the software is installed on your cluster). For more information, refer to INSTALL.md.
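As a quick sanity check (a sketch only; the paths below are taken from the sample configuration and will differ on your system), you can confirm that each configured directory actually contains the expected executable or jar file:
#each ls should succeed if the corresponding config entry is correct
ls /users/j/r/jrboyd/tools/cutruntools_dep/bowtie2-2.2.9/bowtie2
ls /users/j/r/jrboyd/tools/cutruntools_dep/samtools-1.3.1/samtools
ls /users/j/r/jrboyd/tools/cutruntools_dep/Trimmomatic-0.36/trimmomatic-0.36.jar
ls /usr/bin/java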
Configuring Python and Perl: the version of Python indicated in the pythonbin field should be the same Python that was used to install MACS2 and to install make_cut_matrix (makecutmatrixbin). Similarly for Perl (perlbin) and meme-chip (memebin): the Perl used to install MEME should be the one specified in the perlbin field.
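One way to verify this consistency (a minimal sketch using the paths from the sample configuration above; adjust to your own setup) is to check that MACS2 imports under the configured Python, that make_cut_matrix points at the same interpreter, and which Perl will be used:
#the configured Python should be able to import MACS2 without errors
/users/j/r/jrboyd/anaconda3/envs/py2_env/bin/python -c "import MACS2"
#make_cut_matrix is a Python script; its shebang line shows which interpreter it uses
head -n 1 /users/j/r/jrboyd/.local/bin/make_cut_matrix
#check the Perl that will be used for meme-chip
/users/j/r/jrboyd/anaconda3/bin/perl --version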
Next, check that the software prerequisites are met.
./validate.py config.json --ignore-input-output --software
Put the sample fastq files (with _R1_001.fastq.gz and _R2_001.fastq.gz suffixes) in a directory (in my case /n/scratch2/qz64/Nan_18_aug23/Nan_run_19).
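For example, a valid fastq directory might look like this (sample names taken from elsewhere in this README; only the paired _R1_001.fastq.gz / _R2_001.fastq.gz naming matters):
ls /n/scratch2/qz64/Nan_18_aug23/Nan_run_19
#GATA1_D7_30min_chr11_R1_001.fastq.gz  GATA1_D7_30min_chr11_R2_001.fastq.gz
#CR_BCL11A_W9_r1_S17_R1_001.fastq.gz   CR_BCL11A_W9_r1_S17_R2_001.fastq.gz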
Next, modify the following fields in the config.json file: adapterpath, bt2idx, and genome_sequence; in the input/output section: fastq_directory, workdir, fastq_sequence_length, and organism_build; in the cluster section: email.
- fastq_directory is the directory containing the paired-end CUT&RUN sequences (with _R1_001.fastq.gz and _R2_001.fastq.gz suffixes).
- workdir is the output directory, where results are stored.
- organism_build is one of the supported genome assemblies: hg38, hg19, mm10, or mm9.
- adapterpath contains the Illumina Truseq3-PE adapter sequences (we provide them).
- genome_sequence is the whole-genome masked sequence that matches the appropriate organism build. See INSTALL.md for details.
./validate.py config.json
This script checks that your configuration file and all of the paths in it are correct. If validate.py runs without errors, it produces no output.
./create_scripts.py config.json
This creates a set of Slurm job-submission scripts, based on the configuration file above (named config.json), in the workdir directory. The scripts can be executed directly.
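As a sketch of what to expect (the exact set of generated files may differ), the workdir should now contain the submission scripts referenced in the rest of this README:
ls /n/scratch2/qz64/workdir
#integrated.all.steps.sh  integrated.sh  ...
#(step-specific scripts such as integrated.step2.sh, integrate.motif.find.sh and
# integrate.footprinting.sh are run from the per-step subdirectories, as shown below)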
The previous step created a working directory and deposited a set of submission scripts. We can now use those scripts to start the analysis on a sample.
cd /n/scratch2/qz64/workdir
./integrated.all.steps.sh GATA1_D7_30min_chr11_R1_001.fastq.gz
Note the single parameter: the input file. There are two files per input sample (_R1_001.fastq.gz and _R2_001.fastq.gz). Use only the R1 fastq as input, and do not specify both files. The toolkit uses the filename to automatically locate the matching R2 file.
CUT&RUNTools will then run the analysis pipeline, which consists of 4 major steps (for details about the steps, see the step-by-step option below). By entering squeue -u <your username> you will see 5 scripts submitted to Slurm; they are executed sequentially, one after the other, via job dependencies.
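To monitor progress, standard Slurm commands work (these are not specific to CUT&RUNTools):
#list your pending and running jobs; the step jobs appear chained by dependencies
squeue -u $USER
#optionally keep watching until all steps finish
watch -n 60 'squeue -u $USER'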
Optionally, users can run the analysis steps individually to have more control over the steps of the pipeline.
Step 1. Read trimming and alignment. We suppose the workdir is defined as /n/scratch2/qz64/workdir.
cd /n/scratch2/qz64/workdir
sbatch ./integrated.sh CR_BCL11A_W9_r1_S17_R1_001.fastq.gz
The parameter is the fastq file. Even though we specify the _R1_001.fastq.gz file, CUT&RUNTools checks that both the forward and reverse fastq files are present. Always use the _R1_001 file of the pair as the parameter of this command.
Step 2. BAM processing and peak calling. This step marks duplicates in the BAM files and filters fragments by size, then performs peak calling using both MACS2 and SEACR. Based on the results, users can choose to proceed with either the MACS2 or the SEACR results.
cd aligned.aug10
sbatch ./integrated.step2.sh CR_BCL11A_W9_r1_S17_aligned_reads.bam
After this step, CUT&RUNTools has run peak calling under several different settings and generated results for each setting in the following directories. Users need to select only one of them to proceed to Steps 3 and 4.
| Result Directory (in ../) | Tool | Config | Fragments used | Use duplicates (y/n) |
|---|---|---|---|---|
| macs2.narrow.aug18 | MACS2 | narrowPeak | <120bp | y |
| macs2.broad.aug18 | MACS2 | broadPeak | <120bp | y |
| macs2.narrow.all.frag.aug18 | MACS2 | narrowPeak | all | y |
| macs2.broad.all.frag.aug18 | MACS2 | broadPeak | all | y |
| macs2.narrow.aug18.dedup | MACS2 | narrowPeak | <120bp | n |
| macs2.broad.aug18.dedup | MACS2 | broadPeak | <120bp | n |
| macs2.narrow.all.frag.aug18.dedup | MACS2 | narrowPeak | all | n |
| macs2.broad.all.frag.aug18.dedup | MACS2 | broadPeak | all | n |
| seacr.aug12 | SEACR | stringent | <120bp | y |
| seacr.aug12.all.frag | SEACR | stringent | all | y |
| seacr.aug12.dedup | SEACR | stringent | <120bp | n |
| seacr.aug12.all.frag.dedup | SEACR | stringent | all | n |
- Which directory to use: for TF CUT&RUN, I prefer macs2.narrow.aug18 or macs2.narrow.aug18.dedup. For histone CUT&RUN, use macs2.broad.all.frag.aug18. If using SEACR, use seacr.aug12.all.frag (histone) or seacr.aug12 (TF) and use the stringent peaks within each folder. A quick way to compare peak counts across these directories is sketched after this list.
- Large fragment (>120bp) peaks: use the peak calling directories that end in *.all.frag (e.g. macs2.broad.all.frag.aug18).
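To help choose, here is a quick, hedged way to compare how many peaks each setting produced; the peak file name patterns follow those listed in Step 3 below:
#run from the workdir; each peak calling directory sits alongside aligned.aug10
wc -l macs2.narrow*/*_peaks.narrowPeak
wc -l macs2.broad*/*_peaks.broadPeak
wc -l seacr.*/*.stringent.sort.bed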
Step 3. Motif finding. CUT&RUNTools uses MEME-chip for de novo motif finding on sequences surrounding the peak summits.
#Use macs2.narrow.aug18 or any of the peak calling result directories in the table above
cd ../macs2.narrow.aug18
#For narrow setting, the peak file ends in .narrowPeak
#For broad setting, the peak file ends in .broadPeak
#For seacr setting, the peak file ends in .stringent.sort.bed
#Use the right peak file accordingly
sbatch ./integrate.motif.find.sh CR_BCL11A_W9_r1_S17_aligned_reads_peaks.narrowPeak
A similar procedure applies in the other peak calling directories, whether for broadPeak, all-fragment, or SEACR results.
#SEACR
cd ../seacr.aug12.all.frag
sbatch ./integrate.motif.find.sh GATA1_D7_30min_chr11_aligned_reads_treat.stringent.sort.bed
Step 4. Motif footprinting.
cd ../macs2.narrow.aug18
#supports file type narrowPeak, broadPeak, or stringent.sort.bed (SEACR)
sbatch ./integrate.footprinting.sh CR_BCL11A_W9_r1_S17_aligned_reads_peaks.narrowPeak
Beautiful footprinting figures will be located in the fimo.result directory. Footprinting figures are created for every motif found by MEME-chip, but only the right motif (the one associated with the TF) will have a proper-looking shape. Users can scan through all of the motif footprints.
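For example, to list every footprint figure generated for a sample (a sketch; paths follow the output layout shown below):
#run from within the peak calling directory used in Step 4 (e.g. macs2.narrow.aug18)
ls fimo.result/*/fimo2.*/fimo.png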
Please see USAGE.md for more details about the output files.
Briefly, CUT&RUNTools will generate the following directory structure in the output folder.
+ macs2.narrow.aug18
+ random.10000
+ MEME_GATA1_HDP2_30min_S13_aligned_reads_shuf
++ summary.tsv
+ fimo.result
+ GATA1_HDP2_30min_S13_aligned_reads_peaks
+ fimo2.DREME-10.CCWATCAG
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-11.CTCCWCCC
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-12.CACGTG
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-13.CAKCTGB
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-14.CWGATA
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-1.HGATAA
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-20.CCCTYCC
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
Files are denoted by "++", and directories are denoted by "+". Here the sample name is GATA1_HDP2_30min_S13. The summary.tsv file lists all the motifs found by the motif searching analysis. The files within each "fimo2..." folder contain the motif footprinting figure (.png), the motif sites for that motif (.bed), and the logratio binding scores at these sites (.logratio.txt).
CUT&RUNTools allows users to obtain a single nucleotide resolution cut profile for a region of interest.
For more details, please see USAGE.md.
CUT&RUNTools can run the motif scanning and motif footprinting step on a user-specified motif, such as a motif from the public JASPAR database. The motif should be in the MEME format.
The script that you will want to look at is generate.footprinting.factor.specific.centipede.py, which is in the directory macs2.narrow.aug18 or macs2.narrow.aug18.dedup.
To do this analysis:
pwd
/n/scratch2/qz64/workdir
cd macs2.narrow.aug18
./generate.footprinting.factor.specific.centipede.py -b MA.00001.agataa.meme -p 0.001 -n GATA1
Option -b specifies the MEME file. Option -p is the motif scanning p-value (0.0005 is recommended, but for this example we will use 0.001 since the GATA1 motif is quite short). Option -n is the name you give the motif.
The script will generate a custom script for the GATA1 motif, called integrate.footprinting.GATA1.centipede.sh. You can then run it on a narrowPeak file as follows:
sbatch ./integrate.footprinting.GATA1.centipede.sh GATA1_HDP2_30min_S13_aligned_reads_peaks.narrowPeak
The output will be in the fimo.GATA1.result folder, with the estimated cut frequency and the estimated log odds binding score.
Download a small CUT&RUN experiment from qzhudfci/datasets/src. This is GATA1, chr11 only. Follow along with the instructions in INSTALL.md and USAGE.md.