This version of CUT&RUNTools has been adapted to run on SGE and has enhanced parallelization.
Please see this google doc for relevant running instructions on the Stein server.
This package contains the pipeline for conducting a CUT&RUN analysis. The pipeline consists of read trimming, alignment, peak calling, motif finding, and finally motif footprinting.
To get started, please see INSTALL.md about how to set up the pipeline.
Once the package is installed, please see USAGE.md for usage details. Basic usage is provided below.
If you run into issues and would like to report them, you can use the "Issues" tab on the left hand side. Alternatively, you can contact me or one of the corresponding authors: zqian{at}jimmy.harvard.edu, stuart_orkin{at}dfci.harvard.edu, or gcyuan{at}jimmy.harvard.edu.
Read our paper. Sign up for the cutruntools Google Group mailing list to ask questions and receive updates.
CUT&RUNTools is released under GPL license version 2.
8/19/19: Added support for SEACR and for analyses involving larger fragments (>120bp). Changes are in Step 2.
When first starting an analysis, CUT&RUNTools requires a JSON configuration file (named config.json) to be written. This file specifies everything that is needed to run an analysis. A sample configuration file is shown below.
{
  "Rscriptbin": "/users/j/r/jrboyd/bin",
  "pythonbin": "/users/j/r/jrboyd/anaconda3/envs/py2_env/bin",
  "perlbin": "/users/j/r/jrboyd/anaconda3/bin",
  "javabin": "/usr/bin",
  "trimmomaticbin": "/users/j/r/jrboyd/tools/cutruntools_dep/Trimmomatic-0.36",
  "trimmomaticjarfile": "trimmomatic-0.36.jar",
  "bowtie2bin": "/users/j/r/jrboyd/tools/cutruntools_dep/bowtie2-2.2.9",
  "samtoolsbin": "/users/j/r/jrboyd/tools/cutruntools_dep/samtools-1.3.1",
  "adapterpath": "/users/j/r/jrboyd/tools/cutruntools/adapters",
  "picardbin": "/users/j/r/jrboyd/tools/cutruntools_dep/picard-2.8.0",
  "picardjarfile": "picard.jar",
  "macs2bin": "/users/j/r/jrboyd/anaconda3/envs/py2_env/bin",
  "macs2pythonlib": "/users/j/r/jrboyd/anaconda3/envs/py2_env/lib/python2.7/site-packages",
  "kseqbin": "/users/j/r/jrboyd/tools/cutruntools",
  "memebin": "/users/j/r/jrboyd/anaconda3/bin",
  "bedopsbin": "/users/j/r/jrboyd/tools/cutruntools_dep/bedops_v2.4.39/bin",
  "bedtoolsbin": "/users/j/r/jrboyd/bin/bedtools2-2.20.1/bin",
  "makecutmatrixbin": "/users/j/r/jrboyd/.local/bin",
  "bt2idx": "/users/j/r/jrboyd/tools/cutruntools/bowtie2_indexes",
  "genome_sequence": "/users/j/r/jrboyd/data/hg38/seq/hg38canon.fa",
  "extratoolsbin": "/users/j/r/jrboyd/tools/cutruntools",
  "extrasettings": "/users/j/r/jrboyd/tools/cutruntools",
  "input/output": {
    "fastq_directory": "/users/j/r/jrboyd/pipelines/cutruntools/input_fastq_bcell_113020",
    "workdir": "/users/j/r/jrboyd/pipelines/cutruntools/output_bcell_113020",
    "fastq_sequence_length": 85,
    "organism_build": "hg38"
  },
  "motif_finding": {
    "num_bp_from_summit": 150,
    "num_peaks": 1000,
    "total_peaks": 5000,
    "motif_scanning_pval": 0.0005,
    "num_motifs": 10
  },
  "cluster": {
    "email": "jrboyd@med.uvm.edu",
    "step_alignment": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    },
    "step_process_bam": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    },
    "step_motif_find": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    },
    "step_footprinting": {
      "queue": "bluemoon",
      "memory": 32000,
      "time_limit": "0-12:00"
    }
  }
}
The first block of entries (Rscriptbin through extrasettings) defines the software paths for the various tools. For each tool, provide the directory containing its binary executable (usually a bin directory such as /usr/bin or /usr/local/bin; the exact location depends on how the software is installed on your cluster). For more information, refer to INSTALL.md.
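As a quick sanity check (a sketch only; the paths below are taken from the sample configuration and will differ on your system), you can confirm that each configured directory actually contains the expected executable or jar file:
#each ls should succeed if the corresponding config entry is correct
ls /users/j/r/jrboyd/tools/cutruntools_dep/bowtie2-2.2.9/bowtie2
ls /users/j/r/jrboyd/tools/cutruntools_dep/samtools-1.3.1/samtools
ls /users/j/r/jrboyd/tools/cutruntools_dep/Trimmomatic-0.36/trimmomatic-0.36.jar
ls /usr/bin/java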
Configuring Python and Perl: the version of Python indicated in the pythonbin field should be the same Python that was used to install MACS2 and to install make_cut_matrix (makecutmatrixbin). Similarly for Perl (perlbin) and meme-chip (memebin): the Perl used to install MEME should be the one specified in the perlbin field.
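One way to verify this consistency (a minimal sketch using the paths from the sample configuration above; adjust to your own setup) is to check that MACS2 imports under the configured Python, that make_cut_matrix points at the same interpreter, and which Perl will be used:
#the configured Python should be able to import MACS2 without errors
/users/j/r/jrboyd/anaconda3/envs/py2_env/bin/python -c "import MACS2"
#make_cut_matrix is a Python script; its shebang line shows which interpreter it uses
head -n 1 /users/j/r/jrboyd/.local/bin/make_cut_matrix
#check the Perl that will be used for meme-chip
/users/j/r/jrboyd/anaconda3/bin/perl --version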
Next, check that the software prerequisites are met.
./validate.py config.json --ignore-input-output --software
Put the sample fastq files (with _R1_001.fastq.gz and _R2_001.fastq.gz suffixes) in a directory (in my case /n/scratch2/qz64/Nan_18_aug23/Nan_run_19).
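For example, a valid fastq directory might look like this (sample names taken from elsewhere in this README; only the paired _R1_001.fastq.gz / _R2_001.fastq.gz naming matters):
ls /n/scratch2/qz64/Nan_18_aug23/Nan_run_19
#GATA1_D7_30min_chr11_R1_001.fastq.gz  GATA1_D7_30min_chr11_R2_001.fastq.gz
#CR_BCL11A_W9_r1_S17_R1_001.fastq.gz   CR_BCL11A_W9_r1_S17_R2_001.fastq.gz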
Next, modify the following fields in the config.json file: adapterpath, bt2idx, and genome_sequence; in the input/output section: fastq_directory, workdir, fastq_sequence_length, and organism_build; in the cluster section: email.
- fastq_directory is the directory containing the paired-end CUT&RUN sequences (with _R1_001.fastq.gz and _R2_001.fastq.gz suffixes).
- workdir is the output directory, where results are stored.
- organism_build is one of the supported genome assemblies: hg38, hg19, mm10, or mm9.
- adapterpath contains the Illumina Truseq3-PE adapter sequences (we provide them).
- genome_sequence is the whole-genome masked sequence that matches the appropriate organism build. See INSTALL.md for details.
./validate.py config.json
This script checks that your configuration file and all of the paths in it are correct. If validate.py runs without errors, it produces no output.
./create_scripts.py config.json
This creates a set of Slurm job-submission scripts, based on the configuration file above (named config.json), in the workdir directory. The scripts can be executed directly.
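As a sketch of what to expect (the exact set of generated files may differ), the workdir should now contain the submission scripts referenced in the rest of this README:
ls /n/scratch2/qz64/workdir
#integrated.all.steps.sh  integrated.sh  ...
#(step-specific scripts such as integrated.step2.sh, integrate.motif.find.sh and
# integrate.footprinting.sh are run from the per-step subdirectories, as shown below)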
The previous step created a working directory and deposited a set of submission scripts. We can now use those scripts to start the analysis on a sample.
cd /n/scratch2/qz64/workdir
./integrated.all.steps.sh GATA1_D7_30min_chr11_R1_001.fastq.gz
Note the single parameter: the input file. There are two files per input sample (_R1_001.fastq.gz and _R2_001.fastq.gz). Use only the R1 fastq as input, and do not specify both files. The toolkit uses the filename to automatically locate the matching R2 file.
CUT&RUNTools will then run the analysis pipeline, which consists of 4 major steps (for details about the steps, see the step-by-step option below). By entering squeue -u <your username> you will see 5 scripts submitted to Slurm; they are executed sequentially, one after the other, via job dependencies.
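To monitor progress, standard Slurm commands work (these are not specific to CUT&RUNTools):
#list your pending and running jobs; the step jobs appear chained by dependencies
squeue -u $USER
#optionally keep watching until all steps finish
watch -n 60 'squeue -u $USER'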
Optionally, users can run the analysis steps individually to have more control over the steps of the pipeline.
Step 1. Read trimming and alignment. We suppose the workdir is defined as /n/scratch2/qz64/workdir.
cd /n/scratch2/qz64/workdir
sbatch ./integrated.sh CR_BCL11A_W9_r1_S17_R1_001.fastq.gz
The parameter is the fastq file. Even though we specify the _R1_001.fastq.gz file, CUT&RUNTools checks that both the forward and reverse fastq files are present. Always use the _R1_001 file of the pair as the parameter of this command.
Step 2. BAM processing and peak calling. This step marks duplicates in the BAM files and filters fragments by size, then performs peak calling using both MACS2 and SEACR. Based on the results, users can choose to proceed with either the MACS2 or the SEACR results.
cd aligned.aug10
sbatch ./integrated.step2.sh CR_BCL11A_W9_r1_S17_aligned_reads.bam
After this step, CUT&RUNTools has run peak calling under several different settings and generated results for each setting in the following directories. Users need to select only one of them to proceed to Steps 3 and 4.
| Result Directory (in ../) | Tool | Config | Fragments used | Use duplicates (y/n) |
|---|---|---|---|---|
| macs2.narrow.aug18 | MACS2 | narrowPeak | <120bp | y |
| macs2.broad.aug18 | MACS2 | broadPeak | <120bp | y |
| macs2.narrow.all.frag.aug18 | MACS2 | narrowPeak | all | y |
| macs2.broad.all.frag.aug18 | MACS2 | broadPeak | all | y |
| macs2.narrow.aug18.dedup | MACS2 | narrowPeak | <120bp | n |
| macs2.broad.aug18.dedup | MACS2 | broadPeak | <120bp | n |
| macs2.narrow.all.frag.aug18.dedup | MACS2 | narrowPeak | all | n |
| macs2.broad.all.frag.aug18.dedup | MACS2 | broadPeak | all | n |
| seacr.aug12 | SEACR | stringent | <120bp | y |
| seacr.aug12.all.frag | SEACR | stringent | all | y |
| seacr.aug12.dedup | SEACR | stringent | <120bp | n |
| seacr.aug12.all.frag.dedup | SEACR | stringent | all | n |
- Which directory to use: for TF CUT&RUN, I prefer macs2.narrow.aug18 or macs2.narrow.aug18.dedup. For histone CUT&RUN, use macs2.broad.all.frag.aug18. If using SEACR, use seacr.aug12.all.frag (histone) or seacr.aug12 (TF) and use the stringent peaks within each folder. A quick way to compare peak counts across these directories is sketched after this list.
- Large fragment (>120bp) peaks: use the peak calling directories that end in *.all.frag (e.g. macs2.broad.all.frag.aug18).
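To help choose, here is a quick, hedged way to compare how many peaks each setting produced; the peak file name patterns follow those listed in Step 3 below:
#run from the workdir; each peak calling directory sits alongside aligned.aug10
wc -l macs2.narrow*/*_peaks.narrowPeak
wc -l macs2.broad*/*_peaks.broadPeak
wc -l seacr.*/*.stringent.sort.bed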
Step 3. Motif finding. CUT&RUNTools uses MEME-chip for de novo motif finding on sequences surrounding the peak summits.
#Use macs2.narrow.aug18 or any of the peak calling result directories in the table above
cd ../macs2.narrow.aug18
#For narrow setting, the peak file ends in .narrowPeak
#For broad setting, the peak file ends in .broadPeak
#For seacr setting, the peak file ends in .stringent.sort.bed
#Use the right peak file accordingly
sbatch ./integrate.motif.find.sh CR_BCL11A_W9_r1_S17_aligned_reads_peaks.narrowPeak
A similar procedure applies in the other peak calling directories, whether for broadPeak, all-fragment, or SEACR results.
#SEACR
cd ../seacr.aug12.all.frag
sbatch ./integrate.motif.find.sh GATA1_D7_30min_chr11_aligned_reads_treat.stringent.sort.bed
Step 4. Motif footprinting.
cd ../macs2.narrow.aug18
#supports file type narrowPeak, broadPeak, or stringent.sort.bed (SEACR)
sbatch ./integrate.footprinting.sh CR_BCL11A_W9_r1_S17_aligned_reads_peaks.narrowPeak
Beautiful footprinting figures will be located in the fimo.result directory. Footprinting figures are created for every motif found by MEME-chip, but only the right motif (the one associated with the TF) will have a proper-looking shape. Users can scan through all of the motif footprints.
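For example, to list every footprint figure generated for a sample (a sketch; paths follow the output layout shown below):
#run from within the peak calling directory used in Step 4 (e.g. macs2.narrow.aug18)
ls fimo.result/*/fimo2.*/fimo.png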
Please see USAGE.md for more details about the output files.
Briefly, CUT&RUNTools will generate the following directory structure in the output folder.
+ macs2.narrow.aug18
+ random.10000
+ MEME_GATA1_HDP2_30min_S13_aligned_reads_shuf
++ summary.tsv
+ fimo.result
+ GATA1_HDP2_30min_S13_aligned_reads_peaks
+ fimo2.DREME-10.CCWATCAG
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-11.CTCCWCCC
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-12.CACGTG
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-13.CAKCTGB
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-14.CWGATA
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-1.HGATAA
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
+ fimo2.DREME-20.CCCTYCC
++ fimo.png
++ fimo.logratio.txt
++ fimo.bed
Files are denoted by "++", and directories are denoted by "+". Here the sample name is GATA1_HDP2_30min_S13. The summary.tsv file lists all the motifs found by the motif searching analysis. The files within each "fimo2..." folder contain the motif footprinting figure (.png), the motif sites for that motif (.bed), and the logratio binding scores at these sites (.logratio.txt).
CUT&RUNTools allows users to obtain a single nucleotide resolution cut profile for a region of interest.
For more details, please see USAGE.md.
CUT&RUNTools can run the motif scanning and motif footprinting step on a user-specified motif, such as a motif from the public JASPAR database. The motif should be in the MEME format.
The script that you will want to look at is generate.footprinting.factor.specific.centipede.py, which is in the directory macs2.narrow.aug18 or macs2.narrow.aug18.dedup.
To do this analysis:
pwd
/n/scratch2/qz64/workdir
cd macs2.narrow.aug18
./generate.footprinting.factor.specific.centipede.py -b MA.00001.agataa.meme -p 0.001 -n GATA1
Option -b specifies the MEME file. Option -p is the motif scanning p-value (0.0005 is recommended, but for this example we will use 0.001 since the GATA1 motif is quite short). Option -n is the name you give the motif.
The script will generate a custom script for the GATA1 motif, called integrate.footprinting.GATA1.centipede.sh. You can then run it on a narrowPeak file as follows:
sbatch ./integrate.footprinting.GATA1.centipede.sh GATA1_HDP2_30min_S13_aligned_reads_peaks.narrowPeak
The output will be in the fimo.GATA1.result folder, with the estimated cut frequency and the estimated log odds binding score.
Download a small CUT&RUN experiment from qzhudfci/datasets/src. This is GATA1, chr11 only. Follow along with the instructions in INSTALL.md and USAGE.md.