Skip to content

FrietzeLabUVM/cutruntools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This version of CUT&RUNTools has been adapted to run on SGE and has enhanced parallelization.

Please see this google doc for relevant running instructions on the Stein server.

CUT&RUNTools

This package contains the pipeline for conducting a CUT&RUN analysis. The pipeline comprises of read trimming, alignment steps, motif finding steps, and finally the motif footprinting step.

To get started, please see INSTALL.md about how to set up the pipeline.

Once the package is installed, please see USAGE.md about usage. Basic usage is provided below.

If you run into issues and would like to report them, you can use the "Issues" tab on the left hand side. Alternatively, you can contact me or one of corresponding authors: zqian{at}jimmy.harvard.edu, stuart_orkin{at}dfci.harvard.edu, or gcyuan{at}jimmy.harvard.edu.

Read our paper. Sign up for cutruntools Google group mailing list to ask questions, receive updates.

CUT&RUNTools is released under GPL license version 2.

What's New

8/19/19: Added support for SEACR, support for analysis involving larger fragments (>120bp). Changes are in step 2.

Basic Usage

When first starting an analysis, CUT&RUNTools requires a JSON configuration file (named config.json) to be written. This file specifies all that is needed to run an analysis. A sample configuration file is below.

{
    "Rscriptbin": "/users/j/r/jrboyd/bin",
    "pythonbin": "/users/j/r/jrboyd/anaconda3/envs/py2_env/bin",
    "perlbin": "/users/j/r/jrboyd/anaconda3/bin",
    "javabin": "/usr/bin",
    "trimmomaticbin": "/users/j/r/jrboyd/tools/cutruntools_dep/Trimmomatic-0.36",
    "trimmomaticjarfile": "trimmomatic-0.36.jar",
    "bowtie2bin": "/users/j/r/jrboyd/tools/cutruntools_dep/bowtie2-2.2.9",
    "samtoolsbin": "/users/j/r/jrboyd/tools/cutruntools_dep/samtools-1.3.1",
    "adapterpath": "/users/j/r/jrboyd/tools/cutruntools/adapters",
    "picardbin": "/users/j/r/jrboyd/tools/cutruntools_dep/picard-2.8.0",
    "picardjarfile": "picard.jar",
    "macs2bin": "/users/j/r/jrboyd/anaconda3/envs/py2_env/bin",
    "macs2pythonlib": "/users/j/r/jrboyd/anaconda3/envs/py2_env/lib/python2.7/site-packages",
    "kseqbin": "/users/j/r/jrboyd/tools/cutruntools",
    "memebin": "/users/j/r/jrboyd/anaconda3/bin",
    "bedopsbin": "/users/j/r/jrboyd/tools/cutruntools_dep/bedops_v2.4.39/bin",
    "bedtoolsbin": "/users/j/r/jrboyd/bin/bedtools2-2.20.1/bin",
    "makecutmatrixbin": "/users/j/r/jrboyd/.local/bin",
    "bt2idx": "/users/j/r/jrboyd/tools/cutruntools/bowtie2_indexes",
    "genome_sequence": "/users/j/r/jrboyd/data/hg38/seq/hg38canon.fa",
    "extratoolsbin": "/users/j/r/jrboyd/tools/cutruntools",
    "extrasettings": "/users/j/r/jrboyd/tools/cutruntools",
    "input/output": {
        "fastq_directory": "/users/j/r/jrboyd/pipelines/cutruntools/input_fastq_bcell_113020",
        "workdir": "/users/j/r/jrboyd/pipelines/cutruntools/output_bcell_113020",
        "fastq_sequence_length": 85,
        "organism_build": "hg38"
    },
    "motif_finding": {
        "num_bp_from_summit": 150,
        "num_peaks": 1000,
        "total_peaks": 5000,
        "motif_scanning_pval": 0.0005,
        "num_motifs": 10
    },
    "cluster": {
        "email": "jrboyd@med.uvm.edu",
        "step_alignment": {
            "queue": "bluemoon",
            "memory": 32000,
            "time_limit": "0-12:00"
        },
        "step_process_bam": {
            "queue": "bluemoon",
            "memory": 32000,
            "time_limit": "0-12:00"
        },
        "step_motif_find": {
            "queue": "bluemoon",
            "memory": 32000,
            "time_limit": "0-12:00"
        },
        "step_footprinting": {
            "queue": "bluemoon",
            "memory": 32000,
            "time_limit": "0-12:00"
        }
    }
}

The first 10-20 lines define the software paths for various sequencing tools. For each software, please note the directory to the binary executable (usually the bin directory, such as /usr/bin, /usr/local/bin, or sometimes it may depend on how softwares are installed on the cluster). For more information, refer to INSTALL.md.

Configuring Python and Perl: The version of Python that is indicated in pythonbin field should be consistent with the version of Python used to install MACS2, and version used to install make_cut_matrix (makecutmatrixbin). Similarly for Perl (perlbin) and meme-chip (memebin). The version of Perl used to install meme should be used for perlbin field.

Next, check that the software prerequisites are met.

./validate.py config.json --ignore-input-output --software

Start an analysis

Put sample fastq files (R1_001.fastq.gz, and R2_001.fastq.gz suffix) in a directory (in my case /n/scratch2/qz64/Nan_18_aug23/Nan_run_19).

Next, modify the following lines in the config.json file: adapterpath (line 8), bt2idx (line 17); genome_sequence (line 18); section input/output: fastq_directory (line 22), workdir (line 23), fastq_sequence_length (line 24), organism_build (line 25); section cluster: email (line 35).

  • fastq_directory is the directory containing paired-end CUT&RUN sequences (with _R1_001.fastq.gz and _R2_001.fastq.gz suffix).
  • workdir is the output directory, where results are stored.
  • organism_build is one of supported genome assemblies: hg38, hg19, mm10, and mm9.
  • adapterpath contains Illumina Truseq3-PE adapter sequences (we provide them).
  • genome_sequence is the whole-genome masked sequence which matches with the appropriate organism build. See INSTALL.md for details.

Create job submission scripts

./validate.py config.json

This script checks that your configuration file is correct and all paths are correct. You will get an empty line if the validate.py runs without errors.

./create_scripts.py config.json

This creates a set of Slurm job-submission scripts based on the configuration file above (named config.json) in the workdir directory. The scripts can be directly, easily executed.

Easy 1-step execution

The previous step created a working directory and deposited a set of submission scripts. We can now use those scripts to start the analysis on a sample.

cd /n/scratch2/qz64/workdir
./integrated.all.steps.sh GATA1_D7_30min_chr11_R1_001.fastq.gz

Note the single parameter the input file. There are supposed to be two files per input sample (R1_001.fastq.gz, R2_001.fastq.gz). Please just use the R1 fastq as input, and do not specify both files. The toolkit is smart enough to use the filename to automatically look for the R2 file.

What happens next is that CUT&RUNTools will sequentially run the analysis pipeline which consists of 4 major steps (for details about the steps, see option 2 below). By entering squeue -u you will see 5 scripts submitted to slurm, but they will be executed sequentially one after the other via a dependency.

Option 2: Manual four-step process

Optionally, users can run the analysis steps individually to have more control over the steps of the pipeline.

Step 1. Read trimming, alignment. We suppose the workdir is defined as /n/scratch2/qz64/workdir

cd /n/scratch2/qz64/workdir
sbatch ./integrated.sh CR_BCL11A_W9_r1_S17_R1_001.fastq.gz

The parameter is the fastq file. Even though we specify the _R1_001.fastq.gz, CUT&RUNTools actually checks that both forward and reverse fastq files are present. Always use the _R1_001 of the pair as parameter of this command.

Step 2. BAM processing, peak calling. It marks duplicates in bam files, and filter fragments by size. It then performs peak calling using both MACS2 and SEACR. Based on the results, users can choose to proceed with either MACS2 or SEACR's results.

cd aligned.aug10
sbatch ./integrated.step2.sh CR_BCL11A_W9_r1_S17_aligned_reads.bam

After this step, CUT&RUNTools has varied through different peak calling settings and generated multiple results for these settings in the following directories. Users need to select only one to go to steps 3 & 4.

Result Directory (in ../) Tool Config Fragments used Use duplicates (y/n)
macs2.narrow.aug18 MACS2 narrowPeak <120bp y
macs2.broad.aug18 MACS2 broadPeak <120bp y
macs2.narrow.all.frag.aug18 MACS2 narrowPeak all y
macs2.broad.all.frag.aug18 MACS2 broadPeak all y
macs2.narrow.aug18.dedup MACS2 narrowPeak <120bp n
macs2.broad.aug18.dedup MACS2 broadPeak <120bp n
macs2.narrow.all.frag.aug18.dedup MACS2 narrowPeak all n
macs2.broad.all.frag.aug18.dedup MACS2 broadPeak all n
seacr.aug12 SEACR stringent <120bp y
seacr.aug12.all.frag SEACR stringent all y
seacr.aug12.dedup SEACR stringent <120bp n
seacr.aug12.all.frag.dedup SEACR stringent all n
  • Which directory to use: if TF CUT&RUN, I prefer macs2.narrow.aug18 or macs2.narrow.aug18.dedup. If histone CUT&RUN, use macs2.broad.all.frag.aug18. If SEACR, use seacr.aug12.all.frag (histone) or seacr.aug12 (TF) and use the stringent peaks within each folder.

  • Large fragment (>120bp) peaks:

    • use the peak calling directories that end in *.all.frag (e.g. macs2.broad.all.frag.aug18).

Step 3. Motif finding. CUT&RUNTools uses MEME-chip for de novo motif finding on sequences surrounding the peak summits.

#Use macs2.narrow.aug18 or any of the peak calling result directory in the above table
cd ../macs2.narrow.aug18
#For narrow setting, the peak file ends in .narrowPeak
#For broad setting, the peak file ends in .broadPeak
#For seacr setting, the peak file ends in .stringent.sort.bed
#Use the right peak file accordingly
sbatch ./integrate.motif.find.sh CR_BCL11A_W9_r1_S17_aligned_reads_peaks.narrowPeak

Similar procedure applies in other peak calling directories for broadPeaks, all fragment results, or SEACR peaks.

#SEACR
cd ../seacr.aug12.all.frag
sbatch ./integrate.motif.find.sh GATA1_D7_30min_chr11_aligned_reads_treat.stringent.sort.bed

Step 4. Motif footprinting.

cd ../macs2.narrow.aug18
#supports file type narrowPeak, broadPeak, or stringent.sort.bed (SEACR)
sbatch ./integrate.footprinting.sh CR_BCL11A_W9_r1_S17_aligned_reads_peaks.narrowPeak

Beautiful footprinting figures will be located in the directory fimo.result. Footprinting figures are created for every motif found by MEME-chip, but only the right motif (associated with TF) will have a proper looking shape. Users can scan through all the motifs' footprints.

Outputs

Please see USAGE.md for more details for the output files.

Briefly, CUT&RUNTools will generate the following directory structure in the output folder.

+ macs2.narrow.aug18
  + random.10000
    + MEME_GATA1_HDP2_30min_S13_aligned_reads_shuf
      ++ summary.tsv
  + fimo.result
    + GATA1_HDP2_30min_S13_aligned_reads_peaks
      + fimo2.DREME-10.CCWATCAG
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed
      + fimo2.DREME-11.CTCCWCCC
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed
      + fimo2.DREME-12.CACGTG
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed
      + fimo2.DREME-13.CAKCTGB
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed
      + fimo2.DREME-14.CWGATA
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed
      + fimo2.DREME-1.HGATAA
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed
      + fimo2.DREME-20.CCCTYCC
        ++ fimo.png
        ++ fimo.logratio.txt
        ++ fimo.bed

Files are denoted by "++", and directories are denoted by "+". Here the sample name is called GATA1_HDP2_30min_S13. The summary.tsv lists all the motif found by motif searching analysis. The files within "fimo2..." folder shows the motif footprinting figure (.png), the motif sites for the motif (.bed), and the logratio binding scores at these sites (.logratio.txt).

Single locus cut profile

CUT&RUNTools allows users to obtain a single nucleotide resolution cut profile for a region of interest.

For more details, please see USAGE.md.

Footprinting for a user-specified motif

CUT&RUNTools can run the motif scanning and motif footprinting step on a user-specified motif, such as a motif from the public JASPAR database. The motif should be in the MEME format.

The script that you will want to look at is generate.footprinting.factor.specific.centipede.py that is in the directory macs2.narrow.aug18 or macs2.narrow.aug18.dedup.

To do this analysis:

pwd
/n/scratch2/qz64/workdir
cd macs2.narrow.aug18
./generate.footprinting.factor.specific.centipede.py -b MA.00001.agataa.meme -p 0.001 -n GATA1

Option -b specifies the MEME file. Option -p is the motif scanning p-value (recommended 0.0005, but for this example we will use 0.001 since GATA1 motif is quite short). Option -n is the name you give it.

The script will generate a custom script for GATA1 motif, called integrate.footprinting.GATA1.centipede.sh. With this script, then you can run it on a narrowPeak file as follows:

sbatch ./integrate.footprinting.GATA1.centipede.sh GATA1_HDP2_30min_S13_aligned_reads_peaks.narrowPeak

The output will be in the fimo.GATA1.result folder, with the cut frequency estimated, and log odds binding score estimated.

Want to try?

Download a small CUT&RUN experiment qzhudfci/datasets/src. This is GATA1 chr11 only. Follow along the instructions INSTALL.md, USAGE.md.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published