diff --git a/README.md b/README.md index 73e5bff..e60a405 100644 --- a/README.md +++ b/README.md @@ -1,360 +1,99 @@ # human_genomics_pipeline -A Snakemake workflow to process single samples or cohorts of paired-end sequencing data (WGS or WES) using [bwa](http://bio-bwa.sourceforge.net/) and [GATK4](https://gatk.broadinstitute.org/hc/en-us), taking the data from fastq to vcf. Quality control checks are also undertaken. The fastq files can be optionally trimmed and the pipeline can run on [NVIDIA GPU's](https://www.nvidia.com/en-gb/graphics-cards/) where [nvidia clara parabricks software is available](https://www.nvidia.com/en-us/docs/parabricks/quickstart-guide/software-overview/). This workflow is designed to follow the [GATK best practice workflow for germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-) and is designed to be followed by [vcf_annotation_pipeline](https://github.com/ESR-NZ/vcf_annotation_pipeline). This pipeline has been developed with human genetic data in mind, however we designed it to be species agnostic. Genetic data from other species can in theory can be analysed by setting a species-specific reference genome and variant databases in the configuration file. +A Snakemake workflow to process single samples (unrelated individuals) or cohort samples (related individuals) of paired-end sequencing data (WGS or WES) using [bwa](http://bio-bwa.sourceforge.net/) and [GATK4](https://gatk.broadinstitute.org/hc/en-us). Quality control checks are also undertaken. The fastq files can be optionally trimmed with [Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) and the pipeline can run on [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) where [nvidia clara parabricks software is available](https://www.nvidia.com/en-us/docs/parabricks/quickstart-guide/software-overview/) for *significant* speedups in analysis times. This workflow is designed to follow the [GATK best practice workflow for germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-). This pipeline is designed to be followed by [vcf_annotation_pipeline](https://github.com/ESR-NZ/vcf_annotation_pipeline) and the data ingested into [scout](https://github.com/Clinical-Genomics/scout) for clinical interpretation. However, this pipeline also stands on its own, taking the data from fastq to vcf (raw sequencing data to called variants). This pipeline has been developed with human genetic data in mind; however, we designed it to be species agnostic. Genetic data from other species can be analysed by setting a species-specific reference genome and variant databases in the configuration file (but not all situations have been tested).
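For example, running the pipeline on another species would (in principle) only involve pointing the configuration file at species-specific resources. A minimal sketch using the `REFGENOME` and `dbSNP` keys from `config/config.yaml` shown later in this diff; the file paths here are hypothetical:

```yaml
# Hypothetical non-human configuration (illustrative paths only)
REFGENOME: "/publicData/canFam3/canFam3.fasta"            # species-specific reference genome
dbSNP: "/publicData/canFam3/canFam3_known_variants.vcf"   # species-specific known-variant database
```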
- [human_genomics_pipeline](#human_genomics_pipeline) - - [Workflow diagram - single samples](#workflow-diagram---single-samples) - - [Workflow diagram - single samples - GPU accelerated](#workflow-diagram---single-samples---gpu-accelerated) - - [Workflow diagram - cohort samples](#workflow-diagram---cohort-samples) - - [Workflow diagram - cohort samples - GPU accelerated](#workflow-diagram---cohort-samples---gpu-accelerated) + - [Pipeline summary - single samples](#pipeline-summary---single-samples) + - [Pipeline summary - single samples - GPU accelerated](#pipeline-summary---single-samples---gpu-accelerated) + - [Pipeline summary - cohort samples](#pipeline-summary---cohort-samples) + - [Pipeline summary - cohort samples - GPU accelerated](#pipeline-summary---cohort-samples---gpu-accelerated) + - [Main output files](#main-output-files) + - [Prerequisites](#prerequisites) + - [Test human_genomics_pipeline](#test-human_genomics_pipeline) - [Run human_genomics_pipeline](#run-human_genomics_pipeline) - - [1. Fork the pipeline repo to a personal or lab account](#1-fork-the-pipeline-repo-to-a-personal-or-lab-account) - - [2. Take the pipeline to the data on your local machine](#2-take-the-pipeline-to-the-data-on-your-local-machine) - - [3. Create a local copy of the GATK resource bundle (either b37 or hg38)](#3-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38) - - [b37](#b37) - - [hg38](#hg38) - - [4. Modify the configuration file](#4-modify-the-configuration-file) - - [Overall workflow](#overall-workflow) - - [Pipeline resources](#pipeline-resources) - - [Trimming](#trimming) - - [Base recalibration](#base-recalibration) - - [5. Configure to run on a HPC (optional)](#5-configure-to-run-on-a-hpc-optional) - - [6. Modify the run scripts](#6-modify-the-run-scripts) - - [HPC](#hpc) - - [7. Create and activate a conda environment with python and snakemake installed](#7-create-and-activate-a-conda-environment-with-python-and-snakemake-installed) - - [8. Run the pipeline](#8-run-the-pipeline) - - [HPC](#hpc-1) - - [9. Evaluate the pipeline run](#9-evaluate-the-pipeline-run) - - [10. Commit and push to your forked version of the github repo](#10-commit-and-push-to-your-forked-version-of-the-github-repo) - - [11. Repeat step 10 each time you re-run the analysis with different parameters](#11-repeat-step-10-each-time-you-re-run-the-analysis-with-different-parameters) - - [12. Create a pull request with the upstream repo to merge any useful changes (optional)](#12-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-optional) + - [Contribute back!](#contribute-back) -## Workflow diagram - single samples +## Pipeline summary - single samples - - -## Workflow diagram - single samples - GPU accelerated - - - -## Workflow diagram - cohort samples - - - -## Workflow diagram - cohort samples - GPU accelerated - - - -## Run human_genomics_pipeline - -- **Prerequisite hardware:** [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) (for GPU accelerated runs) -- **Prerequisite software:** [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/) (for GPU accelerated runs), [Git 2.7.4](https://git-scm.com/), [Mamba 0.4.4](https://github.com/TheSnakePit/mamba) with [Conda 4.8.2](https://docs.conda.io/projects/conda/en/latest/index.html), [gsutil 4.52](https://pypi.org/project/gsutil/), [gunzip 1.6](https://linux.die.net/man/1/gunzip) -- **OS:** Validated on Ubuntu 16.04 - -### 1. 
Fork the pipeline repo to a personal or lab account - -See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) for help - -### 2. Take the pipeline to the data on your local machine - -Clone the forked [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) repo into the same directory as your paired end fastq data to be processed. Required folder structure and file naming convention: - -```bash - -. -|___fastq/ -| |___sample1_1.fastq.gz -| |___sample1_2.fastq.gz -| |___sample2_1.fastq.gz -| |___sample2_2.fastq.gz -| |___ ... -| -|___human_genomics_pipeline/ - -``` - -If you are analysing cohort's of samples, you will need an additional directory with a [pedigree file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format) for each cohort/family: - -```bash - -. -|___fastq/ -| |___sample1_1.fastq.gz -| |___sample1_2.fastq.gz -| |___sample2_1.fastq.gz -| |___sample2_2.fastq.gz -| |___ ... -| -|___pedigrees/ -| |___proband1_pedigree.ped -| |___proband2_pedigree.ped -| |___ ... -| -|___human_genomics_pipeline/ - -``` - -Requirements: - - Input paired end fastq files need to identified with `_1` and `_2` (not `_R1` and `_R2`) - - Currently, the filenames of the pedigree files need to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement) - - Singletons and cohorts need to be run in separate pipeline runs - -Assumptions: - - There is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of the pedigree file) - -See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#keep-your-fork-synced) for help - -### 3. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38) - -#### b37 - -Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?prefix=) - -```bash -gsutil cp -r gs://gatk-legacy-bundles/b37/ /where/to/download/ -``` - -#### hg38 - -Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0) - -```bash -gsutil cp -r gs://genomics-public-data/resources/broad/hg38/ /where/to/download/ -``` - -### 4. Modify the configuration file - -Edit 'config.yaml' found within the config directory - -#### Overall workflow - -Specify whether the data is to be analysed on it's own ('Single') or as a part of a cohort of samples ('Cohort'). For example: - -```yaml -DATA: "Single" -``` - -Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS](https://www.nvidia.com/en-us/docs/parabricks/local-installation/)) - -```yaml -GPU_ACCELERATED: "No" -``` - -Set the the working directories to the reference human genome file (b37 or hg38). For example: - -```yaml -REFGENOME: "/home/lkemp/publicData/b37/human_g1k_v37_decoy.fasta" -``` +1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/)) +2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*) +3. 
Alignment against reference genome ([Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/)) +4. Mark duplicates ([GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs)) +5. Base recalibration ([GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator) and [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR)) +6. Haplotype calling ([GATK HaplotypeCalller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)) -Set the the working directory to your dbSNP database file (b37 or hg38). For example: - -```yaml -dbSNP: "/home/lkemp/publicData/b37/dbsnp_138.b37.vcf" -``` - -Set the the working directory to a temporary file directory. For example: - -```yaml -TEMPDIR: "/home/lkemp/tmp/" -``` - -If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced to the `-L` flag (see [here](https://leahkemp.github.io/documentation/human_genomic_pipelines/design_files.html) for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs to the `-ip` flag. For example: - -*If NOT analysing WES data, leave these fields blank* - -```yaml -WES: - # File path to the exome capture regions over which to operate (prefix with the '-L' flag) - INTERVALS: "-L /home/lkemp/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed" - # Padding (in bp) to add to each region (prefix with the '-ip' flag) - PADDING: "-ip 100" -``` - -#### Pipeline resources - -These settings allow you to configure the resources per rule/sample - -Set the number of threads to use per sample/rule for multithreaded rules (`rule trim_galore_pe` and `rule bwa_mem`). Multithreading will significantly speed up these rules, however the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the `--set-threads` flag in the runscript (see step 6). - -```yaml -THREADS: 8 -``` - -Set the maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes) - -```yaml -MAXMEMORY: "40g" -``` - -Set the maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU) - -```yaml -GPU: 1 -``` - -It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of cores to be used by the pipeline in the run script to `-j 32` (see step 6), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`. - -#### Trimming - -Specify whether the raw fastq reads should be trimmed (either 'Yes' or 'No'). For example: - -```yaml -TRIM: "Yes" -``` - -If trimming the raw fastq reads, set the [trim galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) adapter trimming parameters. Choose one of the common adapters such as Illumina universal, Nextera transposase or Illumina small RNA with `--illumina`, `--nextera` or `--small_rna`. Alternatively, pass adapter sequences to the `-a` and `-a2` flags. 
If not set, trim galore will try to auto-detect the adapter based on the fastq reads. - -```yaml -TRIMMING: - ADAPTERS: "--illumina" -``` - -#### Base recalibration - -Pass the resources to be used to recalibrate bases with [gatk BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360047217531-BaseRecalibrator) to the `--known-sites` flag if not gpu accelerated and to the `--knownSites` if gpu accelerated. For example: - -```yaml -RECALIBRATION: - RESOURCES: "--known-sites /home/lkemp/publicData/b37/dbsnp_138.b37.vcf - --known-sites /home/lkemp/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf - --known-sites /home/lkemp/publicData/b37/1000G_phase1.indels.b37.vcf" -``` - -### 5. Configure to run on a HPC (optional) - -*This will deploy the non-GPU accelerated rules to slurm and deploy the GPU accelerated rules locally (pbrun_fq2bam, pbrun_haplotypecaller_single, pbrun_haplotypecaller_cohort). Therefore, if running the pipeline gpu accelerated, the pipeline should be deployed from the machine with the GPU's.* - -In theory, this cluster configuration should be adaptable to other job scheduler systems, but here I will provide a simple example for deploying this pipeline to [slurm](https://slurm.schedmd.com/). - -Configure `account:` and `partition:` in the default section of 'cluster.json' in order to set the parameters for slurm sbatch (see documentation [here](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated) and [here](https://slurm.schedmd.com/)). For example: - -```json -{ - "__default__" : - { - "account" : "lkemp", - "partition" : "prod" - } -} -``` -There are a plethora of additional slurm parameters that can be configured (and can be configured per rule). If you set additional slurm parameters, remember to pass them to the `--cluster` flag in the runscripts. See [here](https://snakemake-on-nesi.sschmeier.com/snake.html#slurm-and-nesi-specific-setup) and [here](https://hpc-carpentry.github.io/hpc-python/17-cluster/) for good working examples. - -### 6. Modify the run scripts - -Set the number maximum number of cores to be used with the `--cores` flag and the maximum amount of memory to be used (in megabytes) with the `resources mem_mb=` flag. If running GPU accelerated, also set the maximum number of GPU's to be used with the `--resources gpu=` flag. For example: - -Dry run (dryrun.sh): - -```bash -snakemake \ ---dryrun \ ---cores 32 \ ---resources mem_mb=150000 \ ---resources gpu=2 \ ---use-conda \ ---conda-frontend mamba \ ---configfile ../config/config.yaml -``` - -Full run (run.sh): - -```bash -snakemake \ ---cores 32 \ ---resources mem_mb=150000 \ ---resources gpu=2 \ ---use-conda \ ---configfile ../config/config.yaml -``` - -See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html#all-options) for additional run parameters. - -#### HPC - -If you want to run the pipeline on a HPC, set the `--cores`, `resources mem_mb=`, and `--resources gpu=` flags in dryrun_hpc.sh and run_hpc.sh run scripts instead. For example: + -Dry run (dryrun_hpc.sh): +## Pipeline summary - single samples - GPU accelerated -```bash -snakemake \ ---dryrun \ ---cores 32 \ ---resources mem_mb=150000 \ ---resources gpu=2 \ ---use-conda \ ---conda-frontend mamba \ ---configfile ../config/config.yaml \ ---cluster-config ../config/cluster.json \ ---cluster "sbatch -A {cluster.account} \ --p {cluster.partition}" -``` +1. 
Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/)) +2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*) +3. Alignment against reference genome, mark duplicates, base recalibration and haplotype calling ([parabricks germline pipeline](https://docs.nvidia.com/clara/parabricks/v3.6.1/text/germline_pipeline.html)) + - *Equivalent to [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/), [GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs), [GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator), [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR) and [GATK HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)* -Full run (run_hpc.sh): + -```bash -snakemake \ ---cores 32 \ ---resources mem_mb=150000 \ ---resources gpu=2 \ ---use-conda \ ---conda-frontend mamba \ ---configfile ../config/config.yaml \ ---cluster-config ../config/cluster.json \ ---cluster "sbatch -A {cluster.account} \ --p {cluster.partition}" -``` +## Pipeline summary - cohort samples -### 7. Create and activate a conda environment with python and snakemake installed +1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/)) +2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*) +3. Alignment against reference genome ([Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/)) +4. Mark duplicates ([GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs)) +5. Base recalibration ([GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator) and [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR)) +6. Haplotype calling ([GATK HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)) +7. Combine GVCF into multi-sample GVCF ([GATK CombineGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360037593911-CombineGVCFs)) +8. Genotyping ([GATK GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs)) -```bash -cd ./human_genomics_pipeline/workflow/ -mamba env create -f pipeline_run_env.yml -conda activate pipeline_run_env -``` + -### 8. Run the pipeline +## Pipeline summary - cohort samples - GPU accelerated -First carry out a dry run +1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/)) +2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*) +3.
Alignment against reference genome, mark duplicates, base recalibration and haplotype calling ([parabricks germline pipeline](https://docs.nvidia.com/clara/parabricks/v3.6.1/text/germline_pipeline.html)) + - *Equivalent to [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/), [GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs), [GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator), [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR) and [GATK HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)* +4. Combine GVCF into multi-sample GVCF ([parabricks trio combine gvcf](https://docs.nvidia.com/clara/parabricks/v3.6/text/joint_calling.html#trio-combine-gvcf)) + - *Equivalent to [GATK CombineGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360037593911-CombineGVCFs)* +5. Genotyping ([GATK GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs)) -```bash -bash dryrun.sh -``` + -If there are no issues, start a full run +## Main output files -```bash -bash run.sh -``` +Single samples: -#### HPC +- `results/qc/multiqc_report.html` +- `results/mapped/sample1_recalibrated.bam` +- `results/called/sample1_raw_snps_indels.vcf` -If running on a HPC, run the HPC run scripts instead +Cohort samples: -```bash -bash dryrun_hpc.sh -bash run_hpc.sh -``` +- `results/qc/multiqc_report.html` +- `results/mapped/sample1_recalibrated.bam` +- `results/mapped/sample2_recalibrated.bam` +- `results/mapped/sample3_recalibrated.bam` +- `results/called/proband1_raw_snps_indels.g.vcf` -### 9. Evaluate the pipeline run +## Prerequisites -Generate an interactive html report +- **Prerequisite hardware:** [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) (for GPU accelerated runs) +- **Prerequisite software:** [NVIDIA CLARA parabricks and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/) (for GPU accelerated runs), [Git](https://git-scm.com/) (tested with version 2.7.4), [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) with [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2), [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52), [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6) -```bash -bash report.sh -``` +## Test human_genomics_pipeline -### 10. Commit and push to your forked version of the github repo +The provided [test dataset](./test) can be used to test running this pipeline on a new machine, or test pipeline developments/releases. -To maintain reproducibility, commit and push: +## Run human_genomics_pipeline -- All configuration files -- All run scripts -- The final report +See the docs for a walkthrough guide for running [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) on: -### 11. Repeat step 10 each time you re-run the analysis with different parameters +- [A single machine like a laptop or single server/computer](./docs/running_on_a_single_machine.md) +- [A high performance cluster](./docs/running_on_a_hpc.md) -### 12. Create a pull request with the [upstream repo](https://github.com/ESR-NZ/human_genomics_pipeline) to merge any useful changes (optional) +## Contribute back! -Contributions and feedback are more than welcome!
:blush: +- Raise issues in [the issues page](https://github.com/ESR-NZ/human_genomics_pipeline/issues) +- Create feature requests in [the issues page](https://github.com/ESR-NZ/human_genomics_pipeline/issues) +- Contribute your code! Create your own branch from the [development branch](https://github.com/ESR-NZ/human_genomics_pipeline/tree/dev) and create a pull request to the [development branch](https://github.com/ESR-NZ/human_genomics_pipeline/tree/dev) once the code is on point! -See [here](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request) for help +Contributions and feedback are always welcome! :blush: diff --git a/config/cluster.json b/config/cluster.json index 7b7d733..0738a2e 100644 --- a/config/cluster.json +++ b/config/cluster.json @@ -2,6 +2,7 @@ "__default__" : { "account" : "", - "partition" : "" + "partition" : "", + "output" : "logs/slurm-%j_{rule}.out" } } \ No newline at end of file diff --git a/config/config.yaml b/config/config.yaml index 866a435..c4879fd 100644 --- a/config/config.yaml +++ b/config/config.yaml @@ -19,9 +19,9 @@ TEMPDIR: "" # Whole exome sequence settings (leave blank if analysing other data such as whole genome sequence data) WES: - # File path to the exome capture regions over which to operate (prefix with the '-L' flag) + # File path to the exome capture regions over which to operate INTERVALS: "" - # Padding (in bp) to add to each region (prefix with the '-ip' flag) + # Padding (in bp) to add to each region PADDING: "" ############################## @@ -44,7 +44,7 @@ GPU: # Whether or not to trim the raw fastq reads (either 'Yes' or 'No') TRIM: "" -# If trimming, choose the adapter sequence to be trimmed (eg. `--illumina`, `--nextera` or `--small_rna`) +# If trimming, choose the adapter sequence to be trimmed (eg. `--illumina`, `--nextera` or `--small_rna`) or pass adapter sequences to the `-a` and `-a2` flags TRIMMING: ADAPTERS: "" @@ -52,6 +52,9 @@ TRIMMING: ##### Base recalibration ##### ############################## -# Resources to used for base recalibration (prefix each resource with the '--known-sites' flag if not gpu accelerated and the '--knownSites' if gpu accelerated) +# List of resources to used for base recalibration RECALIBRATION: - RESOURCES: "" \ No newline at end of file + RESOURCES: + - + - + - \ No newline at end of file diff --git a/docs/running_on_a_hpc.md b/docs/running_on_a_hpc.md new file mode 100644 index 0000000..f86d150 --- /dev/null +++ b/docs/running_on_a_hpc.md @@ -0,0 +1,347 @@ +# Run human_genomics_pipeline on a high performance cluster + +## Table of contents + +- [Run human_genomics_pipeline on a high performance cluster](#run-human_genomics_pipeline-on-a-high-performance-cluster) + - [Table of contents](#table-of-contents) + - [1. Fork the pipeline repo to a personal or lab account](#1-fork-the-pipeline-repo-to-a-personal-or-lab-account) + - [2. Take the pipeline to the data on the HPC](#2-take-the-pipeline-to-the-data-on-the-hpc) + - [3. Setup files and directories](#3-setup-files-and-directories) + - [Test data](#test-data) + - [4. Get prerequisite software/hardware](#4-get-prerequisite-softwarehardware) + - [5. Create a local copy of the GATK resource bundle (either b37 or hg38)](#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38) + - [b37](#b37) + - [hg38](#hg38) + - [6. 
Modify the configuration file](#6-modify-the-configuration-file) + - [Overall workflow](#overall-workflow) + - [Pipeline resources](#pipeline-resources) + - [Trimming](#trimming) + - [Base recalibration](#base-recalibration) + - [7. Configure to run on a HPC](#7-configure-to-run-on-a-hpc) + - [8. Modify the run scripts](#8-modify-the-run-scripts) + - [9. Create and activate a conda environment with python and snakemake installed](#9-create-and-activate-a-conda-environment-with-python-and-snakemake-installed) + - [10. Run the pipeline](#10-run-the-pipeline) + - [11. Evaluate the pipeline run](#11-evaluate-the-pipeline-run) + - [12. Commit and push to your forked version of the github repo](#12-commit-and-push-to-your-forked-version-of-the-github-repo) + - [13. Repeat step 12 each time you re-run the analysis with different parameters](#13-repeat-step-12-each-time-you-re-run-the-analysis-with-different-parameters) + - [14. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)](#14-raise-issues-create-feature-requests-or-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-to-the-pipeline-optional) + +## 1. Fork the pipeline repo to a personal or lab account + +See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) for help forking a repository + +## 2. Take the pipeline to the data on the HPC + +Clone the forked [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) repo into the same directory as your paired end fastq data to be processed. + +See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#keep-your-fork-synced) for help cloning a repository + +## 3. Setup files and directories + +Required folder structure and file naming convention: + +```bash + +. +|___fastq/ +| |___sample1_1.fastq.gz +| |___sample1_2.fastq.gz +| |___sample2_1.fastq.gz +| |___sample2_2.fastq.gz +| |___sample3_1.fastq.gz +| |___sample3_2.fastq.gz +| |___sample4_1.fastq.gz +| |___sample4_2.fastq.gz +| |___sample5_1.fastq.gz +| |___sample5_2.fastq.gz +| |___sample6_1.fastq.gz +| |___sample6_2.fastq.gz +| |___ ... +| +|___human_genomics_pipeline/ + +``` + +If you're analysing cohort's of samples, you will need an additional directory with a [pedigree file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format) for each cohort/family using the following folder structure and file naming convention: + +```bash + +. +|___fastq/ +| |___sample1_1.fastq.gz +| |___sample1_2.fastq.gz +| |___sample2_1.fastq.gz +| |___sample2_2.fastq.gz +| |___sample3_1.fastq.gz +| |___sample3_2.fastq.gz +| |___sample4_1.fastq.gz +| |___sample4_2.fastq.gz +| |___sample5_1.fastq.gz +| |___sample5_2.fastq.gz +| |___sample6_1.fastq.gz +| |___sample6_2.fastq.gz +| |___ ... +| +|___pedigrees/ +| |___proband1_pedigree.ped +| |___proband2_pedigree.ped +| |___ ... 
+| +|___human_genomics_pipeline/ + +``` + +Requirements: + +- Input paired end fastq files need to identified with `_1` and `_2` (not `_R1` and `_R2`) +- Currently, the filenames of the pedigree files need to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement) +- Singletons and cohorts need to be run in separate pipeline runs +- It is assumed that there is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of a given pedigree file) + +### Test data + +The provided [test dataset](./test) can be used. Setup the test dataset before running the pipeline on this data - choose to setup to run either a single sample analysis or a cohort analysis with the `-a` flag. For example: + +```bash +cd ./human_genomics_pipeline +bash ./test/setup_test.sh -a cohort +``` + +## 4. Get prerequisite software/hardware + +For GPU accelerated runs, you'll need [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/). Talk to your system administrator to see if the HPC has this hardware and software available. + +Other software required to get setup and run the pipeline: + +- [Git](https://git-scm.com/) (tested with version 2.7.4) +- [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2) +- [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) (note. [mamba can be installed via conda with a single command](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install)) +- [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52) +- [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6) + +Most of this software is commonly pre-installed on HPC's, likely available as modules that can be loaded. Talk to your system administrator if you need help with this. + +## 5. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38) + +### b37 + +Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?prefix=) + +```bash +gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/ +``` + +### hg38 + +Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0) + +```bash +gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/ +``` + +## 6. Modify the configuration file + +Edit 'config.yaml' found within the config directory + +### Overall workflow + +Specify whether the data is to be analysed on it's own ('Single') or as a part of a cohort of samples ('Cohort'). For example: + +```yaml +DATA: "Single" +``` + +Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS](https://www.nvidia.com/en-us/docs/parabricks/local-installation/)) + +```yaml +GPU_ACCELERATED: "No" +``` + +Set the the working directories to the reference human genome file (b37 or hg38). For example: + +```yaml +REFGENOME: "/scratch/publicData/b37/human_g1k_v37_decoy.fasta" +``` + +Set the the working directory to your dbSNP database file (b37 or hg38). 
For example: + +```yaml +dbSNP: "/scratch/publicData/b37/dbsnp_138.b37.vcf" +``` + +Set the the working directory to a temporary file directory. Make sure this is a location with a fair amount of memory space for large intermediate analysis files. For example: + +```yaml +TEMPDIR: "/scratch/tmp/" +``` + +If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see [here](https://leahkemp.github.io/documentation/human_genomic_pipelines/design_files.html) for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs. For example: + +*If NOT analysing WES data, leave these fields blank* + +```yaml +WES: + # File path to the exome capture regions over which to operate + INTERVALS: "/scratch/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed" + # Padding (in bp) to add to each region + PADDING: "100" +``` + +### Pipeline resources + +These settings allow you to configure the resources per rule/sample + +Set the number of threads to use per sample/rule for multithreaded rules (`rule trim_galore_pe` and `rule bwa_mem`). Multithreading will significantly speed up these rules, however the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the `--set-threads` flag in the runscript (see [step 8](#8-modify-the-run-scripts)). + +```yaml +THREADS: 8 +``` + +Set the maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes) + +```yaml +MAXMEMORY: "40g" +``` + +Set the maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU) + +```yaml +GPU: 1 +``` + +It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of cores to be used by the pipeline in the run script to `-j 32` (see step 6), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`. + +#### Trimming + +Specify whether the raw fastq reads should be trimmed (either 'Yes' or 'No'). For example: + +```yaml +TRIM: "Yes" +``` + +*Note. if you'd like to use a different trimming tool that you feel is better for your use case/data, you can pre-trim your fastq files and pass the trimmed fastq's to this pipeline, turning off this pipelines internal trimming with as outlined above* + +If trimming the raw fastq reads, set the [trim galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) adapter trimming parameters. Choose one of the common adapters such as Illumina universal, Nextera transposase or Illumina small RNA with `--illumina`, `--nextera` or `--small_rna`. Alternatively, pass adapter sequences to the `-a` and `-a2` flags. If not set, trim galore will try to auto-detect the adapter based on the fastq reads. 
+ +```yaml +TRIMMING: + ADAPTERS: "--illumina" +``` + +### Base recalibration + +Pass the resources to be used to recalibrate bases with [gatk BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator) (this passes the resource files to the [`--known-sites`](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator#--known-sites) flag); these known polymorphic sites will be used to exclude regions around known polymorphisms from base recalibration. Note. you can include as many or as few resources as you like, but you'll need at least one recalibration resource file. For example: + +```yaml +RECALIBRATION: + RESOURCES: + - /scratch/publicData/b37/dbsnp_138.b37.vcf + - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf + - /scratch/publicData/b37/1000G_phase1.indels.b37.vcf +``` + +## 7. Configure to run on a HPC + +*This will deploy the non-GPU accelerated rules to slurm and deploy the GPU accelerated rules locally (pbrun_fq2bam, pbrun_haplotypecaller_single, pbrun_haplotypecaller_cohort). Therefore, if running the pipeline gpu accelerated, the pipeline should be deployed from the machine with the GPUs.* + +In theory, this cluster configuration should be adaptable to other job scheduler systems. In fact, snakemake is designed to make pipelines/workflows such as this portable to different job schedulers by keeping [any cluster-specific configuration outside of the workflow/pipeline](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated), but here I will provide a minimal example for deploying this pipeline to [slurm](https://slurm.schedmd.com/). + +Configure `account:` and `partition:` in the default section of 'cluster.json' in order to set the parameters for slurm sbatch (see documentation [here](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated) and [here](https://slurm.schedmd.com/)). For example: + +```json +{ + "__default__" : + { + "account" : "lkemp", + "partition" : "prod", + "output" : "logs/slurm-%j_{rule}.out" + } +} +``` + +There are a plethora of additional slurm parameters that can be configured (and can be configured per rule). If you set additional slurm parameters, remember to pass them to the `--cluster` flag in the runscripts. See [here](https://snakemake-on-nesi.sschmeier.com/snake.html#slurm-and-nesi-specific-setup) for a good working example of deploying a snakemake workflow to [NeSi](https://www.nesi.org.nz/). + +## 8. Modify the run scripts + +Set the maximum number of cores to be used with the `--cores` flag and the maximum amount of memory to be used (in megabytes) with the `--resources mem_mb=` flag. If running GPU accelerated, also set the maximum number of GPUs to be used with the `--resources gpu=` flag.
For example: + +Dry run (dryrun_hpc.sh): + +```bash +snakemake \ +--dryrun \ +--cores 32 \ +--resources mem_mb=150000 \ +--resources gpu=2 \ +--use-conda \ +--conda-frontend mamba \ +--latency-wait 120 \ +--configfile ../config/config.yaml \ +--cluster-config ../config/cluster.json \ +--cluster "sbatch -A {cluster.account} \ +-p {cluster.partition} \ +-o {cluster.output}" +``` + +Full run (run_hpc.sh): + +```bash +snakemake \ +--cores 32 \ +--resources mem_mb=150000 \ +--resources gpu=2 \ +--use-conda \ +--conda-frontend mamba \ +--latency-wait 120 \ +--configfile ../config/config.yaml \ +--cluster-config ../config/cluster.json \ +--cluster "sbatch -A {cluster.account} \ +-p {cluster.partition} \ +-o {cluster.output}" +``` + +See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html#all-options) for additional run parameters. + +## 9. Create and activate a conda environment with python and snakemake installed + +```bash +cd ./workflow/ +mamba env create -f pipeline_run_env.yml +conda activate pipeline_run_env +``` + +## 10. Run the pipeline + +First carry out a dry run + +```bash +bash dryrun_hpc.sh +``` + +If there are no issues, start a full run + +```bash +bash run_hpc.sh +``` + +## 11. Evaluate the pipeline run + +Generate an interactive html report + +```bash +bash report.sh +``` + +## 12. Commit and push to your forked version of the github repo + +To maintain reproducibility, commit and push: + +- All configuration files +- All run scripts +- The final report + +## 13. Repeat step 12 each time you re-run the analysis with different parameters + +## 14. Raise issues, create feature requests or create a pull request with the [upstream repo](https://github.com/ESR-NZ/human_genomics_pipeline) to merge any useful changes to the pipeline (optional) + +See [the README](https://github.com/ESR-NZ/human_genomics_pipeline/blob/dev/README.md#contribute-back) for info on how to contribute back to the pipeline! diff --git a/docs/running_on_a_single_machine.md b/docs/running_on_a_single_machine.md new file mode 100644 index 0000000..e426315 --- /dev/null +++ b/docs/running_on_a_single_machine.md @@ -0,0 +1,314 @@ +# Run human_genomics_pipeline on a single machine like a laptop or single server/computer + +## Table of contents + +- [Run human_genomics_pipeline on a single machine like a laptop or single server/computer](#run-human_genomics_pipeline-on-a-single-machine-like-a-laptop-or-single-servercomputer) + - [Table of contents](#table-of-contents) + - [1. Fork the pipeline repo to a personal or lab account](#1-fork-the-pipeline-repo-to-a-personal-or-lab-account) + - [2. Take the pipeline to the data on your local machine](#2-take-the-pipeline-to-the-data-on-your-local-machine) + - [3. Setup files and directories](#3-setup-files-and-directories) + - [Test data](#test-data) + - [4. Get prerequisite software/hardware](#4-get-prerequisite-softwarehardware) + - [5. Create a local copy of the GATK resource bundle (either b37 or hg38)](#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38) + - [b37](#b37) + - [hg38](#hg38) + - [6. Modify the configuration file](#6-modify-the-configuration-file) + - [Overall workflow](#overall-workflow) + - [Pipeline resources](#pipeline-resources) + - [Trimming](#trimming) + - [Base recalibration](#base-recalibration) + - [7. Modify the run scripts](#7-modify-the-run-scripts) + - [8. 
Create and activate a conda environment with python and snakemake installed](#8-create-and-activate-a-conda-environment-with-python-and-snakemake-installed) + - [9. Run the pipeline](#9-run-the-pipeline) + - [10. Evaluate the pipeline run](#10-evaluate-the-pipeline-run) + - [11. Commit and push to your forked version of the github repo](#11-commit-and-push-to-your-forked-version-of-the-github-repo) + - [12. Repeat step 11 each time you re-run the analysis with different parameters](#12-repeat-step-11-each-time-you-re-run-the-analysis-with-different-parameters) + - [13. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)](#13-raise-issues-create-feature-requests-or-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-to-the-pipeline-optional) + +## 1. Fork the pipeline repo to a personal or lab account + +See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) for help forking a repository + +## 2. Take the pipeline to the data on your local machine + +Clone the forked [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) repo into the same directory as your paired end fastq data to be processed. + +See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#keep-your-fork-synced) for help cloning a repository + +## 3. Setup files and directories + +Required folder structure and file naming convention: + +```bash + +. +|___fastq/ +| |___sample1_1.fastq.gz +| |___sample1_2.fastq.gz +| |___sample2_1.fastq.gz +| |___sample2_2.fastq.gz +| |___sample3_1.fastq.gz +| |___sample3_2.fastq.gz +| |___sample4_1.fastq.gz +| |___sample4_2.fastq.gz +| |___sample5_1.fastq.gz +| |___sample5_2.fastq.gz +| |___sample6_1.fastq.gz +| |___sample6_2.fastq.gz +| |___ ... +| +|___human_genomics_pipeline/ + +``` + +If you're analysing cohort's of samples, you will need an additional directory with a [pedigree file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format) for each cohort/family using the following folder structure and file naming convention: + +```bash + +. +|___fastq/ +| |___sample1_1.fastq.gz +| |___sample1_2.fastq.gz +| |___sample2_1.fastq.gz +| |___sample2_2.fastq.gz +| |___sample3_1.fastq.gz +| |___sample3_2.fastq.gz +| |___sample4_1.fastq.gz +| |___sample4_2.fastq.gz +| |___sample5_1.fastq.gz +| |___sample5_2.fastq.gz +| |___sample6_1.fastq.gz +| |___sample6_2.fastq.gz +| |___ ... +| +|___pedigrees/ +| |___proband1_pedigree.ped +| |___proband2_pedigree.ped +| |___ ... +| +|___human_genomics_pipeline/ + +``` + +Requirements: + +- Input paired end fastq files need to identified with `_1` and `_2` (not `_R1` and `_R2`) +- Currently, the filenames of the pedigree files need to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement) +- Singletons and cohorts need to be run in separate pipeline runs +- It is assumed that there is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of a given pedigree file) + +### Test data + +The provided [test dataset](./test) can be used. Setup the test dataset before running the pipeline on this data - choose to setup to run either a single sample analysis or a cohort analysis with the `-a` flag. 
For example: + +```bash +cd ./human_genomics_pipeline +bash ./test/setup_test.sh -a cohort +``` + +## 4. Get prerequisite software/hardware + +For GPU accelerated runs, you'll need [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/). Talk to your system administrator to see if the HPC has this hardware and software available. + +Other software required to get setup and run the pipeline: + +- [Git](https://git-scm.com/) (tested with version 2.7.4) +- [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2) +- [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) (note. [mamba can be installed via conda with a single command](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install)) +- [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52) +- [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6) + +Most of this software is commonly pre-installed on HPC's, likely available as modules that can be loaded. Talk to your system administrator if you need help with this. + +## 5. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38) + +### b37 + +Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?prefix=) + +```bash +gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/ +``` + +### hg38 + +Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0) + +```bash +gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/ +``` + +## 6. Modify the configuration file + +Edit 'config.yaml' found within the config directory + +### Overall workflow + +Specify whether the data is to be analysed on it's own ('Single') or as a part of a cohort of samples ('Cohort'). For example: + +```yaml +DATA: "Single" +``` + +Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS](https://www.nvidia.com/en-us/docs/parabricks/local-installation/)) + +```yaml +GPU_ACCELERATED: "No" +``` + +Set the the working directories to the reference human genome file (b37 or hg38). For example: + +```yaml +REFGENOME: "/scratch/publicData/b37/human_g1k_v37_decoy.fasta" +``` + +Set the the working directory to your dbSNP database file (b37 or hg38). For example: + +```yaml +dbSNP: "/scratch/publicData/b37/dbsnp_138.b37.vcf" +``` + +Set the the working directory to a temporary file directory. Make sure this is a location with a fair amount of memory space for large intermediate analysis files. For example: + +```yaml +TEMPDIR: "/scratch/tmp/" +``` + +If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see [here](https://leahkemp.github.io/documentation/human_genomic_pipelines/design_files.html) for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs. 
For example: + +*If NOT analysing WES data, leave these fields blank* + +```yaml +WES: + # File path to the exome capture regions over which to operate + INTERVALS: "/scratch/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed" + # Padding (in bp) to add to each region + PADDING: "100" +``` + +### Pipeline resources + +These settings allow you to configure the resources per rule/sample + +Set the number of threads to use per sample/rule for multithreaded rules (`rule trim_galore_pe` and `rule bwa_mem`). Multithreading will significantly speed up these rules, however the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the `--set-threads` flag in the runscript (see [step 7](#7-modify-the-run-scripts)). + +```yaml +THREADS: 8 +``` + +Set the maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes) + +```yaml +MAXMEMORY: "40g" +``` + +Set the maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU) + +```yaml +GPU: 1 +``` + +It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of cores to be used by the pipeline in the run script to `-j 32` (see step 6), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`. + +#### Trimming + +Specify whether the raw fastq reads should be trimmed (either 'Yes' or 'No'). For example: + +```yaml +TRIM: "Yes" +``` + +If trimming the raw fastq reads, set the [trim galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) adapter trimming parameters. Choose one of the common adapters such as Illumina universal, Nextera transposase or Illumina small RNA with `--illumina`, `--nextera` or `--small_rna`. Alternatively, pass adapter sequences to the `-a` and `-a2` flags. If not set, trim galore will try to auto-detect the adapter based on the fastq reads. + +```yaml +TRIMMING: + ADAPTERS: "--illumina" +``` + +### Base recalibration + +Pass the resources to be used to recalibrate bases with [gatk BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator) (this passes the resource files to the [`--known-sites`](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator#--known-sites) flag), these known polymorphic sites will be used to exclude regions around known polymorphisms from base recalibration. Note. you can include as many or as few resources as you like, but you'll need at least one recalibration resource file. For example: + +```yaml +RECALIBRATION: + RESOURCES: + - /scratch/publicData/b37/dbsnp_138.b37.vcf + - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf + - /scratch/publicData/b37/1000G_phase1.indels.b37.vcf +``` + +## 7. Modify the run scripts + +Set the number maximum number of cores to be used with the `--cores` flag and the maximum amount of memory to be used (in megabytes) with the `resources mem_mb=` flag. 
If running GPU accelerated, also set the maximum number of GPU's to be used with the `--resources gpu=` flag. For example: + +Dry run (dryrun.sh): + +```bash +snakemake \ +--dryrun \ +--cores 32 \ +--resources mem_mb=150000 \ +--resources gpu=2 \ +--use-conda \ +--conda-frontend mamba \ +--latency-wait 120 \ +--configfile ../config/config.yaml +``` + +Full run (run.sh): + +```bash +snakemake \ +--cores 32 \ +--resources mem_mb=150000 \ +--resources gpu=2 \ +--use-conda \ +--latency-wait 120 \ +--configfile ../config/config.yaml +``` + +See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html#all-options) for additional run parameters. + +## 8. Create and activate a conda environment with python and snakemake installed + +```bash +cd ./workflow/ +mamba env create -f pipeline_run_env.yml +conda activate pipeline_run_env +``` + +## 9. Run the pipeline + +First carry out a dry run + +```bash +bash dryrun.sh +``` + +If there are no issues, start a full run + +```bash +bash run.sh +``` + +## 10. Evaluate the pipeline run + +Generate an interactive html report + +```bash +bash report.sh +``` + +## 11. Commit and push to your forked version of the github repo + +To maintain reproducibility, commit and push: + +- All configuration files +- All run scripts +- The final report + +## 12. Repeat step 11 each time you re-run the analysis with different parameters + +## 13. Raise issues, create feature requests or create a pull request with the [upstream repo](https://github.com/ESR-NZ/human_genomics_pipeline) to merge any useful changes to the pipeline (optional) + +See [the README](https://github.com/ESR-NZ/human_genomics_pipeline/blob/dev/README.md#contribute-back) for info on how to contribute back to the pipeline! diff --git a/images/jupyterLauncher.png b/images/jupyterLauncher.png new file mode 100644 index 0000000..ecaeb61 Binary files /dev/null and b/images/jupyterLauncher.png differ diff --git a/images/jupyter_login_labels_updated.png b/images/jupyter_login_labels_updated.png new file mode 100644 index 0000000..a6b9c83 Binary files /dev/null and b/images/jupyter_login_labels_updated.png differ diff --git a/images/nesi99991_screenshot.png b/images/nesi99991_screenshot.png new file mode 100644 index 0000000..df1d426 Binary files /dev/null and b/images/nesi99991_screenshot.png differ diff --git a/images/rulegraph_cohort.png b/images/rulegraph_cohort.png index 23bd7e8..87960f3 100644 Binary files a/images/rulegraph_cohort.png and b/images/rulegraph_cohort.png differ diff --git a/images/rulegraph_cohort_gpu.png b/images/rulegraph_cohort_gpu.png index 41ee9fc..aed8fb5 100644 Binary files a/images/rulegraph_cohort_gpu.png and b/images/rulegraph_cohort_gpu.png differ diff --git a/images/rulegraph_single.png b/images/rulegraph_single.png index c5b5c93..30cc46a 100644 Binary files a/images/rulegraph_single.png and b/images/rulegraph_single.png differ diff --git a/images/rulegraph_single_gpu.png b/images/rulegraph_single_gpu.png index a09196c..1b7d8e0 100644 Binary files a/images/rulegraph_single_gpu.png and b/images/rulegraph_single_gpu.png differ diff --git a/test/cohort/fastq/NA24631_1.fastq.gz b/test/cohort/fastq/NA24631_1.fastq.gz new file mode 100644 index 0000000..a8dc360 Binary files /dev/null and b/test/cohort/fastq/NA24631_1.fastq.gz differ diff --git a/test/cohort/fastq/NA24631_2.fastq.gz b/test/cohort/fastq/NA24631_2.fastq.gz new file mode 100644 index 0000000..370972f Binary files /dev/null and b/test/cohort/fastq/NA24631_2.fastq.gz differ diff --git 
a/test/cohort/fastq/NA24694_1.fastq.gz b/test/cohort/fastq/NA24694_1.fastq.gz new file mode 100644 index 0000000..a9a1901 Binary files /dev/null and b/test/cohort/fastq/NA24694_1.fastq.gz differ diff --git a/test/cohort/fastq/NA24694_2.fastq.gz b/test/cohort/fastq/NA24694_2.fastq.gz new file mode 100644 index 0000000..5141473 Binary files /dev/null and b/test/cohort/fastq/NA24694_2.fastq.gz differ diff --git a/test/cohort/fastq/NA24695_1.fastq.gz b/test/cohort/fastq/NA24695_1.fastq.gz new file mode 100644 index 0000000..de9ea50 Binary files /dev/null and b/test/cohort/fastq/NA24695_1.fastq.gz differ diff --git a/test/cohort/fastq/NA24695_2.fastq.gz b/test/cohort/fastq/NA24695_2.fastq.gz new file mode 100644 index 0000000..6b4e285 Binary files /dev/null and b/test/cohort/fastq/NA24695_2.fastq.gz differ diff --git a/test/cohort/pedigrees/NA24631_pedigree.ped b/test/cohort/pedigrees/NA24631_pedigree.ped new file mode 100644 index 0000000..db716cb --- /dev/null +++ b/test/cohort/pedigrees/NA24631_pedigree.ped @@ -0,0 +1,3 @@ +NA24631 NA24631 NA24694 NA24695 1 2 +NA24631 NA24694 0 0 1 1 +NA24631 NA24695 0 0 2 1 diff --git a/test/setup_test.sh b/test/setup_test.sh new file mode 100644 index 0000000..72f347d --- /dev/null +++ b/test/setup_test.sh @@ -0,0 +1,29 @@ +#!/bin/bash + +# Move test data to the correct location + +helpFunction() +{ + echo "" + echo "Usage: $0 -a parameterA" + echo -e "\t-a choose to setup to run either a single sample analysis (-a single) or a cohort analysis (-a cohort)" + exit 1 # Exit script after printing help +} + +while getopts "a:" opt +do + case "$opt" in + a ) parameterA="$OPTARG" ;; + ? ) helpFunction ;; # Print helpFunction in case parameter is non-existent + esac +done + +# Print helpFunction in case parameters are empty +if [ -z "$parameterA" ] +then + echo "Value for parameter not provided"; + helpFunction +fi + +# Begin script in case all parameters are correct +cp -r ./test/"$parameterA"/* ../ \ No newline at end of file diff --git a/test/single/fastq/NA24631_1.fastq.gz b/test/single/fastq/NA24631_1.fastq.gz new file mode 100644 index 0000000..a8dc360 Binary files /dev/null and b/test/single/fastq/NA24631_1.fastq.gz differ diff --git a/test/single/fastq/NA24631_2.fastq.gz b/test/single/fastq/NA24631_2.fastq.gz new file mode 100644 index 0000000..370972f Binary files /dev/null and b/test/single/fastq/NA24631_2.fastq.gz differ diff --git a/test/test.md b/test/test.md new file mode 100644 index 0000000..894266f --- /dev/null +++ b/test/test.md @@ -0,0 +1,5 @@ +# Test dataset + +This directory contains a reduced dataset of the publicly available [Han Chinese GIAB trio](https://github.com/genome-in-a-bottle/about_GIAB) that can be used to test running this pipeline on a new machine, or test pipeline developments. 
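A typical test run might look like the following sketch (it assumes the conda environment and `config/config.yaml` have already been set up as described in the docs added in this diff):

```bash
# Copy the single-sample test data into place, then dry-run the pipeline
cd ./human_genomics_pipeline
bash ./test/setup_test.sh -a single   # copies test/single/* (the fastq/ directory) to the parent directory
cd ./workflow/
bash dryrun.sh                        # follow with 'bash run.sh' if the dry run completes cleanly
```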
+ +Detailed documentation for how this test dataset was created can be found at https://github.com/leahkemp/documentation/blob/master/human_genomic_pipelines/create_test_dataset.md \ No newline at end of file diff --git a/workflow/Snakefile b/workflow/Snakefile index b1a3893..6558e1d 100644 --- a/workflow/Snakefile +++ b/workflow/Snakefile @@ -6,12 +6,144 @@ Date created: 2019-08-21 Modified: 2020-10-09 """ -##### Set up ##### +##### Set up wildcards ##### # Define samples from fastq dir and families/cohorts from pedigree dir using wildcards SAMPLES, = glob_wildcards("../../fastq/{sample}_1.fastq.gz") FAMILIES, = glob_wildcards("../../pedigrees/{family}_pedigree.ped") +##### Setup helper functions ##### +import csv +import glob + +def get_input_fastq(command): + """Return a string which defines the input fastq files for the bwa_mem and pbrun_germline rules. + This changes based on the user configurable options for trimming + """ + + input_files = "" + + if config['TRIM'] == "No" or config['TRIM'] == "no": + input_files = ["../../fastq/{sample}_1.fastq.gz", "../../fastq/{sample}_2.fastq.gz"] + if config['TRIM'] == "Yes" or config['TRIM'] == "yes": + input_files = ["../results/trimmed/{sample}_1_val_1.fq.gz", "../results/trimmed/{sample}_2_val_2.fq.gz"] + + return input_files + +def get_output_vcf(config): + """Return a string which defines the output vcf files for the pbrun_germline rule. This changes based on the + user configurable options for running single samples or cohorts of samples + """ + + vcf = "" + + if config['DATA'] == "Single" or config['DATA'] == 'single': + vcf = "../results/called/{sample}_raw_snps_indels.vcf" + if config['DATA'] == "Cohort" or config['DATA'] == 'cohort': + vcf = "../results/called/{sample}_raw_snps_indels_tmp.g.vcf" + + return vcf + +def get_params(command): + """Return a string which defines some parameters for the pbrun_germline rule. This changes based on the + user configurable options for running single samples or cohorts of samples + """ + + params = "" + + if config['DATA'] == "Single" or config['DATA'] == 'single': + params = "" + if config['DATA'] == "Cohort" or config['DATA'] == 'cohort': + params = "--gvcf" + + return params + +def get_gatk_combinegvcf_command(family): + """Return a string, a portion of the gatk CombineGVCF command which defines individuals which should be combined. This + command is used by the gatk_CombineGVCFs rule. For a particular family, we construct the gatk command by adding + -V for each individual (defined by individual id column in the pedigree file) + """ + filename = "../../pedigrees/" + str(family) + "_pedigree.ped" + + command = "" + with open(filename, newline = '') as pedigree: + + pedigree_reader = csv.DictReader(pedigree, fieldnames = ('family', 'individual_id', 'paternal_id', 'maternal_id', 'sex', 'phenotype'), delimiter='\t') + for individual in pedigree_reader: + command += "-V ../results/called/" + individual['individual_id'] + "_raw_snps_indels_tmp.g.vcf " + + return command + +def get_parabricks_combinegvcf_command(family): + """Return a string, a portion of the gatk CombineGVCF command which defines individuals which should be combined. This + command is used by the gatk_triocombinegvcf rule. 
For a particular family, we construct the command by adding + --in-gvcf for each individual (defined by individual id column in the pedigree file) + """ + filename = "../../pedigrees/" + str(family) + "_pedigree.ped" + + command = "" + with open(filename, newline = '') as pedigree: + + pedigree_reader = csv.DictReader(pedigree, fieldnames = ('family', 'individual_id', 'paternal_id', 'maternal_id', 'sex', 'phenotype'), delimiter='\t') + for individual in pedigree_reader: + command += "--in-gvcf ../results/called/" + individual['individual_id'] + "_raw_snps_indels_tmp.g.vcf " + + return command + +def get_recal_resources_command(resource): + """Return a string, a portion of the gatk BaseRecalibrator command (used in the gatk_BaseRecalibrator and the + pbrun_germline rules) which dynamically includes each of the recalibration resources defined by the user + in the configuration file. For each recalibration resource (element in the list), we construct the command by + adding either --knownSites (for parabricks) or --known-sites (for gatk4) + """ + + command = "" + + for resource in config['RECALIBRATION']['RESOURCES']: + if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes": + command += "--knownSites " + resource + " " + + if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no": + command += "--known-sites " + resource + " " + + return command + +def get_wes_intervals_command(resource): + """Return a string, a portion of the gatk commands (used in several gatk rules) which builds a flag + formatted for gatk based on the configuration file. We construct the command by adding either --interval-file + (for parabricks) or --L (for gatk4). If the user provides nothing for these configurable + options, an empty string is returned + """ + + command = "" + + if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes": + command = "--interval-file " + config['WES']['INTERVALS'] + " " + if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no": + command = "--L " + config['WES']['INTERVALS'] + " " + if config['WES']['INTERVALS'] == "": + command = "" + + return command + +def get_wes_padding_command(resource): + """Return a string, a portion of the gatk commands (used in several gatk rules) which builds a flag + formatted for gatk based on the configuration file. We construct the command by adding either --interval + (for parabricks) or --ip (for gatk4).
If the user provides nothing for these configurable + options, an empty string is returned + """ + + command = "" + + if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes": + command = "--interval " + config['WES']['PADDING'] + " " + if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no": + command = "--ip " + config['WES']['PADDING'] + " " + if config['WES']['PADDING'] == "": + command = "" + + return command + #### Set up report ##### report: "report/workflow.rst" @@ -33,9 +165,9 @@ if config['DATA'] == "Cohort" or config['DATA'] == 'cohort': expand("../results/mapped/{sample}_recalibrated.bam", sample = SAMPLES), expand("../results/called/{family}_raw_snps_indels.g.vcf", family = FAMILIES) -##### load rules ##### +##### Load rules ##### -localrules: multiqc, pbrun_fq2bam, pbrun_haplotypecaller_single, pbrun_haplotypecaller_cohort +localrules: multiqc, pbrun_germline, pbrun_triocombinegvcf include: "rules/fastqc.smk" @@ -52,21 +184,17 @@ if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no": include: "rules/gatk_BaseRecalibrator.smk" include: "rules/gatk_ApplyBQSR.smk" -if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes": - include: "rules/pbrun_fq2bam.smk" - -if (config['DATA'] == "Single" or config['DATA'] == 'single') and (config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no"): +if (config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no") and (config['DATA'] == "Single" or config['DATA'] == 'single'): include: "rules/gatk_HaplotypeCaller_single.smk" -if (config['DATA'] == "Single" or config['DATA'] == 'single') and (config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes"): - include: "rules/pbrun_haplotypecaller_single.smk" - -if (config['DATA'] == "Cohort" or config['DATA'] == 'cohort') and (config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no"): +if (config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no") and (config['DATA'] == "Cohort" or config['DATA'] == 'cohort'): include: "rules/gatk_HaplotypeCaller_cohort.smk" + include: "rules/gatk_CombineGVCFs.smk" + include: "rules/gatk_GenotypeGVCFs.smk" -if (config['DATA'] == "Cohort" or config['DATA'] == 'cohort') and (config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes"): - include: "rules/pbrun_haplotypecaller_cohort.smk" +if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes": + include: "rules/pbrun_germline.smk" -if (config['DATA'] == "Cohort" or config['DATA'] == 'cohort'): - include: "rules/gatk_CombineGVCFs.smk" +if (config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes") and (config['DATA'] == "Cohort" or config['DATA'] == 'cohort'): + include: "rules/pbrun_triocombinegvcf.smk" include: "rules/gatk_GenotypeGVCFs.smk" \ No newline at end of file diff --git a/workflow/dryrun.sh b/workflow/dryrun.sh index d1bad36..4bdc112 100644 --- a/workflow/dryrun.sh +++ b/workflow/dryrun.sh @@ -7,4 +7,5 @@ snakemake \ --resources gpu=2 \ --use-conda \ --conda-frontend mamba \ +--latency-wait 120 \ --configfile ../config/config.yaml \ No newline at end of file diff --git a/workflow/dryrun_hpc.sh b/workflow/dryrun_hpc.sh index 7539cf6..f5e38e9 100644 --- a/workflow/dryrun_hpc.sh +++ b/workflow/dryrun_hpc.sh @@ -7,7 +7,9 @@ snakemake \ --resources gpu=2 \ --use-conda \ --conda-frontend mamba \ +--latency-wait 120 \ --configfile ../config/config.yaml \ --cluster-config ../config/cluster.json \ --cluster "sbatch -A 
{cluster.account} \ --p {cluster.partition}" \ No newline at end of file +-p {cluster.partition} \ +-o {cluster.output}" \ No newline at end of file diff --git a/workflow/envs/fastqc.yaml b/workflow/envs/fastqc.yaml new file mode 100644 index 0000000..c9cbea6 --- /dev/null +++ b/workflow/envs/fastqc.yaml @@ -0,0 +1,6 @@ +channels: + - bioconda + - conda-forge + - defaults +dependencies: + - bioconda::fastqc =0.11.9 \ No newline at end of file diff --git a/workflow/envs/multiqc.yaml b/workflow/envs/multiqc.yaml index a2e64ee..33eeacc 100644 --- a/workflow/envs/multiqc.yaml +++ b/workflow/envs/multiqc.yaml @@ -3,4 +3,4 @@ channels: - conda-forge - defaults dependencies: - - bioconda::multiqc =1.8 \ No newline at end of file + - bioconda::multiqc =1.11 \ No newline at end of file diff --git a/workflow/logs/README.md b/workflow/logs/README.md new file mode 100644 index 0000000..256d7b8 --- /dev/null +++ b/workflow/logs/README.md @@ -0,0 +1 @@ +Snakemake and slurm log files will reside here \ No newline at end of file diff --git a/workflow/rules/bwa_mem.smk b/workflow/rules/bwa_mem.smk index b3c9f2e..40b1062 100644 --- a/workflow/rules/bwa_mem.smk +++ b/workflow/rules/bwa_mem.smk @@ -1,15 +1,6 @@ -if config['TRIM'] == "No" or config['TRIM'] == "no": - R1 = "../../fastq/{sample}_1.fastq.gz" - R2 = "../../fastq/{sample}_2.fastq.gz" - -if config['TRIM'] == "Yes" or config['TRIM'] == "yes": - R1 = "../results/trimmed/{sample}_1_val_1.fq.gz" - R2 = "../results/trimmed/{sample}_2_val_2.fq.gz" - rule bwa_mem: input: - R1 = R1, - R2 = R2, + fastq = get_input_fastq, refgenome = expand("{refgenome}", refgenome = config['REFGENOME']) output: temp("../results/mapped/{sample}_sorted.bam") @@ -26,6 +17,6 @@ rule bwa_mem: "../envs/bwa.yaml" threads: config['THREADS'] message: - "Mapping sequences against a reference human genome with BWA-MEM for {input.R1} {input.R2}" + "Mapping sequences against a reference human genome with BWA-MEM for {input.fastq}" shell: - "bwa mem -R {params.readgroup} -t {threads} -K 10000000 {input.refgenome} {input.R1} {input.R2} | gatk SortSam --java-options {params.maxmemory} {params.sortsam} -O={output} --TMP_DIR={params.tdir} &> {log}" \ No newline at end of file + "bwa mem -R {params.readgroup} -t {threads} -K 10000000 {input.refgenome} {input.fastq} | gatk SortSam --java-options {params.maxmemory} {params.sortsam} -O={output} --TMP_DIR={params.tdir} &> {log}" \ No newline at end of file diff --git a/workflow/rules/fastqc.smk b/workflow/rules/fastqc.smk index 4f4b663..e36b906 100644 --- a/workflow/rules/fastqc.smk +++ b/workflow/rules/fastqc.smk @@ -2,14 +2,16 @@ rule fastqc: input: ["../../fastq/{sample}_1.fastq.gz", "../../fastq/{sample}_2.fastq.gz"] output: - html = "../results/qc/fastqc/{sample}.html", - zip = "../results/qc/fastqc/{sample}_fastqc.zip" - params: "" + html = ["../results/qc/fastqc/{sample}_1_fastqc.html", "../results/qc/fastqc/{sample}_2_fastqc.html"], + zip = ["../results/qc/fastqc/{sample}_1_fastqc.zip", "../results/qc/fastqc/{sample}_2_fastqc.zip"] log: "logs/fastqc/{sample}.log" benchmark: "benchmarks/fastqc/{sample}.tsv" + threads: config['THREADS'] + conda: + "../envs/fastqc.yaml" message: "Undertaking quality control checks on raw sequence data for {input}" - wrapper: - "0.64.0/bio/fastqc" \ No newline at end of file + shell: + "fastqc {input} -o ../results/qc/fastqc/ -t {threads} &> {log}" \ No newline at end of file diff --git a/workflow/rules/gatk_ApplyBQSR.smk b/workflow/rules/gatk_ApplyBQSR.smk index 1a8a2e9..9d3540a 100644 --- 
a/workflow/rules/gatk_ApplyBQSR.smk +++ b/workflow/rules/gatk_ApplyBQSR.smk @@ -7,8 +7,8 @@ rule gatk_ApplyBQSR: bam = protected("../results/mapped/{sample}_recalibrated.bam") params: maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']) + padding = get_wes_padding_command, + intervals = get_wes_intervals_command log: "logs/gatk_ApplyBQSR/{sample}.log" benchmark: diff --git a/workflow/rules/gatk_BaseRecalibrator.smk b/workflow/rules/gatk_BaseRecalibrator.smk index dc1915d..265ccb9 100644 --- a/workflow/rules/gatk_BaseRecalibrator.smk +++ b/workflow/rules/gatk_BaseRecalibrator.smk @@ -6,10 +6,10 @@ rule gatk_BaseRecalibrator: report("../results/mapped/{sample}_recalibration_report.grp", caption = "../report/recalibration.rst", category = "Base recalibration") params: maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']), - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']), - recalibration_resources = expand("{recalibration_resources}", recalibration_resources = config['RECALIBRATION']['RESOURCES']) + tdir = config['TEMPDIR'], + padding = get_wes_padding_command, + intervals = get_wes_intervals_command, + recalibration_resources = get_recal_resources_command log: "logs/gatk_BaseRecalibrator/{sample}.log" benchmark: diff --git a/workflow/rules/gatk_CombineGVCFs.smk b/workflow/rules/gatk_CombineGVCFs.smk index 2214f36..57d647c 100644 --- a/workflow/rules/gatk_CombineGVCFs.smk +++ b/workflow/rules/gatk_CombineGVCFs.smk @@ -1,22 +1,3 @@ -import csv - -def get_command(family): - """Return a string, a portion of the gatk command which defines individuals which should be combined. 
- - For a particular family, we construct the gatk command by adding -V for each individual - (defined by individual id column in the pedigree file) - """ - filename = "../../pedigrees/" + str(family) + "_pedigree.ped" - - command = "" - with open(filename, newline = '') as pedigree: - - pedigree_reader = csv.DictReader(pedigree, fieldnames = ('family', 'individual_id', 'paternal_id', 'maternal_id', 'sex', 'phenotype'), delimiter='\t') - for individual in pedigree_reader: - command += "-V ../results/called/" + individual['individual_id'] + "_raw_snps_indels_tmp.g.vcf " - - return command - rule gatk_CombineGVCFs: input: vcf_dummy = expand("../results/called/{sample}_raw_snps_indels_tmp.g.vcf", sample = SAMPLES), # a dummy vcf to connect this rule to gatk_HaplotypeCaller @@ -25,10 +6,10 @@ rule gatk_CombineGVCFs: vcf = temp("../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf"), index = temp("../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf.idx") params: - command = get_command, - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']), + command = get_gatk_combinegvcf_command, + tdir = config['TEMPDIR'], + padding = get_wes_padding_command, + intervals = get_wes_intervals_command, other = "-G StandardAnnotation -G AS_StandardAnnotation" log: "logs/gatk_CombineGVCFs/{family}.log" diff --git a/workflow/rules/gatk_GenotypeGVCFs.smk b/workflow/rules/gatk_GenotypeGVCFs.smk index f5b8176..e7b27fa 100644 --- a/workflow/rules/gatk_GenotypeGVCFs.smk +++ b/workflow/rules/gatk_GenotypeGVCFs.smk @@ -6,9 +6,9 @@ rule gatk_GenotypeGVCFs: protected("../results/called/{family}_raw_snps_indels.g.vcf") params: maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']), - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']), + tdir = config['TEMPDIR'], + padding = get_wes_padding_command, + intervals = get_wes_intervals_command, other = "-G StandardAnnotation -G AS_StandardAnnotation" log: "logs/gatk_GenotypeGVCFs/{family}.log" diff --git a/workflow/rules/gatk_HaplotypeCaller_cohort.smk b/workflow/rules/gatk_HaplotypeCaller_cohort.smk index 1c5e90a..220d6a7 100644 --- a/workflow/rules/gatk_HaplotypeCaller_cohort.smk +++ b/workflow/rules/gatk_HaplotypeCaller_cohort.smk @@ -8,9 +8,9 @@ rule gatk_HaplotypeCaller_cohort: index = temp("../results/called/{sample}_raw_snps_indels_tmp.g.vcf.idx") params: maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']), - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']), + tdir = config['TEMPDIR'], + padding = get_wes_padding_command, + intervals = get_wes_intervals_command, other = "-ERC GVCF" log: "logs/gatk_HaplotypeCaller_cohort/{sample}.log" diff --git a/workflow/rules/gatk_HaplotypeCaller_single.smk b/workflow/rules/gatk_HaplotypeCaller_single.smk index b4a2bf6..7056bfe 100644 --- a/workflow/rules/gatk_HaplotypeCaller_single.smk +++ b/workflow/rules/gatk_HaplotypeCaller_single.smk @@ -7,9 +7,9 @@ rule gatk_HaplotypeCaller_single: protected("../results/called/{sample}_raw_snps_indels.vcf") params: maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']), - tdir = expand("{tdir}", 
tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']) + tdir = config['TEMPDIR'], + padding = get_wes_padding_command, + intervals = get_wes_intervals_command log: "logs/gatk_HaplotypeCaller_single/{sample}.log" benchmark: diff --git a/workflow/rules/gatk_MarkDuplicates.smk b/workflow/rules/gatk_MarkDuplicates.smk index 0ddcc96..4992490 100644 --- a/workflow/rules/gatk_MarkDuplicates.smk +++ b/workflow/rules/gatk_MarkDuplicates.smk @@ -6,7 +6,7 @@ rule gatk_MarkDuplicates: metrics = "../results/mapped/{sample}_sorted_mkdups_metrics.txt" params: maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']), - tdir = expand("{tdir}", tdir = config['TEMPDIR']) + tdir = config['TEMPDIR'] log: "logs/gatk_MarkDuplicates/{sample}.log" benchmark: diff --git a/workflow/rules/multiqc.smk b/workflow/rules/multiqc.smk index a07113b..2dd35d2 100644 --- a/workflow/rules/multiqc.smk +++ b/workflow/rules/multiqc.smk @@ -1,6 +1,6 @@ rule multiqc: input: - expand("../results/qc/fastqc/{sample}_fastqc.zip", sample = SAMPLES) + expand(["../results/qc/fastqc/{sample}_1_fastqc.zip", "../results/qc/fastqc/{sample}_2_fastqc.zip"], sample = SAMPLES) output: report("../results/qc/multiqc_report.html", caption = "../report/quality_checks.rst", category = "Quality checks") conda: diff --git a/workflow/rules/multiqc_trim.smk b/workflow/rules/multiqc_trim.smk index e8fd7ab..3303280 100644 --- a/workflow/rules/multiqc_trim.smk +++ b/workflow/rules/multiqc_trim.smk @@ -1,6 +1,6 @@ rule multiqc: input: - fastqc = expand("../results/qc/fastqc/{sample}_fastqc.zip", sample = SAMPLES), + fastqc = expand(["../results/qc/fastqc/{sample}_1_fastqc.zip", "../results/qc/fastqc/{sample}_2_fastqc.zip"], sample = SAMPLES), trimming1 = expand("../results/trimmed/{sample}_1.fastq.gz_trimming_report.txt", sample = SAMPLES), trimming2 = expand("../results/trimmed/{sample}_2.fastq.gz_trimming_report.txt", sample = SAMPLES) output: diff --git a/workflow/rules/pbrun_fq2bam.smk b/workflow/rules/pbrun_fq2bam.smk deleted file mode 100644 index 36a6ed3..0000000 --- a/workflow/rules/pbrun_fq2bam.smk +++ /dev/null @@ -1,33 +0,0 @@ -if config['TRIM'] == "No" or config['TRIM'] == "no": - R1 = "../../fastq/{sample}_1.fastq.gz" - R2 = "../../fastq/{sample}_2.fastq.gz" - -if config['TRIM'] == "Yes" or config['TRIM'] == "yes": - R1 = "../results/trimmed/{sample}_1_val_1.fq.gz" - R2 = "../results/trimmed/{sample}_2_val_2.fq.gz" - -rule pbrun_fq2bam: - input: - R1 = R1, - R2 = R2, - refgenome = expand("{refgenome}", refgenome = config['REFGENOME']) - output: - bam = protected("../results/mapped/{sample}_recalibrated.bam"), - index = protected("../results/mapped/{sample}_recalibrated.bam.bai"), - recal = temp("../results/mapped/{sample}_recal.txt") - resources: - gpu = config['GPU'] - params: - readgroup = "--read-group-sm {sample}", - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']), - recalibration_resources = expand("{recalibration_resources}", recalibration_resources = config['RECALIBRATION']['RESOURCES']) - log: - "logs/pbrun_fq2bam/{sample}.log" - benchmark: - "benchmarks/pbrun_fq2bam/{sample}.tsv" - message: - "Generating a BAM output for {input.R1} and {input.R2} using BWA-Mem, gatk MarkDuplicates and gatk BaseRecalibrator" - shell: - "pbrun fq2bam --ref {input.refgenome} 
--in-fq {input.R1} {input.R2} {params.recalibration_resources} --out-bam {output.bam} --out-recal {output.recal} --num-gpus {resources.gpu} {params.readgroup} --tmp-dir {params.tdir} {params.padding} {params.intervals} &> {log}" diff --git a/workflow/rules/pbrun_germline.smk b/workflow/rules/pbrun_germline.smk new file mode 100644 index 0000000..3163d38 --- /dev/null +++ b/workflow/rules/pbrun_germline.smk @@ -0,0 +1,27 @@ +rule pbrun_germline: + input: + fastq = get_input_fastq, + refgenome = expand("{refgenome}", refgenome = config['REFGENOME']) + output: + bam = protected("../results/mapped/{sample}_recalibrated.bam"), + bam_index = protected("../results/mapped/{sample}_recalibrated.bam.bai"), + vcf = get_output_vcf(config), + recal = temp("../results/mapped/{sample}_recal.txt") + resources: + gpu = config['GPU'] + params: + readgroup = "--read-group-sm {sample}", + tdir = config['TEMPDIR'], + padding = get_wes_padding_command, + intervals = get_wes_intervals_command, + recalibration_resources = get_recal_resources_command, + other_params = get_params + log: + "logs/pbrun_germline/{sample}.log" + benchmark: + "benchmarks/pbrun_germline/{sample}.tsv" + threads: config['THREADS'] + message: + "Running GPU accelerated germline variant pipeline workflow to generate BAM, vcf and recal output for {input.fastq}" + shell: + "pbrun germline --ref {input.refgenome} --in-fq {input.fastq} {params.recalibration_resources} --out-bam {output.bam} --out-variants {output.vcf} --out-recal-file {output.recal} --num-gpus {resources.gpu} {params.readgroup} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.other_params} --num-cpu-threads {threads} &> {log}" \ No newline at end of file diff --git a/workflow/rules/pbrun_haplotypecaller_cohort.smk b/workflow/rules/pbrun_haplotypecaller_cohort.smk deleted file mode 100644 index 190ae8c..0000000 --- a/workflow/rules/pbrun_haplotypecaller_cohort.smk +++ /dev/null @@ -1,24 +0,0 @@ -rule pbrun_haplotypecaller_cohort: - input: - bam = "../results/mapped/{sample}_recalibrated.bam", - index = "../results/mapped/{sample}_recalibrated.bam.bai", - recal = "../results/mapped/{sample}_recal.txt", - refgenome = expand("{refgenome}", refgenome = config['REFGENOME']) - output: - gvcf = temp("../results/called/{sample}_raw_snps_indels_tmp.g.vcf"), - index = temp("../results/called/{sample}_raw_snps_indels_tmp.g.vcf.idx") - resources: - gpu = config['GPU'] - params: - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']), - other = "--gvcf" - log: - "logs/pbrun_haplotypecaller_gvcf/{sample}.log" - benchmark: - "benchmarks/pbrun_haplotypecaller_gvcf/{sample}.tsv" - message: - "Calling germline SNPs and indels via local re-assembly of haplotypes for {input.bam}" - shell: - "pbrun haplotypecaller --ref {input.refgenome} --in-bam {input.bam} --in-recal-file {input.recal} --out-variants {output.gvcf} --num-gpus {resources.gpu} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.other} &> {log}" \ No newline at end of file diff --git a/workflow/rules/pbrun_haplotypecaller_single.smk b/workflow/rules/pbrun_haplotypecaller_single.smk deleted file mode 100644 index 3e5630c..0000000 --- a/workflow/rules/pbrun_haplotypecaller_single.smk +++ /dev/null @@ -1,22 +0,0 @@ -rule pbrun_haplotypecaller_single: - input: - bam = "../results/mapped/{sample}_recalibrated.bam", - index = 
"../results/mapped/{sample}_recalibrated.bam.bai", - recal = "../results/mapped/{sample}_recal.txt", - refgenome = expand("{refgenome}", refgenome = config['REFGENOME']) - output: - protected("../results/called/{sample}_raw_snps_indels.vcf") - resources: - gpu = config['GPU'] - params: - tdir = expand("{tdir}", tdir = config['TEMPDIR']), - padding = expand("{padding}", padding = config['WES']['PADDING']), - intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']) - log: - "logs/pbrun_haplotypecaller/{sample}.log" - benchmark: - "benchmarks/pbrun_haplotypecaller/{sample}.tsv" - message: - "Calling germline SNPs and indels via local re-assembly of haplotypes for {input.bam}" - shell: - "pbrun haplotypecaller --ref {input.refgenome} --in-bam {input.bam} --in-recal-file {input.recal} --out-variants {output} --num-gpus {resources.gpu} --tmp-dir {params.tdir} {params.padding} {params.intervals} &> {log}" \ No newline at end of file diff --git a/workflow/rules/pbrun_triocombinegvcf.smk b/workflow/rules/pbrun_triocombinegvcf.smk new file mode 100644 index 0000000..5987edc --- /dev/null +++ b/workflow/rules/pbrun_triocombinegvcf.smk @@ -0,0 +1,18 @@ +rule pbrun_triocombinegvcf: + input: + vcf_dummy = expand("../results/called/{sample}_raw_snps_indels_tmp.g.vcf", sample = SAMPLES), # a dummy vcf to connect this rule to gatk_HaplotypeCaller + refgenome = expand("{refgenome}", refgenome = config['REFGENOME']) + output: + temp("../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf") + resources: + gpu = config['GPU'] + params: + command = get_parabricks_combinegvcf_command + log: + "logs/pbrun_triocombinegvcf/{family}.log" + benchmark: + "benchmarks/pbrun_triocombinegvcf/{family}.tsv" + message: + "Merging one or more HaplotypeCaller GVCF files into a single GVCF" + shell: + "pbrun triocombinegvcf --ref {input.refgenome} {params.command} --out-variants {output} &> {log}" \ No newline at end of file diff --git a/workflow/rules/trim_galore_pe.smk b/workflow/rules/trim_galore_pe.smk index b56ba70..c3f0bb4 100644 --- a/workflow/rules/trim_galore_pe.smk +++ b/workflow/rules/trim_galore_pe.smk @@ -7,11 +7,11 @@ rule trim_galore_pe: temp("../results/trimmed/{sample}_2_val_2.fq.gz"), "../results/trimmed/{sample}_2.fastq.gz_trimming_report.txt" params: - adapters = expand("{adapters}", adapters = config['TRIMMING']['ADAPTERS']), - threads = expand("{threads}", threads = config['THREADS']), + adapters = config['TRIMMING']['ADAPTERS'], + threads = config['THREADS'], other = "-q 20 --paired" log: - "logs/trim_galore/{sample}.log" + "logs/trim_galore_pe/{sample}.log" benchmark: "benchmarks/trim_galore_pe/{sample}.tsv" conda: diff --git a/workflow/run.sh b/workflow/run.sh index 16cf940..3b8b621 100644 --- a/workflow/run.sh +++ b/workflow/run.sh @@ -6,4 +6,5 @@ snakemake \ --resources gpu=2 \ --use-conda \ --conda-frontend mamba \ +--latency-wait 120 \ --configfile ../config/config.yaml \ No newline at end of file diff --git a/workflow/run_hpc.sh b/workflow/run_hpc.sh index c6318cc..eabb0f6 100644 --- a/workflow/run_hpc.sh +++ b/workflow/run_hpc.sh @@ -6,7 +6,9 @@ snakemake \ --resources gpu=2 \ --use-conda \ --conda-frontend mamba \ +--latency-wait 120 \ --configfile ../config/config.yaml \ --cluster-config ../config/cluster.json \ --cluster "sbatch -A {cluster.account} \ --p {cluster.partition}" \ No newline at end of file +-p {cluster.partition} \ +-o {cluster.output}" \ No newline at end of file