From 7eddaee6a50bdda2f66a88b806951137f5113f07 Mon Sep 17 00:00:00 2001
From: Miguel Brown
Date: Wed, 11 Sep 2024 09:25:12 -0400
Subject: [PATCH 1/4] :pencil: update README and doc to clarify and fix broken links

---
 README.md | 13 ++++++++++++-
 ...rc-jointgenotyping-refinement-workflow.cwl | 19 +++++++++++++++----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 1ff450e..04b9683 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # Kids First DRC Joint Genotyping Workflow
-Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). Cohort sample variant calling and genotype refinement.
+Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). **_Small_** Cohort sample variant calling and genotype refinement.
+This workflow is intended for family cohort calling, typically mother-father-proband trios.
+If you wish to run on a larger cohort, please see our [Kids First-Sentieon Joint Cohort Calling](https://github.com/kids-first/Kids-First-Sentieon-Joint-Cohort-Genotyping-Workflow) workflow.
 Note: The DNA annotation has been significantly upgraded since v2.2.3, if you'd like to use the old version, revert to that release.
 Using existing gVCFs, likely from GATK Haplotype Caller, we follow this workflow: [Germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels), to create family joint calling and joint trios (typically mother-father-child) variant calls. Peddy is run to raise any potential issues in family relation definitions and sex assignment.
@@ -45,6 +47,15 @@ This workflow is the current production workflow, equivalent to this [CAVATICA p
  - CADDv1.6-38-whole_genome_SNVs.tsv.gz
  - clinvar_20220507_chr.vcf.gz
+## Outputs
+Note: Not all outputs are available from the Kids First Portal. If there is an output that you'd like that is produced by the workflow that is not available, please contact support@kidsfirstdrc.org.
+ - `collectvariantcallingmetrics`: Variant calling summary and detailed metrics files
+ - `peddy_html`: html summary of peddy results
+ - `peddy_ped`: ped format summary of peddy results
+ - `cgp_vep_annotated_vcf`: Variant Effect Predictor annotated VCF files. File suffix tyically `.multi.vqsr.filtered.denovo.vep_105.vcf.gz` .Contains joint calls with the following:
+   - `lowGQ` FILTER `GQ < 20.0`.
+   - Genotype posterior probabilities. See [here](https://gatk.broadinstitute.org/hc/en-us/articles/360037226592-CalculateGenotypePosteriors) for an explanation
+   - INFO tags of `hiConfDeNovo`, `loConfDeNovo`. See [here](https://gatk.broadinstitute.org/hc/en-us/articles/4409924802331-PossibleDeNovo) for more info
 ## Import info on cloning the git repo
 This repo takes advantage of the git submodule feature.
diff --git a/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl b/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl
index ea27ea6..e6a0d92 100644
--- a/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl
+++ b/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl
@@ -4,10 +4,12 @@ id: kfdrc-jointgenotyping-refinement-workflow
 label: Kids First DRC Joint Genotyping Workflow
 doc: |
   # Kids First DRC Joint Genotyping Workflow
-  Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). Cohort sample variant calling and genotype refinement.
+  Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). **_Small_** Cohort sample variant calling and genotype refinement.
+  This workflow is intended for family cohort calling, typically mother-father-proband trios.
+  If you wish to run on a larger cohort, please see our [Kids First-Sentieon Joint Cohort Calling](https://github.com/kids-first/Kids-First-Sentieon-Joint-Cohort-Genotyping-Workflow) workflow.
   Note: The DNA annotation has been significantly upgraded since v2.2.3, if you'd like to use the old version, revert to that release.
-  Using existing gVCFs, likely from GATK Haplotype Caller, we follow this workflow: [Germline short variant discovery (SNPs + Indels)](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145), to create family joint calling and joint trios (typically mother-father-child) variant calls. Peddy is run to raise any potential issues in family relation definitions and sex assignment.
+  Using existing gVCFs, likely from GATK Haplotype Caller, we follow this workflow: [Germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels), to create family joint calling and joint trios (typically mother-father-child) variant calls. Peddy is run to raise any potential issues in family relation definitions and sex assignment.
   If you would like to run this workflow using the CAVATICA public app, a basic primer on running public apps can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-Cavatica-af5ebb78c38a4f3190e32e67b4ce12bb). Alternatively, if you'd like to run it locally using `cwltool`, a basic primer on that can be found [here](https://www.notion.so/d3b/Starting-From-Scratch-Running-CWLtool-b8dbbde2dc7742e4aff290b0a878344d) and combined with app-specific info from the readme below.
@@ -21,7 +23,7 @@ doc: |
   ### Tips To Run:
   1. inputs vcf files are the gVCF files from GATK Haplotype Caller, need to have the index **.tbi** files copy to the same project too.
   1. If you are experiencing issues with Variant Recalibration either in VariantRecalibrator or ApplyVQSR, consider adjusting the max_gaussians. If a dataset gives fewer variants than the expected scale, the number of Gaussians for training should be turned down. Lowering the max-Gaussians forces the program to group variants into a smaller number of clusters, which results in more variants per cluster.
-  1. ped file in the input shows the family relationship between samples, the format should be the same as in GATK website [link](https://gatkforums.broadinstitute.org/gatk/discussion/7696/pedigree-ped-files), the Individual ID, Paternal ID and Maternal ID must be the same as in the inputs vcf files header.
+  1. ped file in the input shows the family relationship between samples, the format should be the same as in GATK website [link](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format), the Individual ID, Paternal ID and Maternal ID must be the same as in the inputs vcf files header.
   1. Here we recommend to use GRCh38 as reference genome to do the analysis, positions in gVCF should be GRCh38 too.
   1. Reference locations:
     - Broad Institute Goolge Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0
@@ -50,6 +52,15 @@ doc: |
    - CADDv1.6-38-whole_genome_SNVs.tsv.gz
    - clinvar_20220507_chr.vcf.gz
+  ## Outputs
+  Note: Not all outputs are available from the Kids First Portal. If there is an output that you'd like that is produced by the workflow that is not available, please contact support@kidsfirstdrc.org.
+   - `collectvariantcallingmetrics`: Variant calling summary and detailed metrics files
+   - `peddy_html`: html summary of peddy results
+   - `peddy_ped`: ped format summary of peddy results
+   - `cgp_vep_annotated_vcf`: Variant Effect Predictor annotated VCF files. File suffix tyically `.multi.vqsr.filtered.denovo.vep_105.vcf.gz` .Contains joint calls with the following:
+     - `lowGQ` FILTER `GQ < 20.0`.
+     - Genotype posterior probabilities. See [here](https://gatk.broadinstitute.org/hc/en-us/articles/360037226592-CalculateGenotypePosteriors) for an explanation
+     - INFO tags of `hiConfDeNovo`, `loConfDeNovo`. See [here](https://gatk.broadinstitute.org/hc/en-us/articles/4409924802331-PossibleDeNovo) for more info
   ## Import info on cloning the git repo
   This repo takes advantage of the git submodule feature.
@@ -379,5 +390,5 @@ hints:
 - VCF
 - VEP
 "sbg:links":
-- id: 'https://github.com/kids-first/kf-jointgenotyping-workflow/releases/tag/v2.4.0'
+- id: 'https://github.com/kids-first/kf-jointgenotyping-workflow/releases/tag/v2.4.1'
   label: github-release

From 789e14e3641f4a1924d62ae3490af419720d7845 Mon Sep 17 00:00:00 2001
From: Miguel Brown
Date: Tue, 17 Sep 2024 11:41:02 -0400
Subject: [PATCH 2/4] :pencil: added blurb about WGS vs WXS

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 04b9683..18929d5 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@
 Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). **_Small_** Cohort sample variant calling and genotype refinement.
 This workflow is intended for family cohort calling, typically mother-father-proband trios.
 If you wish to run on a larger cohort, please see our [Kids First-Sentieon Joint Cohort Calling](https://github.com/kids-first/Kids-First-Sentieon-Joint-Cohort-Genotyping-Workflow) workflow.
+Furthermore, in its current state, it follows best practices for WGS input only.
+While WXS data could be run, parameters are not currently optimized for that, but a planned update will allow for appropriate defaults to be set for either input type.
 Note: The DNA annotation has been significantly upgraded since v2.2.3, if you'd like to use the old version, revert to that release.
 Using existing gVCFs, likely from GATK Haplotype Caller, we follow this workflow: [Germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels), to create family joint calling and joint trios (typically mother-father-child) variant calls. Peddy is run to raise any potential issues in family relation definitions and sex assignment.

From c8d5942604be02bcaa9b80c475a3940d4847c76d Mon Sep 17 00:00:00 2001
From: Miguel Brown
Date: Tue, 17 Sep 2024 11:41:06 -0400
Subject: [PATCH 3/4] :pencil: added blurb about WGS vs WXS

---
 workflow/kfdrc-jointgenotyping-refinement-workflow.cwl | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl b/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl
index e6a0d92..2b4b4ee 100644
--- a/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl
+++ b/workflow/kfdrc-jointgenotyping-refinement-workflow.cwl
@@ -7,6 +7,8 @@ doc: |
   Kids First Data Resource Center Joint Genotyping Workflow (cram-to-deNovoGVCF). **_Small_** Cohort sample variant calling and genotype refinement.
   This workflow is intended for family cohort calling, typically mother-father-proband trios.
   If you wish to run on a larger cohort, please see our [Kids First-Sentieon Joint Cohort Calling](https://github.com/kids-first/Kids-First-Sentieon-Joint-Cohort-Genotyping-Workflow) workflow.
+  Furthermore, in its current state, it follows best practices for WGS input only.
+  While WXS data could be run, parameters are not currently optimized for that, but a planned update will allow for appropriate defaults to be set for either input type.
   Note: The DNA annotation has been significantly upgraded since v2.2.3, if you'd like to use the old version, revert to that release.
   Using existing gVCFs, likely from GATK Haplotype Caller, we follow this workflow: [Germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels), to create family joint calling and joint trios (typically mother-father-child) variant calls. Peddy is run to raise any potential issues in family relation definitions and sex assignment.

From 21c88dbdc1740e64ab6b05a7106da61b5c30423c Mon Sep 17 00:00:00 2001
From: Miguel Brown
Date: Tue, 17 Sep 2024 11:54:01 -0400
Subject: [PATCH 4/4] Update README.md

Co-authored-by: Dan Miller
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 18929d5..9204596 100644
--- a/README.md
+++ b/README.md
@@ -54,7 +54,7 @@ Note: Not all outputs are available from the Kids First Portal. If there is an o
  - `collectvariantcallingmetrics`: Variant calling summary and detailed metrics files
  - `peddy_html`: html summary of peddy results
  - `peddy_ped`: ped format summary of peddy results
- - `cgp_vep_annotated_vcf`: Variant Effect Predictor annotated VCF files. File suffix tyically `.multi.vqsr.filtered.denovo.vep_105.vcf.gz` .Contains joint calls with the following:
+ - `cgp_vep_annotated_vcf`: Variant Effect Predictor annotated VCF files. File suffix tyically `.multi.vqsr.filtered.denovo.vep_105.vcf.gz`. Contains joint calls with the following:
    - `lowGQ` FILTER `GQ < 20.0`.
    - Genotype posterior probabilities. See [here](https://gatk.broadinstitute.org/hc/en-us/articles/360037226592-CalculateGenotypePosteriors) for an explanation
    - INFO tags of `hiConfDeNovo`, `loConfDeNovo`. See [here](https://gatk.broadinstitute.org/hc/en-us/articles/4409924802331-PossibleDeNovo) for more info
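
The `Outputs` section added in this series says the `cgp_vep_annotated_vcf` file carries `hiConfDeNovo` and `loConfDeNovo` INFO tags. As a rough illustration of how a reader might pull those out, here is a minimal Python sketch using only the standard library. It assumes the tag values are comma-separated sample IDs and that the file is a standard tab-delimited VCF. The helper names (`parse_info`, `denovo_calls`, `scan_vcf`) are hypothetical and not part of the workflow.

```python
import gzip
from typing import Dict, Iterator, List, Tuple

# INFO keys named in the Outputs section of the README.
DENOVO_TAGS = ("hiConfDeNovo", "loConfDeNovo")


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into key -> value (flag-only keys map to '')."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def denovo_calls(vcf_line: str) -> Dict[str, List[str]]:
    """Return the de novo tags present on one VCF data line.

    Assumption: each tag value is a comma-separated list of sample IDs.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])  # INFO is the 8th VCF column
    return {tag: info[tag].split(",") for tag in DENOVO_TAGS if tag in info}


def scan_vcf(path: str) -> Iterator[Tuple[str, int, Dict[str, List[str]]]]:
    """Yield (chrom, pos, tags) for every record carrying a de novo tag."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # skip header lines
                continue
            calls = denovo_calls(line)
            if calls:
                chrom, pos = line.split("\t", 2)[:2]
                yield chrom, int(pos), calls
```

A call such as `for chrom, pos, tags in scan_vcf("family.multi.vqsr.filtered.denovo.vep_105.vcf.gz"): print(chrom, pos, tags)` would list candidate sites; the file name here is only a placeholder built from the suffix mentioned in the README.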