This workflow takes a set of gVCFs and an indexed reference FASTA to create a joint-called cohort VCF.
The input gVCFs are sharded by chromosome. To accelerate the process, the sharding is done across 60 instances, each processing 12 files at a time.
Next, gVCF shards that share a common chromosome are gathered and passed to Sentieon GVCFtyper, producing one joint VCF per chromosome.
If the user supplies a dbSNP VCF at run time, the outputs will also be annotated with rs IDs in the ID column of the VCF.
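The scatter-then-gather step described above can be sketched in Python. This only illustrates the grouping logic and is not the workflow's actual implementation; the shard naming scheme (`<sample>.<chrom>.g.vcf.gz`) and the helper name `group_shards_by_chromosome` are assumptions made for the example.

```python
from collections import defaultdict

SHARD_SUFFIX = ".g.vcf.gz"  # assumed suffix for the hypothetical shard names


def group_shards_by_chromosome(shard_names):
    """Group per-chromosome gVCF shards so each chromosome can be joint-called.

    Assumes hypothetical shard names of the form <sample>.<chrom>.g.vcf.gz.
    """
    groups = defaultdict(list)
    for name in shard_names:
        if not name.endswith(SHARD_SUFFIX):
            raise ValueError(f"unexpected shard name: {name}")
        # Drop the suffix, then take the last dot-separated field as the chromosome.
        stem = name[: -len(SHARD_SUFFIX)]
        chrom = stem.rsplit(".", 1)[1]
        groups[chrom].append(name)
    return dict(groups)


shards = [
    "sampleA.chr1.g.vcf.gz",
    "sampleA.chr2.g.vcf.gz",
    "sampleB.chr1.g.vcf.gz",
    "sampleB.chr2.g.vcf.gz",
]
# Each value is the set of shards that one per-chromosome joint-calling job would receive.
grouped = group_shards_by_chromosome(shards)
```

Each key of `grouped` corresponds to one joint VCF in the final output.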
As of May 2024, two main limitations exist for running this cohort workflow on CAVATICA:
- Cohort Size
- File input size
Cohort size must be less than ~2200 samples; otherwise the platform will quit immediately with the error:
The scheduled job root is too large to be executed. Please contact support@sbgenomics.com for further assistance if needed.
CAVATICA instances, by default, start with 1TB of EBS storage. The storage can be increased through the Execution Settings all the way up to 4TB.
If 4TB is insufficient to process your cohort, contact CAVATICA support for assistance. Alternatively, try refining the cohort or limiting the sites of interest.
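A rough pre-flight check for both limits could look like the following sketch. The thresholds are taken from the text above (roughly 2200 samples, 1TB default and 4TB maximum storage); the function name `check_cohort_limits` and the idea of checking before task submission are assumptions for illustration, not a platform feature.

```python
MAX_COHORT_SIZE = 2200      # approximate sample limit stated above (as of May 2024)
DEFAULT_STORAGE_GB = 1024   # 1TB default EBS storage per instance
MAX_STORAGE_GB = 4096       # 4TB maximum reachable through the Execution Settings


def check_cohort_limits(n_samples, requested_storage_gb=DEFAULT_STORAGE_GB):
    """Return a list of problems; an empty list means neither known limit is hit."""
    problems = []
    if n_samples >= MAX_COHORT_SIZE:
        problems.append(
            f"cohort of {n_samples} samples is not below the ~{MAX_COHORT_SIZE} sample limit"
        )
    if requested_storage_gb > MAX_STORAGE_GB:
        problems.append(
            f"{requested_storage_gb} GB of storage exceeds the {MAX_STORAGE_GB} GB maximum"
        )
    return problems


for problem in check_cohort_limits(2500, requested_storage_gb=8192):
    print("WARNING:", problem)
```

Because the sample limit is approximate, a cohort close to 2200 samples may still fail even if this check passes.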
reference
: Indexed FASTA file reference. Should be the same one used to create the input gVCFs.
input_vcf
: Indexed gVCF files to perform cohort calling on. It's recommended that each of these inputs was generated with the GATK HaplotypeCaller workflow or the Sentieon equivalent. gVCFs obtained from Gabriella Miller Kids First (GMKF) meet this recommendation.
sentieon_license
: Sentieon license server host and port in the format 0.0.0.0:0000. Set by default by the workflow, but can be changed if circumstances require it.
dbSNP
: Indexed dbSNP file used to add common variant annotation to the ID field.
fai_subset
: Parameter used to set which chromosomes are used, based on the FASTA index file. It's generally recommended to use the canonical chromosomes (chr1-22,X,Y). By default, this parameter is set to 24 to capture the first 24 chromosomes of the FASTA index.
bcftools_cpu
: Dictates the amount of stacking to perform for gVCF split steps. By default set to 4, so the number of concurrent files per instance is the number of cores in the instance divided by bcftools_cpu.
gvcf_typer_cpus
: Number of CPUs each Sentieon GVCFtyper job should attempt to use. Default for workflow is 48. Adjust this if you change the instance type in the Execution Settings at task setup.
gvcf_typer_mem
: Amount of RAM in GB each Sentieon GVCFtyper job should attempt to use. Default for workflow is 48, but in practice usage will likely be less than 30. Adjust this if you change the instance type in the Execution Settings at task setup.
call_conf
: Call confidence level (default: 30)
emit_conf
: Emit confidence level (default: 30)
genotype_model
: "coalescent" or "multinomial", default: multinomial. While the coalescent mode is theoretically more accurate for smaller cohorts, the multinomial mode is equally accurate with large cohorts and scales better with a very large number of samples.
output_file_prefix
: Since outputs will be joint calls split by chromosome, a user-defined convenience to prefix each file output with a string of their choice. Default is "joint_call", so output files would be named "joint_call_chr1.vcf.gz", "joint_call_chr2.vcf.gz", etc.
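As a worked example of the bcftools_cpu arithmetic: assuming a 48-core instance (c5.12xlarge has 48 vCPUs) and the default bcftools_cpu of 4, each instance splits 12 gVCFs at a time. The function names below are hypothetical, and the wave count is only a rough estimate that ignores scheduling overhead and uneven file sizes.

```python
import math


def concurrent_files_per_instance(instance_cores, bcftools_cpu=4):
    """Number of gVCFs one instance splits at once: cores divided by bcftools_cpu."""
    return instance_cores // bcftools_cpu


def split_waves(n_gvcfs, n_instances, instance_cores, bcftools_cpu=4):
    """Rough number of rounds needed to split all gVCFs across all instances."""
    per_instance = concurrent_files_per_instance(instance_cores, bcftools_cpu)
    return math.ceil(n_gvcfs / (n_instances * per_instance))


per_instance = concurrent_files_per_instance(48)  # 48 // 4 = 12 files at a time
waves = split_waves(1500, n_instances=60, instance_cores=48)  # ceil(1500 / 720) = 3
print(per_instance, waves)
```

Raising bcftools_cpu lowers the number of files processed at once on each instance, which may help if the split steps are constrained by memory or disk rather than CPU.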
joint_called_by_chr_vcf
: Array of joint-called files split by chromosome
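To illustrate how fai_subset and output_file_prefix relate to the output array, the sketch below takes the first N entries of a FASTA index and builds the expected file names. It assumes the standard .fai layout (tab-separated, sequence name in the first column) and the "<prefix>_<chromosome>.vcf.gz" naming from the output_file_prefix description; the function names are hypothetical and this is not the workflow's own code.

```python
def first_contigs(fai_lines, fai_subset=24):
    """Return the names of the first `fai_subset` sequences listed in a .fai index."""
    contigs = []
    for line in fai_lines:
        if len(contigs) == fai_subset:
            break
        line = line.strip()
        if line:
            # The first tab-separated column of a .fai line is the sequence name.
            contigs.append(line.split("\t")[0])
    return contigs


def expected_outputs(contigs, output_file_prefix="joint_call"):
    """Build one output file name per chromosome."""
    return [f"{output_file_prefix}_{contig}.vcf.gz" for contig in contigs]


# Toy index lines; the numeric columns are placeholders, not real values.
toy_fai = [
    "chr1\t1000\t6\t60\t61",
    "chr2\t900\t1030\t60\t61",
    "chr3\t800\t1952\t60\t61",
]
names = expected_outputs(first_contigs(toy_fai, fai_subset=2))
print(names)
```

With a real index, `fai_lines` could be an open file handle on the .fai file, since the function only iterates over lines.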
- The default 1TB storage per instance might be enough for cohorts of up to 1500 samples. To be safe, set it to at least 2TB, if not more, for larger cohorts in the task's execution hints; see the documentation: https://docs.sevenbridges.com/docs/set-execution-hints-at-task-level. An example would be to use the following:
- Instance type: c5.12xlarge
- EBS storage: 2048 GB, up to 4096 GB
- Number of parallel instances: 60. Most users have a maximum limit of 80 per account
- Instance type: