# Run pipeface on NCI
Clone the pipeface repository and move into it
git clone https://github.com/leahkemp/pipeface.git
cd pipeface
Note: Variant annotation is only available for hg38
Get a copy of the hg38 reference genome
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.fa.gz .
Check that the download was successful by verifying the md5sum
md5sum hg38.analysisSet.fa.gz
Expected md5sum
6d3c82e1e12b127d526395294526b9c8 hg38.analysisSet.fa.gz
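If you'd rather have the comparison made for you, `md5sum -c` can check the downloaded file against the expected checksum; a minimal sketch:

```bash
# compare the downloaded file against the expected checksum
# (md5sum -c expects two spaces between the hash and the filename)
echo "6d3c82e1e12b127d526395294526b9c8  hg38.analysisSet.fa.gz" | md5sum -c - \
  || echo "checksum mismatch - re-download hg38.analysisSet.fa.gz"
```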
gunzip and build index
gunzip hg38.analysisSet.fa.gz
module load samtools/1.19
samtools faidx hg38.analysisSet.fa
Get a copy of the hs1 reference genome
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/hs1.fa.gz .
Check that the download was successful by verifying the md5sum
md5sum hs1.fa.gz
Expected md5sum
a493d5402cc86ecc3f54f6346d980036 hs1.fa.gz
gunzip and build index
gunzip hs1.fa.gz
module load samtools/1.19
samtools faidx hs1.fa
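The `.fai` index produced by `samtools faidx` is a plain-text table (contig name, length, offset, line bases, line width), so a quick way to sanity check either index is to list the first few contigs and their lengths:

```bash
# list the first few contig names and lengths from the faidx index
cut -f1,2 hs1.fa.fai | head
```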
Clone the Rerio GitHub repository
git clone https://github.com/nanoporetech/rerio
Get a copy of the Clair3 models for ONT data
python3 rerio/download_model.py --clair3
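The exact directory layout of the downloaded models can differ between Rerio versions, so a quick search is a safe way to locate the model directory you'll point to in `in_data.csv` (e.g. `r1041_e82_400bps_sup_v420`):

```bash
# locate downloaded Clair3 model directories inside the rerio checkout
# (the exact layout may vary between Rerio versions)
find rerio -maxdepth 3 -type d -name "r1041*"
```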
Get a copy of the Clair3 model for PacBio HiFi (Revio) data
wget http://www.bio8.cs.hku.hk/clair3/clair3_models/hifi_revio.tar.gz
Untar
tar -xvf hifi_revio.tar.gz
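The tarball should extract to a `hifi_revio/` model directory; a quick check of the extracted contents, assuming that layout:

```bash
# confirm the extracted PacBio HiFi model directory exists and list its contents
ls -l hifi_revio/
```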
Get a local copy of the mosdepth v0.3.9 binary
wget https://github.com/brentp/mosdepth/releases/download/v0.3.9/mosdepth -O mosdepth_0.3.9
chmod +x mosdepth_0.3.9
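It's worth confirming the downloaded binary actually runs on the system you're on; printing its help text is a low-cost check:

```bash
# confirm the mosdepth binary is executable on this system
./mosdepth_0.3.9 -h
```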
Get a local copy of the pb-CpG-tools v2.3.2 binary
wget https://github.com/PacificBiosciences/pb-CpG-tools/releases/download/v2.3.2/pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu.tar.gz
tar -xzf pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu.tar.gz
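Similarly for pb-CpG-tools; the release tarball ships the `aligned_bam_to_cpg_scores` executable under `bin/` (path assumed from the v2.3.2 release layout):

```bash
# confirm the pb-CpG-tools executable runs
# (the bin/ path is assumed from the v2.3.2 release layout)
./pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/bin/aligned_bam_to_cpg_scores --help
```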
Specify the sample ID, family ID (optional), file path to the data, data type, file path to a regions of interest BED file (optional) and file path to a Clair3 model (if running Clair3) for each file to be analysed (a small sanity-check sketch follows the requirements list below). Eg:
sample_id,family_id,file,data_type,regions_of_interest,clair3_model
sample_01,,/g/data/kr68/test_data/PGXXXX240090_minimal.fastq.gz,ont,/g/data/kr68/genome/ReadFish_v9_gene_targets.collapsed.hg38.bed,/g/data/kr68/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_01,,/g/data/kr68/test_data/PGXXXX240091_minimal.fastq.gz,ont,/g/data/kr68/genome/ReadFish_v9_gene_targets.collapsed.hg38.bed,/g/data/kr68/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_02,,/g/data/kr68/test_data/PGXXXX240092_minimal.fastq,ont,/g/data/kr68/genome/ReadFish_v9_gene_targets.collapsed.hg38.bed,/g/data/kr68/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_03,,/g/data/kr68/test_data/PGXXOX240065_minimal.bam,ont,NONE,/g/data/kr68/clair3_models/ont/r1041_e82_400bps_sup_v420/
sample_04,,/g/data/kr68/test_data/m84088_240403_023825_s1.hifi_reads.bc2034_minimal.fastq,pacbio,NONE,/g/data/kr68/clair3_models/hifi_revio/
sample_04,,/g/data/kr68/test_data/m84088_240403_043745_s2.hifi_reads.bc2035_minimal.fastq,pacbio,NONE,/g/data/kr68/clair3_models/hifi_revio/
Requirements:
- leave `family_id` empty if not required
- `family_id` is currently only used to organise the output files into subdirectories of `family_id` (if provided). Please provide all entries for a given `sample_id` with the same `family_id` (this is currently not error checked)
- set `regions_of_interest` to 'NONE' if not required
- similarly, set `clair3_model` to 'NONE' if not required (ie. if you have not selected clair3 as the SNP/indel caller)
- provide full file paths
- multiple entries for a given `sample_id` are required to have the same file extension in the `file` column (eg. '.bam', '.fastq.gz' or '.fastq')
- for entries in the `file` column, the file extension must be either '.bam', '.fastq.gz' or '.fastq' (as appropriate)
- for entries in the `file` column, files containing methylation data should be provided in uBAM format (and not FASTQ format)
- entries in the `data_type` column must be either 'ont' or 'pacbio' (as appropriate)
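A minimal sanity-check sketch for the input CSV, assuming the six-column layout shown above and the `./config/in_data.csv` path used later in this guide; it flags unexpected `data_type` values, unsupported file extensions and missing files or model directories:

```bash
#!/usr/bin/env bash
# quick sanity check of in_data.csv (assumes the six-column layout shown above)
csv="./config/in_data.csv"

# skip the header, then check each data row
tail -n +2 "$csv" | while IFS=, read -r sample_id family_id file data_type regions_of_interest clair3_model; do
    # data_type must be 'ont' or 'pacbio'
    [[ "$data_type" == "ont" || "$data_type" == "pacbio" ]] \
        || echo "WARN: $sample_id has unexpected data_type '$data_type'"
    # the file extension must be '.bam', '.fastq.gz' or '.fastq'
    [[ "$file" == *.bam || "$file" == *.fastq.gz || "$file" == *.fastq ]] \
        || echo "WARN: $sample_id file '$file' has an unsupported extension"
    # input files must exist (full paths are required)
    [[ -f "$file" ]] || echo "WARN: $sample_id input file not found: $file"
    # regions_of_interest and clair3_model must exist unless set to NONE
    [[ "$regions_of_interest" == "NONE" || -f "$regions_of_interest" ]] \
        || echo "WARN: $sample_id regions_of_interest not found: $regions_of_interest"
    [[ "$clair3_model" == "NONE" || -d "$clair3_model" ]] \
        || echo "WARN: $sample_id clair3_model directory not found: $clair3_model"
done
```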
Modify the NCI project to which to charge the analysis. Eg:
project = 'kr68'
Modify access to project-specific directories. Eg:
storage = 'gdata/if89+gdata/xy86+scratch/kr68+gdata/kr68+gdata/ox63'
Note: Don't remove access to if89 gdata (`gdata/if89`) and xy86 gdata (`gdata/xy86`). These are required to access environmental modules and variant annotation databases used in the pipeline.
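Since `gdata/if89` and `gdata/xy86` must stay in the storage string, a quick grep can confirm they're still listed after editing (the `./config/nextflow_pipeface.config` path is assumed to be where this setting lives, as used in the run commands below):

```bash
# confirm the required gdata areas are still listed in the storage setting
grep "storage" ./config/nextflow_pipeface.config | grep -q "gdata/if89" \
  && grep "storage" ./config/nextflow_pipeface.config | grep -q "gdata/xy86" \
  && echo "required gdata areas present" \
  || echo "WARNING: gdata/if89 and/or gdata/xy86 missing from the storage setting"
```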
Specify the path to `in_data.csv`. Eg:
"in_data": "./config/in_data.csv",
Specify the input data format ('ubam_fastq'). Eg:
"in_data_format": "ubam_fastq",
Specify the path to the reference genome and its index. Eg:
"ref": "./hg38.fa",
"ref_index": "./hg38.fa.fai",
OR
"ref": "./hs1.fa",
"ref_index": "./hs1.fa.fai",
Optionally specify the path to the tandem repeat bed file. Set to 'NONE' if not required. Eg:
"tandem_repeat": "./hg38.analysisSet.trf.bed",
OR
"tandem_repeat": "NONE"
Specify the SNP/indel caller to use ('clair3' or 'deepvariant'). Eg:
"snp_indel_caller": "clair3",
OR
"snp_indel_caller": "deepvariant",
Note: Running DeepVariant on ONT data assumes r10 data
Specify the SV caller to use ('sniffles', 'cutesv' or 'both'). Eg:
"sv_caller": "sniffles",
OR
"sv_caller": "cutesv",
OR
"sv_caller": "both",
Specify whether variant annotation should be carried out ('yes' or 'no'). Eg:
"annotate": "yes",
OR
"annotate": "no",
Note: variant annotation is only available for hg38
Specify whether alignment depth should be calculated ('yes' or 'no'). Eg:
"calculate_depth": "yes",
OR
"calculate_depth": "no",
Specify the directory in which to write the pipeline outputs (please provide a full path). Eg:
"outdir": "/g/data/ox63/results"
Specify the path to the mosdepth binary (if running depth calculation). Eg:
"mosdepth_binary": "./mosdepth_0.3.9"
OR
"mosdepth_binary": "NONE"
Specify the path to the pb-CpG-tools binary (if processing pacbio data). Eg:
"pbcpgtools_binary": "./pb-CpG-tools-v2.3.2-x86_64-unknown-linux-gnu/"
OR
"pbcpgtools_binary": "NONE"
You may use the centrally installed Nextflow environmental module available on NCI to access the Nextflow and Java dependencies:
module load nextflow/24.04.1
Optionally, do a stub run first to quickly check that the configuration and input data parse correctly:
nextflow run pipeface.nf -stub -params-file ./config/parameters_pipeface.json -config ./config/nextflow_pipeface.config
Then launch the full run:
nextflow run pipeface.nf -params-file ./config/parameters_pipeface.json -config ./config/nextflow_pipeface.config -with-timeline -with-dag -with-report
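If a run is interrupted or fails part-way through, Nextflow's `-resume` flag reuses cached results from the work directory rather than recomputing completed tasks:

```bash
# resume an interrupted run, reusing any cached task results
nextflow run pipeface.nf -resume -params-file ./config/parameters_pipeface.json -config ./config/nextflow_pipeface.config
```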
The resources requested and the queue each process is submitted to may be changed by editing `./config/nextflow_pipeface.config`.
Similarly, with some coding skills, the environmental modules used by each process in the pipeline may be modified. This means you're able to substitute in different versions of software used by the pipeline. However, keep in mind that the pipeline doesn't account for differences in parameterisation between software versions.
This also means the pipeline is adaptable to other HPCs if appropriate environmental modules are included in `./config/nextflow_pipeface.config` (or if you get around to creating a nextflow configuration file pointing to appropriate containerised software before I do) and the job-scheduler-specific configuration is modified if needed. If you wish to use the variant annotation component of the pipeline, you'll additionally need to create local copies of the variant annotation databases used by the pipeline.
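As a concrete illustration of the kind of change described above, a process-level override could look something like the sketch below. The process name `MINIMAP2` and the module version are hypothetical; check `pipeface.nf` for the real process names and `module avail` on NCI for the available versions:

```bash
# sketch: override the environmental module loaded by one process
# (the process name 'MINIMAP2' and the module version are hypothetical)
cat > ./config/module_overrides.config <<'EOF'
process {
    withName: 'MINIMAP2' {
        module = 'minimap2/2.28'
    }
}
EOF
# then merge this block into ./config/nextflow_pipeface.config,
# or include it from that file with: includeConfig 'module_overrides.config'
```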