prettier changes

bacterial-genomics · Nov 15, 2024 · c223b7c
1 parent 6ed010a
commit c223b7c
Showing 9 changed files with 112 additions and 45 deletions.
diff --git a/.devcontainer/devcontainer.json b/.devcontainer/devcontainer.json
@@ -21,7 +21,11 @@
             },
 
             // Add the IDs of extensions you want installed when the container is created.
-            "extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]
+            "extensions": [
+                "ms-python.python",
+                "ms-python.vscode-pylance",
+                "nf-core.nf-core-extensionpack"
+            ]
         }
     }
 }
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,45 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## v3.0.0 - November 15, 2024
+
+### `Added`
+
+- Consistent metrics reported for each read cleaning step (@chrisgulvik)
+- Added SeqFu for FastQ format validation (@chrisgulvik)
+- Checksum (SHA-512) reporting of intermediate and output files (@chrisgulvik)
+- Report full input paths for each sample (@chrisgulvik)
+- For assembly depth reporting, added stdev depth metrics; added total paired+single mapped stats (@chrisgulvik)
+
+### `Changed`
+
+- Default uses SeqKit rather than SeqTk for downsampling (@chrisgulvik)
+- Output structure and filenames revised (@chrisgulvik)
+- For MLST, exclude all MLST databases with a \*\_<int> by default (> 1) to ensure the original MLST database version is used for each taxon (e.g., excludes leptospira_2 and leptospira_3) and avoids inconsistent versions used within a run which would occasionally give one sample a leptospira and a different sample leptospira_3 making it impossible to immediately compare between samples. (@chrisgulvik)
+- For MLST, store novel FastA when that situation occurs (@chrisgulvik)
+- Sample name in outputs and file content no longer contains assembler name (@chrisgulvik)
+- Changed RDP output to exclude unnecessary data columns such as "Phylum\nphylum", "Genus\ngenus" (@chrisgulvik)
+- Use both R1 and R2 and only Phred30 for estimate bp input for more accurate estimation of genome size (@chrisgulvik)
+- Changed default to always on to store stats and FastA of discarded contigs during biopython filtering (@chrisgulvik)
+- Output filenames within `pipeline_info/` changed to show month by name and include day of the week (@chrisgulvik)
+
+### `Fixed`
+
+- Order of operations in Trimmomatic process now ensures final output reads have minimum sequence length (default: 50 bp) (@chrisgulvik)
+- Fixed issue with missing column header names in the .kraken_summary.tsv output files (@chrisgulvik)
+- Fixed trailing tab character in Kraken1 and Kraken2 output TSV summaries, which made pandas XLSX conversion fail due to different column numbers in header and data (@chrisgulvik)
+- Fixed VERSION reporting RDP bug by removing spaces (@chrisgulvik)
+
+### `Updated`
+
+- Coloring of workflow process now corresponds to tab color in XLSX output summary sheet (@chrisgulvik)
+- Docker container version updates (@chrisgulvik)
+- Updated description on output files based on new files created as well as some renamed output files (@chrisgulvik)
+
+### `Deprecated`
+
+- Removed gene calling from QUAST output summary (@chrisgulvik)
+
 ## v2.4.0 - August 28, 2024
 
 ### `Added`
@@ -19,7 +58,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### `Updated`
 
-- TSV output data files have header names with no spaces, all underscores replaced them
+- TSV output data files have header names with no spaces, all underscores replaced them (@chrisgulvik)
 
 ### `Deprecated`
 

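The `Fixed` entry about trailing tab characters in the Kraken summaries can be illustrated with a small sketch. This is not the pipeline's code: the column names, the sample row, and the helpers `field_counts` and `strip_trailing_tabs` are hypothetical, and the standard-library `csv` module stands in for pandas to show why the header and data rows end up with different field counts.

```python
import csv
import io


def field_counts(tsv_text):
    """Return the number of tab-separated fields found on each line."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [len(row) for row in reader]


def strip_trailing_tabs(tsv_text):
    """Remove tab characters from the end of each line only."""
    lines = (line.rstrip("\t") for line in tsv_text.splitlines())
    return "\n".join(lines) + "\n"


# Hypothetical two-column summary whose data row ends with a stray tab.
broken = "Sample_name\tReads_(%)\nsampleA\t97.5\t\n"

print(field_counts(broken))                       # header: 2 fields, data row: 3
print(field_counts(strip_trailing_tabs(broken)))  # both rows: 2 fields
```

Stripping at the end of the line is only safe when the last column is never legitimately empty; otherwise the writer that emits the extra tab is the better place for the fix.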
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -67,6 +67,7 @@
   > Constantinides B, Hunt M, Crook DW. Hostile: accurate decontamination of microbial host sequences. Dec 1 2023;39(12):btad728. doi: 10.1093/bioinformatics/btad728
 
 - [Kraken](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/)
+
   > Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014/03/03 2014;15(3):R46. doi:10.1186/gb-2014-15-3-r46
 
 - [Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0)
@@ -130,6 +131,7 @@
   > Bankevich A, Nurk S, Antipov D, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. May 2012;19(5):455-77. doi:10.1089/cmb.2012.0021
 
 - [SPAdes latest](https://pubmed.ncbi.nlm.nih.gov/32559359/)
+
   > Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics. Jun 2020;70(1):e102. doi: 10.1002/cpbi.102
 
 - [Trimmomatic](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/)

diff --git a/README.md b/README.md
@@ -181,8 +181,7 @@ PhiX reference [NC_001422.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_001422.1) c
                         [Default: NaN]
 ```
 
-> [!NOTE]
-> _If user does not specify inputs for parameters with a default set to `NaN`, these options will not be performed during workflow analysis._
+> [!NOTE] > _If user does not specify inputs for parameters with a default set to `NaN`, these options will not be performed during workflow analysis._
 
 ### Additional parameters
 
@@ -201,9 +200,9 @@ nextflow run \
 The most well-tested and supported is a Univa Grid Engine (UGE) job scheduler with Singularity for dependency handling.
 
 1. UGE/SGE
-    - Additional tips for UGE processing are [here](docs/HPC-UGE-scheduler.md).
+   - Additional tips for UGE processing are [here](docs/HPC-UGE-scheduler.md).
 2. No Scheduler
-    - It has also been confirmed to work on desktop and laptop environments without a job scheduler using Docker with more tips [here](docs/local-device.md).
+   - It has also been confirmed to work on desktop and laptop environments without a job scheduler using Docker with more tips [here](docs/local-device.md).
 
 ## Output
 

diff --git a/docs/ADD_MODULE_GUIDE.md b/docs/ADD_MODULE_GUIDE.md
@@ -17,15 +17,15 @@ If you're not used to this workflow with git, you can start with some [docs from
 The first step is to fork the [wf-paired-end-illumina-workflow](https://github.com/bacterial-genomics/wf-paired-end-illumina-assembly) repository:
 
 1. On the [GitHub repository](https://github.com/bacterial-genomics/wf-paired-end-illumina-assembly) in the top right corner, click **Fork**.
-    ![GitHub fork](images/github_fork.PNG)
+   ![GitHub fork](images/github_fork.PNG)
 2. Under "Owner", select the dropdown menu and click an owner for the forked repository.
 3. By default, forks are named the same as their upstream repositories. Optionally, to further distinguish your fork, in the "Repository name" field, type a name.
 4. Unselect "Copy the `main` branch only". The new module should be added to the `dev` branch of the workflow.
 5. Click **Create fork**.
 6. Then clone your forked repository:
-    `git clone https://github.com/YOURUSERNAME/wf-paired-end-illumina-assembly.git`
+   `git clone https://github.com/YOURUSERNAME/wf-paired-end-illumina-assembly.git`
 7. Then create a new branch on your forked repository:
-    `git checkout -b NEWBRANCHNAME`
+   `git checkout -b NEWBRANCHNAME`
 
 Please create a new branch with the appropriate branch name for the module you are trying to add. This will make things easier when reviewing and ultimately merging the branches on the repository.
 

diff --git a/docs/output.md b/docs/output.md
@@ -344,7 +344,7 @@ The final assembly file is scanned against PubMLST typing schemes to determine t
 <summary>MLST output interpretation</summary>
 
 | Symbol | Meaning                               | Length          | Identity       |
-|--------|---------------------------------------|-----------------|----------------|
+| ------ | ------------------------------------- | --------------- | -------------- |
 | `n`    | exact intact allele                   | 100%            | 100%           |
 | `~n`   | novel full length allele similar to n | 100%            | &ge; `--minid` |
 | `n?`   | partial match to known allele         | &ge; `--mincov` | &ge; `--minid` |
@@ -414,6 +414,7 @@ The GenBank file is parsed for 16S rRNA gene records (with BioPython). If there
   - `[assembler].16S_top_species_BLAST.tsv`: Summary of the best BLAST alignment for each sample.
 
 - `SSU/BLAST/`
+
   - `[sample]-[assembler].blast.tsv.gz`: Full, not yet bitscore sorted, BLASTn output for each 16S rRNA gene record in tab-separated value (TSV) format using the BLAST outfmt 6 standard with additional taxonomy fields
 
 - `SSU/RDP/`

diff --git a/docs/usage.md b/docs/usage.md
@@ -54,7 +54,7 @@ CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
 ```
 
 | Column    | Description                                                                                                                                                                            |
-|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `sample`  | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
 | `fastq_1` | Full path to FastQ file for Illumina short reads 1. File must be gzipped and have the extension ".fastq.gz" or ".fq.gz".                                                               |
 | `fastq_2` | Full path to FastQ file for Illumina short reads 2. File must be gzipped and have the extension ".fastq.gz" or ".fq.gz".                                                               |
@@ -239,35 +239,35 @@ The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementatio
 2. Find the latest version of the Biocontainer available on [Quay.io](https://quay.io/repository/biocontainers/pangolin?tag=latest&tab=tags)
 3. Create the custom config accordingly:
 
-    - For Docker:
-
-    ```nextflow
-    process {
-        withName: PANGOLIN {
-            container = 'quay.io/biocontainers/pangolin:3.0.5--pyhdfd78af_0'
-        }
-    }
-    ```
-
-    - For Singularity:
-
-    ```nextflow
-    process {
-        withName: PANGOLIN {
-            container = 'https://depot.galaxyproject.org/singularity/pangolin:3.0.5--pyhdfd78af_0'
-        }
-    }
-    ```
-
-    - For Conda:
-
-    ```nextflow
-    process {
-        withName: PANGOLIN {
-            conda = 'bioconda::pangolin=3.0.5'
-        }
-    }
-    ```
+   - For Docker:
+
+   ```nextflow
+   process {
+       withName: PANGOLIN {
+           container = 'quay.io/biocontainers/pangolin:3.0.5--pyhdfd78af_0'
+       }
+   }
+   ```
+
+   - For Singularity:
+
+   ```nextflow
+   process {
+       withName: PANGOLIN {
+           container = 'https://depot.galaxyproject.org/singularity/pangolin:3.0.5--pyhdfd78af_0'
+       }
+   }
+   ```
+
+   - For Conda:
+
+   ```nextflow
+   process {
+       withName: PANGOLIN {
+           conda = 'bioconda::pangolin=3.0.5'
+       }
+   }
+   ```
 
 > [!NOTE]
 > If you wish to periodically update individual tool-specific results (e.g., Pangolin) generated by the pipeline then you must ensure to keep the `work/` directory otherwise the `-resume` ability of the pipeline will be compromised and it will restart from scratch.

diff --git a/modules/local/assess_assembly_checkm2/README.md b/modules/local/assess_assembly_checkm2/README.md
@@ -9,9 +9,10 @@ This process uses [CheckM2](https://github.com/chklovski/CheckM2) published in [
 ## How CheckM2 works
 
 From [CheckM2's documentation](https://github.com/chklovski/CheckM2):
+
 > CheckM2 uses two distinct machine learning models to predict genome completeness.
 >
 > - The 'general' gradient boost model is able to generalize well and is intended to be used on organisms not well represented in GenBank or RefSeq (roughly, when an organism is novel at the level of order, class or phylum).
 > - The 'specific' neural network model is more accurate when predicting completeness of organisms more closely related to the reference training set (roughly, when an organism belongs to a known species, genus or family).
-> CheckM2 uses a cosine similarity calculation to automatically determine the appropriate completeness model for each input genome, but you can also force the use of a particular completeness model, or get the prediction outputs for both.
-> There is only one contamination model (based on gradient boost) which is applied regardless of taxonomic novelty and works well across all cases.
+>   CheckM2 uses a cosine similarity calculation to automatically determine the appropriate completeness model for each input genome, but you can also force the use of a particular completeness model, or get the prediction outputs for both.
+>   There is only one contamination model (based on gradient boost) which is applied regardless of taxonomic novelty and works well across all cases.
diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -438,14 +438,28 @@
                     "default": "rRNA",
                     "hidden": true,
                     "description": "GenBank feature type to search in genbank file for extraction.",
-                    "enum": ["CDS", "gene", "rRNA", "source", "tRNA", "misc_feature"]
+                    "enum": [
+                        "CDS",
+                        "gene",
+                        "rRNA",
+                        "source",
+                        "tRNA",
+                        "misc_feature"
+                    ]
                 },
                 "genbank_query_qualifier": {
                     "type": "string",
                     "default": "product",
                     "hidden": true,
                     "description": "Qualifier term within each genbank feature to search in genbank file for extraction.",
-                    "enum": ["gene", "inference", "locus_tag", "old_locus_tag", "product", "translation"]
+                    "enum": [
+                        "gene",
+                        "inference",
+                        "locus_tag",
+                        "old_locus_tag",
+                        "product",
+                        "translation"
+                    ]
                 },
                 "genbank_search_type": {
                     "type": "string",
@@ -888,7 +902,14 @@
                     "description": "Method used to save pipeline results to output directory.",
                     "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.",
                     "fa_icon": "fas fa-copy",
-                    "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"],
+                    "enum": [
+                        "symlink",
+                        "rellink",
+                        "link",
+                        "copy",
+                        "copyNoFollow",
+                        "move"
+                    ],
                     "hidden": true
                 },
                 "email": {