seglh-naming

This repository contains libraries that build, read and validate SEGLH sample/analysis naming conventions.

Installation

The repository provides a python package which can be installed with:

python setup.py install

NB: Use the --user flag or install into an virtualenv/pipenv if not installing globally.

Contributions

Any contributions must follow GIT-Flow. Code reviews are mandatory and must be done by a representative of each site implementing the naming scheme.

A full test suite is provided and can be run with pytest -v. Code formatting must follow recommendations of PEP8. Both requirements are checked automatically via GitHub Actions on pushes to develop and main branches

Naming schemes

After installation the package can be used to validate the folling naming schemes:

Sample name
Samplesheet name

The libraries validate samplesheet name/path, sample name or file name/path for conformity (formatting, required identifiers). This is performed by accessing the corresponding class property. If validation fails, a ValueError exception is raised.

The naming schemes are structured as follows:

Sample name

NGS123_12_382398_P9392B_JD_M_VCP0R33_Pan0000_RJZ_S12_R1_001.realigned.bam
+===== += +===== +===== += + +====== +====== +== +== += +==+=============
|      |  |      |      |  | |       |       |   |   |  |  |
|      |  |      |      |  | |       |       |   |   |  |  +- rest (trailing string, optional)
|      |  |      |      |  | |       |       |   |   |  +---- stable (not informative, optional)
|      |  |      |      |  | |       |       |   |   +------- readnumber (optional)
|      |  |      |      |  | |       |       |   +----------- samplesheetindex (optional)
|      |  |      |      |  | |       |       +--------------- ods (ODS code, optional)
|      |  |      |      |  | |       +----------------------- panelnumber (Pan Number)
|      |  |      |      |  | +------------------------------- panelname (Human readable Pan number, optional)
|      |  |      |      |  +--------------------------------- sex (optional)
|      |  |      |      +------------------------------------ initials (secondary identifier, optional)
|      |  |      +------------------------------------------- id2 (secondary identifier, optional)
|      |  +-------------------------------------------------- id1 (DNA number)
|      +----------------------------------------------------- samplecount (number in batch/library)
+------------------------------------------------------------ libraryprep (library name)

Requirements

When creating the sample name (e.g. in the sample sheet) the sample name should adhere to the following conventions. Examples are provided.

Special characters (e.g. -, $, _) are not allowed within the identifiers, and identifiers must be spaced using a single underscore only.

Element	Requirements	Examples
libraryprep	≥3 uppercase letters followed by ≥1 digit and ≥0 digits/letters/combination (upper and/or lower case)	NGS431, NGS431rpt, ADX346, TSO239, SNP123, WES894
samplecount	2-3 digits	00, 15, 120
id1	≥6 digits	123456, 000000
id2	Starts with either 1 digit, 'HD', 'NA', 'NTC', 'NC' or 'SC'. Followed by at least 3 digits/letters/combination (upper and/or lower case)	223785, 2345, NA12878, HD200, NTC000, NC000, SC07100496
initials	2 uppercase letters	AW, RD
sex	'M', 'F', or 'U'	M, F, U
panelname	3+ digits/letters/combination (upper and/or lower case)	FFPEControl, SWIFT57, VCP1R134StG, SNPIDv2, CRC
panelnumber	'Pan' followed by 2+ digits	Pan2345, Pan3456

The following additional requirements must also be met:

TSO500 sample names must not be longer than 40 characters
The sample name must contain id1
The sample name must also contain: id2, or both initials and sex

Samplesheet name

211008_A01229_0040_AHKGTFDRXY_SampleSheet.csv
+===== +===== +=== +========= +========== +==
|      |      |    |          |           |
|      |      |    |          |           +- fileext
|      |      |    |          +------------- samplesheetstr
|      |      |    +------------------------ flowcellid
|      |      +----------------------------- autoincrno
|      +------------------------------------ sequencerid
+------------------------------------------- date (YYMMDD)

Requirements

When creating the samplesheet name, it should adhere to the following conventions. Examples are provided.

Special characters (e.g. -, $, _) are not allowed within the identifiers unless explicitly stated, and identifiers must be spaced using a single underscore only. There must not be an underscore between 'SampleSheet' and '.csv'.

Element	Requirements	Examples
date	6 digits	221004
sequencerid	≥1 digits or uppercase letters	M02353, NB551068
autoincrno	4 digits	0123
flowcellid	EITHER (9 zeros followed by a hyphen then 5 digits/uppercase letters/combination) OR (9 digits/uppercase letters/combination)	000000000-JRLLW, AHJYL5AFX3
samplesheetstr	'SampleSheet' Must be an exact match	SampleSheet
fileext	'.csv' Must be an exact match	.csv

Usage

Sample

from seglh_naming.sample import Sample

# Validate sample name
sample = Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001')

# Build and validate from the constituent parts
sample = Sample.from_dict({
	"libraryprep": "NGS123",
	"samplecount": 12,
	"id1": "382398",
	"initials":	"JD",
	"sex": "M",
	"panelname": "VCP0R33",
	"panelnumber": "Pan0000"
})

sample = Sample.from_string('NGS123_12_382398_JD_C_VCP0R33_Pan0000_S12_R1_001')
# ValueError: Sex invalid (C)

Get name and constituent parts

Get the minimal required Sample ID.

from seglh_naming.sample import Sample

sample = Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001.realigned.bam')

print(sample)
# NGS123_12_382398_JD_M_VCP0R33_Pan0000

print(repr(sample))
# NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001

## Samplesheet
```python
from seglh_naming.samplesheet import Samplesheet

# Validate samplesheet name
samplesheet = Samplesheet.from_string('211108_A01229_0040_AHKGTFDRXY_SampleSheet.csv')

samplesheet = Samplesheet.from_string('21110_A01229_0040_AHKGTFDRXY_SampleSheet.csv')
# ValueError: Date invalid (21110)

Get or edit constituents of sample name.

from seglh_naming.sample import Sample

sample = Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001')

sample.id1
# 382398

sample.id1 = '000111'

print(sample)
# NGS123_12_000111_JD_M_VCP0R33_Pan0000

print(sample.is_modified)
# True

File extensions or types

from seglh_naming.sample import Sample

sample = Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001.realigned.vcf.gz')

print(sample.file_extension())
# vcf.gz

print(sample.file_extension(include_compression=False))
# vcf

Deidentify

Return a stable identifier for a given sample ID as a salted, cryptographic hash (SHA256).

from seglh_naming.sample import Sample

print(Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001').hash())
# 998121029e4cd9b64ec7f9218f776255dd16642db498c50e3f2f378153272d84
print(Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0000').hash())
# 998121029e4cd9b64ec7f9218f776255dd16642db498c50e3f2f378153272d84
print(Sample.from_string('NGS123_12_382398_JD_M_VCP0R33_Pan0001').hash())
# 9b37c0d8271ca42e5e1067feb22ff3ff2163e549a6094cc2c11ac912d463f07b

Samplesheet

Get name and constituent parts

Get the minimal required string from the input.

from seglh_naming.samplesheet import Samplesheet

samplesheet = Samplesheet.from_string('/home/samplesheet/211108_A01229_0040_AHKGTFDRXY_SampleSheet.csv')

print(samplesheet)
# 211108_A01229_0040_AHKGTFDRXY_SampleSheet.csv

print(repr(sample))
# 211108_A01229_0040_AHKGTFDRXY_SampleSheet.csv

Get or edit constituents of samplesheet name.

from seglh_naming.samplesheet import Samplesheet

samplesheet = Samplesheet.from_string('211108_A01229_0040_AHKGTFDRXY_SampleSheet.csv')

samplesheet.sequencerid

# 'A01229'

samplesheet.sequencerid = 'NB552085'

print(samplesheet)
# 211108_NB552085_0040_AHKGTFDRXY_SampleSheet.csv

print(samplesheet.is_modified)
# True

Deidentify

Returns a stable identifier for a given sample ID as a salted, cryptographic hash (SHA256).

from seglh_naming.samplesheet import Samplesheet

print(Samplesheet.from_string('211108_A01229_0040_AHKGTFDRXY_SampleSheet.csv').hash())
# 4b9259df18af72841419d832988d6ffdd58ec16525b6ccdca908b9e410803c62
print(Samplesheet.from_string('220401_M02353_0676_000000000-K4669_SampleSheet.csv').hash())
# 22ed36ee4ec1a858dc46c49d924799f810f055031d4a6a348c6609766a076741
print(Samplesheet.from_string('220401_NB552085_0188_AHJWL5AFX3_SampleSheet.csv').hash())
# 6e996e99483f40c3b9ebd52d0f56c672813907bfe58a08d92f4155f5050c86a3

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
.github/workflows		.github/workflows
seglh_naming		seglh_naming
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

seglh-naming

Installation

Contributions

Naming schemes

Sample name

Requirements

Samplesheet name

Requirements

Usage

Sample

Get name and constituent parts

File extensions or types

Deidentify

Samplesheet

Get name and constituent parts

Deidentify

About

Releases 5

Packages

Contributors 5

Languages

License

moka-guys/seglh-naming

Folders and files

Latest commit

History

Repository files navigation

seglh-naming

Installation

Contributions

Naming schemes

Sample name

Requirements

Samplesheet name

Requirements

Usage

Sample

Get name and constituent parts

File extensions or types

Deidentify

Samplesheet

Get name and constituent parts

Deidentify

About

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Contributors 5

Languages

Packages