prettier changelog, fixes in .network and .processing, tests
- prettier changelog
- some style changes to improve Codacy
- finetuning choose_correct_isoform for GlycoDraw formatting
- fixing an issue in find_shared_virtuals
- fixing an issue in distance_from_metrics
- a few more tests
Bribak committed Dec 19, 2024
1 parent dfb786d commit d2f5d55
Showing 8 changed files with 125 additions and 58 deletions.
68 changes: 36 additions & 32 deletions CHANGELOG.md
@@ -2,49 +2,49 @@

## [1.5.0]

-### Added
+### Added ✨
- Added type hints to all functions (e6721a1)
- Added CodeCov shield to track PyTest test code coverage (23d6456)
- Added more PyTest unit tests (e.g., 0c94995, 23d6456, 5a99d6b, f76535e, 94646ad, d5f5d4e, 918d18f, d1a8c6d, 194f31c)
- Added setuptools to required_installs to support pip installation beyond `pip 25.0` (94646ad)
- Added pyproject.toml to support pip installation beyond `pip 25.0` (94646ad)
- Added CITATION.bib to allow for even easier citation of glycowork (a64f694)

-### Changed
+### Changed 🔄
- Bumped minimum supported Python version to 3.9 (3.8 is no longer supported, see https://devguide.python.org/versions/) (4960c5c)
- Switched docstring style to docments (<https://nbdev.fast.ai/tutorials/best_practices.html#document-parameters-with-docments>) (e6721a1)

### glycan_data
-##### Added
+##### Added ✨
- Added new named motifs to `motif_list`: DisialylLewisC, "Sia(a2-3)Gal(b1-3)[Sia(a2-6)]GlcNAc"; RM2, "Sia(a2-3)[GalNAc(b1-4)]Gal(b1-3)[Sia(a2-6)]GlcNAc"; DisialylLewisA, "Sia(a2-3)Gal(b1-3)[Fuc(a1-4)][Sia(a2-6)]GlcNAc" (a64f694)
- Added new curated glycomics dataset, `mouse_brain_GSL_PMID39375371` (b94744e)

-##### Changed
+##### Changed 🔄
- Changed `glycoproteomics_human_keratinocytes_PMID37956981` to `glycoproteomics_human_keratinocytes_N_PMID37956981` (d5f5d4e)
- Improved the description of blood group motifs in `motif_list` (including type 3 blood group antigens, ExtB, and parent motifs) (b94744e)

#### loader
-##### Added
+##### Added ✨
- Added `count_nested_brackets` helper function to monitor level of nesting in glycans (41bb1a1, d57b836)
- Added dictionaries with lists of strings as values as a new supported data type for `DataFrameSerializer` (034b6ad)

-##### Changed
+##### Changed 🔄
- Changed `resources.open_text` to `resources.files` to prevent `DeprecationWarning` from `importlib` (0c94995)
- `lectin_specificity` now uses our custom `DataFrameSerializer` and is stored as a .json file rather than a .pkl file, to improve long-term stability across versions (034b6ad)
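
The `resources.open_text` to `resources.files` switch mentioned in this entry follows the standard-library migration path. A minimal sketch of the newer call pattern, using the stdlib `json` package as a stand-in (glycowork's own package and file names are not shown in this diff, and `read_packaged_text` is a hypothetical helper):

```python
from importlib import resources


def read_packaged_text(package: str, filename: str) -> str:
    """Read a text file shipped inside an importable package (hypothetical helper)."""
    # resources.files() returns a Traversable; joinpath() selects the file in it.
    return resources.files(package).joinpath(filename).read_text(encoding="utf-8")


# Stand-in example: read a source file from the stdlib `json` package.
text = read_packaged_text("json", "__init__.py")
```

`resources.files` is available from Python 3.9, which matches the minimum version stated in this changelog.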

#### stats
-##### Added
+##### Added ✨
- Added the "random_state" keyword argument to `clr_transformation` to allow users to provide a reproducible RNG seed (b94744e)

-##### Changed
+##### Changed 🔄
- For `replace_outliers_winsorization`, in small datasets, the 5% limit is dynamically changed to include at least one datapoint (23d6456)
- Handled the edge case of strong differences in `cohen_d` with zero standard deviation; now outputting positive/negative infinity (23d6456)
- Renamed `test_inter_vs_intra_group` to `compare_inter_vs_intra_group`, to avoid testing issues (23d6456)
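
The zero-standard-deviation edge case for `cohen_d` can be sketched as follows. This is an illustration only: it assumes a pooled-variance definition of Cohen's d, and `cohen_d_sketch` is a hypothetical name, not the function in glycowork.

```python
import math
from statistics import mean, variance


def cohen_d_sketch(x: list[float], y: list[float]) -> float:
    """Cohen's d with a pooled standard deviation (illustrative, not glycowork's code)."""
    nx, ny = len(x), len(y)
    diff = mean(x) - mean(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    if pooled_var == 0:
        # No spread in either group: 0 for equal means, otherwise signed infinity.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)
```

Returning a signed infinity keeps the direction of the difference instead of raising `ZeroDivisionError` or producing `nan`.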

-##### Deprecated
+##### Deprecated ⚠️
- Deprecated `hlm`, `fast_two_sum`, `two_sum`, `expansion_sum`, and `update_cf_for_m_n`, which will all be done in-line instead (e1afe33)

-##### Fixed
+##### Fixed 🐛
- Fixed DeprecationWarning in `calculate_permanova_stat` for calling nonzero on 0d arrays (23d6456)
- Prevent possible division by zero in pseudo-F calculation in `calculate_permanova_stat` (23d6456)
- Fixed DeprecationWarning in `jtkdist` for calling `np.sum(generator)` (23d6456)
@@ -55,11 +55,11 @@

### motif
#### tokenization
-##### Added
+##### Added ✨
- Added `get_random_glycan` to retrieve random glycan sequences (optionally of specific glycan type) (d1a8c6d)
- Supported intramolecular modifications like lactonization in `glycan_to_composition` (8c69c2c)

-##### Changed
+##### Changed 🔄
- Changed `resources.open_text` to `resources.files` to prevent `DeprecationWarning` from `importlib` (0c94995)
- The monosaccharide keys of the output dictionaries of `glycan_to_composition` are now alphabetically sorted (8c69c2c)
- Modified `calculate_adduct_mass` to deal with a greater variety of adduct handling, such as "C2H4O2", "-H2O", "+Na" to add or subtract masses (8c69c2c)
@@ -68,17 +68,17 @@
- In addition to chemical formulae, users can now also provide direct additional masses as floats with the same "adduct" keyword argument in `composition_to_mass` and `glycan_to_mass` (d57b836)
- `get_modification` will no longer return the 5Ac / 5Gc of Neu5Ac / Neu5Gc as part of the modification (0387d37)

-##### Fixed
+##### Fixed 🐛
- Fixed an edge case in `get_unique_topologies`, in which the absence of a universal replacer sometimes created an empty list that was attempted to be indexed (0c94995)
- Made sure that `compositions_to_structures` always returns a DataFrame, even if no matches are found (0c94995)

#### processing
-##### Added
+##### Added ✨
- Added "antennary_Fuc" as another inferred feature to `infer_features_from_composition` (a64f694)
- Added "IdoA", "GalA", and "Araf" to recognized WURCS2 tokens (52fc16e)
- Added the new "order_by" keyword argument to `choose_correct_isoform` to enforce strictly sorting branches by branch endings / linkages, if desired (918d18f)

-##### Changed
+##### Changed 🔄
- `check_nomenclature` will now actually raise appropriate Exceptions, in case nomenclature is incompatible with glycowork, instead of print warnings (23d6456)
- Supported triple-branch reordering in `find_isomorphs` and `choose_correct_isoform` (918d18f)
- Improved `find_isomorphs` to swap neighboring branches with different levels of nesting (41bb1a1, 034b6ad)
@@ -94,24 +94,24 @@
- Ensured that `canonicalize_iupac` works with lactonized glycans (i.e., containing something like "1,7lactone") (8c69c2c)
- `find_matching_brackets_indices` has been renamed to `get_matching_indices` and now takes multiple delimiter choices and returns a generator, including the level of nesting (basically what `.draw.matches` used to do) (e1afe33)

-##### Fixed
+##### Fixed 🐛
- Fixed component inference in `parse_glycoform` in case of unexpected composition formats (0c94995)
- Fixed an issue in `equal_repeats`, in which identical repeats sometimes were not returning True (0c94995)

#### graph
-##### Added
+##### Added ✨
- Natively support narrow linkage ambiguity in `categorical_node_match_wildcard`; that means you can use things like "Gal(b1-3/4)GlcNAc" with `subgraph_isomorphism` or `compare_glycans` (as well as all functions using these core functions) and it will only return True for "Gal(b1-3)GlcNAc", "Gal(b1-4)GlcNAc", and "Gal(b1-?)GlcNAc" (b94744e)
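
The matching rule described in this entry can be sketched at the level of a single linkage string. `linkage_matches` is a hypothetical helper written for illustration; how `categorical_node_match_wildcard` implements the check internally is not shown in this diff.

```python
def linkage_matches(pattern: str, linkage: str) -> bool:
    """Check a concrete linkage against a pattern with narrow ambiguity (illustrative)."""
    pat_prefix, pat_pos = pattern.rsplit("-", 1)
    link_prefix, link_pos = linkage.rsplit("-", 1)
    # Anomeric state and donor position (e.g. "b1") must agree exactly.
    if pat_prefix != link_prefix:
        return False
    # A fully ambiguous position on either side stays compatible.
    if "?" in (pat_pos, link_pos):
        return True
    # Otherwise the acceptor position must be one of the listed options.
    return link_pos in pat_pos.split("/")
```

With the pattern `"b1-3/4"`, this accepts `"b1-3"`, `"b1-4"`, and `"b1-?"` and rejects `"b1-6"`, mirroring the behavior the entry describes.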

-##### Changed
+##### Changed 🔄
- Ensured that `compare_glycans` is 100% order-specific, never matching something like ("Gal(b1-4)GlcNAc", "GlcNAc(b1-4)Gal") (5a99d6b)
- `glycan_to_nxGraph` will now return an empty graph if the input is an empty string (4f1ccfa)
- `get_possible_topologies` will now also produce a warning (and return the input) if an already defined topology is provided as a pre-calculated graph (3f22f14)

#### draw
-##### Added
+##### Added ✨
- Added the "drawing" argument to `draw_hex`, `hex_circumference`, `add_bond`, `add_sugar`, and `draw_bracket` to avoid having to operate on global variables (918d18f)

-##### Changed
+##### Changed 🔄
- `matches` can now also use [] as delimiters (f76535e)
- Support easy import of `GlycoDraw`, via `from glycowork import GlycoDraw` (d5f5d4e)
- Renamed `hex` to `draw_hex`, to avoid overwriting the built-in `hex` (918d18f)
@@ -121,20 +121,20 @@
- Improved console drawing quality controlled by `display_svg_with_matplotlib` and image quality in Excel cells using `plot_glycans_excel` (a64f694)
- `draw_chem2d` and `draw_chem3d` will now detect whether the user is in a Jupyter environment and, if not, plot to the Matplotlib console (c3a7f64)

-##### Deprecated
+##### Deprecated ⚠️
- Deprecated `hex_circumference`, the functionality is now available within `draw_hex` with the new keyword argument "outline_only" (4f1ccfa)
- Deprecated `multiple_branches`, `multiple_branch_branches`, `branch_order`, and `reorder_for_drawing` accordingly (41bb1a1)
- Deprecated `matches`, which will now be done by `.processing.get_matching_indices` that has been reworked

-##### Fixed
+##### Fixed 🐛
- Made sure `scale_in_range` never divides by zero, if value range is zero (f76535e)

#### analysis
-##### Changed
+##### Changed 🔄
- `get_glycanova` will now raise a ValueError if fewer than three groups are provided in the input data (f76535e)
- Improved console drawing quality controlled by `display_svg_with_matplotlib` and image quality in Excel cells using `plot_glycans_excel` (a64f694)

-##### Fixed
+##### Fixed 🐛
- Fixed a FutureWarning in `get_lectin_array` by avoiding DataFrame.groupby with axis=1 (f76535e)
- Fixed a RuntimeWarning in `get_biodiversity` by handling statistical tests of identical alpha diversity values between groups (f76535e)
- Made sure that the TSNE perplexity fits the sample size in `plot_embeddings` (d5f5d4e)
@@ -146,37 +146,41 @@
- Ensured that "effect_size_variance = True" in `get_differential_expression` always formats variances correctly (ef3da9c)

#### regex
-##### Fixed
+##### Fixed 🐛
- Fixed an issue in `get_match_batch`, in which precompiled patterns caused issues in `get_match` (194f31c)

#### annotate
-##### Added
+##### Added ✨
- Added `get_size_branching_features` to create glycan size and branching level features for downstream analysis (d57b836)
- Added the "size_branch" option in the "feature_set" keyword argument of `annotate_dataset` and `quantify_motifs`, to analyze glycans by size or level of branching (d57b836)

### ml
#### model_training
-##### Added
+##### Added ✨
- Added classification-AUROC, multilabel-accuracy, multilabel-MCC, regression-MAE, and regression-R2 as metrics to `train_model` (#66)
- Added the "return_metrics" keyword argument to `train_model` that can additionally return all training and validation metrics (#66)

-##### Changed
+##### Changed 🔄
- Weigh metric calculation by batch-size (correctly handling the last batch) in `train_model` (#66)
- Best performances in `train_model` are now taken from the overall best model (lowest loss), not from best-model-per-metric (#66)
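
Weighting a per-batch metric by batch size matters when the last batch is smaller than the rest. A minimal sketch with made-up numbers (`weighted_epoch_metric` is a hypothetical helper, not the code in `train_model`):

```python
def weighted_epoch_metric(batch_metrics: list[float], batch_sizes: list[int]) -> float:
    """Average per-batch metrics, weighting each batch by its number of samples."""
    total = sum(batch_sizes)
    return sum(m * n for m, n in zip(batch_metrics, batch_sizes)) / total


# Two full batches of 32 and a final batch of only 4 samples (illustrative values).
metrics = [0.9, 0.9, 0.5]
sizes = [32, 32, 4]

naive = sum(metrics) / len(metrics)              # treats the small batch as equal
weighted = weighted_epoch_metric(metrics, sizes) # small batch contributes 4/68
```

The unweighted mean lets the four-sample batch pull the epoch metric down as much as a full batch would; the weighted version limits its influence to its share of the samples.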

-##### Fixed
+##### Fixed 🐛
- Fixed an indexing issue in `train_ml_model` if "additional_features_train" / "additional_features_test" were used (b94744e)

#### inference
-##### Changed
+##### Changed 🔄
- Changed `resources.open_text` to `resources.files` to prevent `DeprecationWarning` from `importlib` (d1a8c6d)

### network
#### evolution
-##### Fixed
+##### Fixed 🐛
- Fixed DeprecationWarning in `distance_from_embeddings` to prevent DataFrameGroupBy.apply from operating on the grouping columns (94646ad)
+- Fixed an issue in `distance_from_metric` where networks were indexed incorrectly based on presented DataFrame order

#### biosynthesis
-##### Changed
+##### Changed 🔄
- Made sure in `network_alignment` that only nodes that are virtual in all aligned networks stay virtual (918d18f)
- `choose_leaves_to_extend` will now correctly return no leaf node glycan if the target composition cannot be reached from any of the leaf nodes in a network (918d18f)

+##### Fixed 🐛
+- Fixed an issue in `find_shared_virtuals` in which no shared nodes were found because of graph comparisons
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -35,19 +35,19 @@ pytest

#### sub_module_name

-##### Added
+##### Added ✨

* Added new feature X (commit-hash)

-##### Changed
+##### Changed 🔄

* Modified behavior of Y (commit-hash)

-##### Deprecated
+##### Deprecated ⚠️

* Removed feature Z (commit-hash)

-##### Fixed
+##### Fixed 🐛

* Fixed bug in W (commit-hash)

4 changes: 2 additions & 2 deletions glycowork/motif/graph.py
@@ -222,7 +222,7 @@ def expand_termini_list(motif: Union[str, nx.Graph], # Glycan motif sequence or
   num_linkages = motif.count('(') if isinstance(motif, str) else (len(motif) - 1) // 2
   result = [None] * (len(termini_list) + num_linkages)
   j = 0
-  for i in range(len(result)):
+  for i, _ in enumerate(result):
     if i % 2 == 0:
       result[i] = termini_list[j]
       j += 1
@@ -479,7 +479,7 @@ def try_string_conversion(graph: nx.Graph # Glycan graph to validate
     temp = graph_to_string(graph)
     temp = glycan_to_nxGraph(temp)
     return graph_to_string(temp)
-  except:
+  except (ValueError, IndexError):
     return None
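
Narrowing a bare `except:` changes which failures are swallowed. The sketch below uses a hypothetical `safe_convert` helper (not part of glycowork) to show the general effect: the anticipated error maps to `None`, while anything unexpected still surfaces. Whether `ValueError` and `IndexError` cover every failure mode of `graph_to_string` and `glycan_to_nxGraph` is not confirmed by this diff.

```python
def safe_convert(value):
    """Return int(value), or None if the conversion fails in the expected way."""
    try:
        return int(value)
    except ValueError:
        # Only the anticipated failure is turned into None.
        return None


ok = safe_convert("12")       # converts normally
bad = safe_convert("twelve")  # expected failure, mapped to None

# An unexpected failure (wrong type) is no longer hidden by the handler.
try:
    safe_convert(None)
    unexpected_hidden = True
except TypeError:
    unexpected_hidden = False
```

A bare `except:` would also catch `KeyboardInterrupt` and `SystemExit`, which is one reason linters such as Codacy flag it.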


2 changes: 1 addition & 1 deletion glycowork/motif/processing.py
@@ -298,7 +298,7 @@ def compare_branches(branch1: str, branch2: str, use_linkage: bool = False) -> b
         kill_list.add(g)
   if order_by == "linkage":
     for g in glycans2:
-      if g[:g.index('[')].count('(') == 1 and g[g.index('['):g.index(']')].count('(') > 1:
+      if g[:g.index('[')].count('(') == 1 and g[g.index('['):g.index(']')].count('(') > 1 and g.count('[') > 1 and g.startswith('F'):
         kill_list.add(g)
       if pair_match := re.search(r'\[((?:[^[\]]|\[(?:[^[\]]|\[[^[\]]*\])*\])*)\]\[((?:[^[\]]|\[(?:[^[\]]|\[[^[\]]*\])*\])*)\]', g):
         if compare_branches(pair_match.group(1), pair_match.group(2)):
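
The regular expression in this hunk looks for two directly adjacent bracketed branches, each allowing up to two further levels of nested brackets. A small sketch of what it captures, using the DisialylLewisA string from the changelog and a synthetic nested string; the constant name `ADJACENT_BRANCHES` is introduced here for illustration only.

```python
import re

# Pattern copied from the hunk: "[branch1][branch2]" with limited nesting inside each.
ADJACENT_BRANCHES = re.compile(
    r'\[((?:[^[\]]|\[(?:[^[\]]|\[[^[\]]*\])*\])*)\]'
    r'\[((?:[^[\]]|\[(?:[^[\]]|\[[^[\]]*\])*\])*)\]'
)

# Two neighboring branches: both are captured as separate groups.
m = ADJACENT_BRANCHES.search("Sia(a2-3)Gal(b1-3)[Fuc(a1-4)][Sia(a2-6)]GlcNAc")
pair = (m.group(1), m.group(2))

# A nested bracket inside the first branch stays part of that branch.
n = ADJACENT_BRANCHES.search("A[B[C]D][E]F")
nested_pair = (n.group(1), n.group(2))

# A single branch has no adjacent partner, so nothing matches.
single = ADJACENT_BRANCHES.search("Gal(b1-4)[Fuc(a1-3)]GlcNAc")
```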
2 changes: 1 addition & 1 deletion glycowork/motif/tokenization.py
@@ -156,7 +156,7 @@ def stemify_glycan(glycan: str, # Glycan in IUPAC-condensed format
   cut = glycan_start[index_pos:]
   try:
     cut = cut.split('(', 1)[0]
-  except:
+  except IndexError:
     pass
   # Replace offending monosaccharide with stemified monosaccharide
   if cut not in clean_list:
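
A note on the guarded expression itself: assuming `cut` is a `str`, `cut.split('(', 1)[0]` always has a first element, so the `IndexError` handler is unlikely to ever run. A quick check with illustrative inputs:

```python
samples = ["Neu5Ac(a2-3)Gal", "Gal", ""]

# str.split with an explicit separator returns at least one element,
# even for an empty string or when the separator is absent.
heads = [s.split("(", 1)[0] for s in samples]
```

If that assumption about `cut` holds, the `try`/`except` could probably be dropped entirely, although this commit keeps it.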
16 changes: 7 additions & 9 deletions glycowork/network/biosynthesis.py
@@ -147,8 +147,9 @@ def find_shared_virtuals(glycan_a: str, # First glycan
   ggraph_nb_b, _ = get_virtual_nodes(glycan_b, graph_dic, min_size = min_size)
   # Check whether any of the nodes of glycan_a and glycan_b are the same
   out = set((glycan_a, glycans_a[k]) for k, graph_a in enumerate(ggraph_nb_a)
-            for graph_b in ggraph_nb_b if compare_glycans(graph_a, graph_b))
-  out.update((glycan_b, g[1]) for g in out)
+            for graph_b in ggraph_nb_b if subgraph_isomorphism(graph_a, graph_b))
+  out2 = {(glycan_b, g[1]) for g in out}
+  out.update(out2)
   return list(out)
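
Besides the switch to `subgraph_isomorphism`, this hunk builds `out2` before calling `update`. The commit message does not say why, but one effect is worth noting: in CPython, feeding a set a generator that iterates over that same set raises `RuntimeError` as soon as the set grows. A sketch with placeholder tuples (not real glycans):

```python
out = {("glycan_a", "shared")}

raised = False
try:
    # The generator reads from `out` while update() is adding to it.
    out.update(("glycan_b", g[1]) for g in out)
except RuntimeError:
    raised = True

# Building the new elements first avoids mutating the set mid-iteration.
safe = {("glycan_a", "shared")}
extra = {("glycan_b", g[1]) for g in safe}
safe.update(extra)
```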


@@ -240,13 +241,10 @@ def find_shortest_path(goal_glycan: str, # Target glycan
   for glycan in sorted(glycan_list, key = len, reverse = True):
     # For each glycan, check whether it could constitute a precursor (i.e., is it a sub-graph + does it stem from the correct root)
     if len(glycan) < len(goal_glycan) and goal_glycan.endswith(glycan[-5:]) and subgraph_isomorphism(ggraph, safe_index(glycan, graph_dic)):
-      try:
-        # Finding a path through shells of generated virtual nodes
-        virtual_edges, edge_labels = find_path(goal_glycan, glycan, graph_dic,
-                                               permitted_roots = permitted_roots, min_size = min_size, allowed_ptms = allowed_ptms)
-        return virtual_edges, edge_labels
-      except:
-        continue
+      # Finding a path through shells of generated virtual nodes
+      virtual_edges, edge_labels = find_path(goal_glycan, glycan, graph_dic,
+                                             permitted_roots = permitted_roots, min_size = min_size, allowed_ptms = allowed_ptms)
+      return virtual_edges, edge_labels
   return [], {}


2 changes: 1 addition & 1 deletion glycowork/network/evolution.py
@@ -90,7 +90,7 @@ def distance_from_metric(df: pd.DataFrame, # DataFrame with glycans (rows) and t
   # Get all objects to calculate distance between
   value_counts = df[rank].value_counts()
   valid_ranks = value_counts.index[value_counts >= cut_off]
-  valid_networks = [net for spec, net in zip(df[rank], networks) if spec in valid_ranks]
+  valid_networks = [net for spec, net in zip(value_counts.index, networks) if spec in valid_ranks]
   # Get distance matrix
   return calculate_distance_matrix(valid_networks, dist_func, label_list = valid_ranks.tolist())
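
The one-line change in this hunk fixes a misaligned `zip`. The sketch below reproduces the effect with plain Python: `Counter.most_common` stands in for pandas `value_counts`, all names and values are placeholders, and it assumes (as the patched line implies) that `networks` holds one network per unique rank in frequency order.

```python
from collections import Counter

ranks_per_row = ["Homo", "Homo", "Mus", "Homo", "Mus", "Rattus"]  # one entry per glycan row
networks = ["net_Homo", "net_Mus", "net_Rattus"]                   # one entry per unique rank
cut_off = 2

counts = Counter(ranks_per_row)
unique_by_count = [r for r, _ in counts.most_common()]  # most frequent rank first
valid = {r for r, c in counts.items() if c >= cut_off}

# Old pairing: row-level ranks zipped against per-rank networks, so labels drift.
old = [net for spec, net in zip(ranks_per_row, networks) if spec in valid]

# New pairing: one rank label per network, so the cut-off filter hits the right ones.
new = [net for spec, net in zip(unique_by_count, networks) if spec in valid]
```

In the old pairing the network of the below-cut-off rank slips through because it happens to be zipped with a row labeled by a valid rank.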
