clean_up_heatmap fix and correct Methyl masses
- correct Methyl masses
- duplicate fix in clean_up_heatmap
- style improvements
Bribak committed Jan 9, 2025
1 parent 87ea2fc commit e3eeb32
Showing 5 changed files with 28 additions and 19 deletions.
2 changes: 2 additions & 0 deletions .github/pull_request_template.md
@@ -1,5 +1,7 @@
# ⚠️ IMPORTANT: Branch Strategy

This repository follows a strict branch strategy:

- `master` branch is ONLY for PyPI release mirroring
- All development PRs MUST target the `dev` branch
- If your PR targets `master`, it will be flagged and you'll be asked to retarget to `dev`
14 changes: 9 additions & 5 deletions CHANGELOG.md
@@ -33,12 +33,12 @@
- `lectin_specificity` now uses our custom `DataFrameSerializer` and is stored as a .json file rather than a .pkl file, to improve long-term stability across versions (034b6ad)

##### Fixed 🐛
-- Fixed DeprecationWarning in all data-loading functions that used `importlib.resources.open_text` or `.content`
+- Fixed DeprecationWarning in all data-loading functions that used `importlib.resources.open_text` or `.content` (87ea2fc)

#### stats
##### Added ✨
- Added the "random_state" keyword argument to `clr_transformation` to allow users to provide a reproducible RNG seed (b94744e)
-- Added the `JTKTest` class object
+- Added the `JTKTest` class object (87ea2fc)

##### Changed 🔄
- For `replace_outliers_winsorization`, in small datasets, the 5% limit is dynamically changed to include at least one datapoint (23d6456)
@@ -47,7 +47,7 @@

##### Deprecated ⚠️
- Deprecated `hlm`, `fast_two_sum`, `two_sum`, `expansion_sum`, and `update_cf_for_m_n`, which will all be done in-line instead (e1afe33)
-- Deprecated `jtkdist`, `jtkinit`, `jtkstat`, `jtkx`, which will all be done by the new `JTKTest`
+- Deprecated `jtkdist`, `jtkinit`, `jtkstat`, `jtkx`, which will all be done by the new `JTKTest` (87ea2fc)

##### Fixed 🐛
- Fixed DeprecationWarning in `calculate_permanova_stat` for calling nonzero on 0d arrays (23d6456)
@@ -76,6 +76,7 @@
##### Fixed 🐛
- Fixed an edge case in `get_unique_topologies`, in which the absence of a universal replacer sometimes created an empty list that was attempted to be indexed (0c94995)
- Made sure that `compositions_to_structures` always returns a DataFrame, even if no matches are found (0c94995)
+- Provided correct exact methyl masses in `mass_dict`

#### processing
##### Added ✨
@@ -138,7 +139,7 @@
##### Changed 🔄
- `get_glycanova` will now raise a ValueError if fewer than three groups are provided in the input data (f76535e)
- Improved console drawing quality controlled by `display_svg_with_matplotlib` and image quality in Excel cells using `plot_glycans_excel` (a64f694)
-- The "periods" argument in `get_jtk` is now a keyword argument and has a default value of [12, 24]
+- The "periods" argument in `get_jtk` is now a keyword argument and has a default value of [12, 24] (87ea2fc)

##### Fixed 🐛
- Fixed a FutureWarning in `get_lectin_array` by avoiding DataFrame.groupby with axis=1 (f76535e)
@@ -150,7 +151,7 @@
- Fixed an issue where variance-filtered rows could cause problems in `get_differential_expression` if "monte_carlo = True" (ef3da9c)
- Fixed an issue in `get_differential_expression` if "sets = True" that caused indexing issues under certain conditions (ef3da9c)
- Ensured that "effect_size_variance = True" in `get_differential_expression` always formats variances correctly (ef3da9c)
-- Ensured that the combination of "grouped_BH = True", "paired = False", and CLR/ALR in `get_differential_expression` works even when negative values are present
+- Ensured that the combination of "grouped_BH = True", "paired = False", and CLR/ALR in `get_differential_expression` works even when negative values are present (87ea2fc)

#### regex
##### Fixed 🐛
@@ -161,6 +162,9 @@
- Added `get_size_branching_features` to create glycan size and branching level features for downstream analysis (d57b836)
- Added the "size_branch" option in the "feature_set" keyword argument of `annotate_dataset` and `quantify_motifs`, to analyze glycans by size or level of branching (d57b836)

+##### Fixed 🐛
+- Fixed an issue in `clean_up_heatmap` in which, occasionally, duplicate strings were introduced in the output

### ml
#### model_training
##### Added ✨
7 changes: 3 additions & 4 deletions glycowork/glycan_data/stats.py
@@ -1,12 +1,11 @@
import pandas as pd
import numpy as np
import math
import warnings
from typing import Dict, List, Optional, Tuple, Union
from collections import Counter
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
-from scipy.stats import wilcoxon, rankdata, norm, chi2, t, f, entropy, gmean, f_oneway, combine_pvalues, dirichlet, spearmanr, ttest_rel, ttest_ind
+from scipy.stats import rankdata, norm, chi2, t, f, entropy, gmean, f_oneway, combine_pvalues, dirichlet, spearmanr, ttest_rel, ttest_ind
from scipy.stats.mstats import winsorize
from scipy.spatial import procrustes
from scipy.spatial.distance import squareform
@@ -250,10 +249,10 @@ def test(self, values: np.ndarray) -> Tuple[float, int, int, float]:
     best_stats = (1.0, self.periods[0], 0, 0)
     for period in self.periods:
         waveforms_period = self.waveforms[period]
-        for phase in range(len(waveforms_period)):
+        for phase, waveform_phase in enumerate(waveforms_period):
             if phase > 0 and (period+(phase*self.interval) in self.periods or period-(phase*self.interval) in self.periods):
                 continue
-            S = (signs * waveforms_period[phase]).sum()
+            S = (signs * waveform_phase).sum()
             if S == 0:
                 continue
             jtk = (abs(S) + self.max_stat) / 2
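
The loop change in this hunk is a style refactor: `enumerate` yields the phase index and the waveform together, so the body no longer indexes `waveforms_period[phase]`. A minimal sketch of the equivalence, using small made-up arrays (the names `waveforms_period` and `signs` mirror the diff, but the values are hypothetical and not taken from `JTKTest`):

```python
import numpy as np

# Hypothetical stand-ins for self.waveforms[period] and signs (not real JTK data)
waveforms_period = np.array([[1, -1, 1], [-1, 1, 1], [1, 1, -1]])
signs = np.array([1, -1, 1])

# Old form: loop over indices and look each waveform up again
old_stats = []
for phase in range(len(waveforms_period)):
    old_stats.append(int((signs * waveforms_period[phase]).sum()))

# New form: enumerate hands over the index and the row in one step
new_stats = []
for phase, waveform_phase in enumerate(waveforms_period):
    new_stats.append(int((signs * waveform_phase).sum()))

# Both loops compute the same statistic per phase
assert old_stats == new_stats
```

The index is still needed in the real method for the `phase > 0` check, which is why `enumerate` is used rather than iterating over the rows alone.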
22 changes: 13 additions & 9 deletions glycowork/motif/annotate.py
@@ -275,16 +275,20 @@ def clean_up_heatmap(
 ) -> pd.DataFrame: # DataFrame with redundant motifs removed
     "Removes redundant motif entries from glycan abundance data while preserving the most informative labels"
     motif_dic = dict(zip(motif_list.motif_name, motif_list.motif))
-    df.index = df.index.to_series().apply(lambda x: motif_dic[x] + ' '*20 if x in motif_dic else x)
+    original_index = df.index.copy()
+    df.index = df.index.to_series().apply(lambda x: motif_dic[x] + ' ' * 20 if x in motif_dic else x)
+    df['_original_position'] = range(len(df))
     # Group the DataFrame by identical rows
-    grouped = df.groupby(list(df.columns))
-    # Find the row with the longest string index within each group and return a new DataFrame
-    max_idx_series = grouped.apply(lambda group: group.index.to_series().str.len().idxmax(), include_groups = False)
-    result = df.loc[max_idx_series].drop_duplicates()
-    result.index = result.index.str.strip()
-    motif_dic = {value: key for key, value in motif_dic.items()}
-    result.index = [motif_dic.get(k, k) for k in result.index]
-    return result
+    grouped = df.groupby(list(df.columns[:-1]), sort = False)
+    # Find the integer indices of rows with the longest string index within each group
+    max_idx_positions = []
+    for _, group in grouped:
+        # Find the row with the longest string index
+        longest_idx = group.index.to_series().str.len().idxmax()
+        # Retrieve the original integer position of this row
+        max_idx_positions.append(group.loc[longest_idx, '_original_position'])
+    df.index = original_index
+    return df.iloc[max_idx_positions].drop_duplicates().drop(['_original_position'], axis = 1)


def quantify_motifs(
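
The `clean_up_heatmap` change above swaps label-based selection (`df.loc[...]`) for positional selection (`df.iloc[...]`). In pandas, `.loc` returns every row that carries a given label, so repeated labels can yield extra rows; that may be the source of the duplicates mentioned in the changelog, though the commit does not say so. The following is a simplified sketch of the positional approach on toy data. The helper name `keep_longest_label` and the toy frame are hypothetical, and the motif-name mapping from `motif_list` is left out:

```python
import pandas as pd


def keep_longest_label(df: pd.DataFrame) -> pd.DataFrame:
    """Among rows with identical values, keep the one with the longest index label."""
    work = df.copy()  # work on a copy so the caller's frame gains no helper column
    work["_original_position"] = range(len(work))
    value_columns = list(work.columns[:-1])  # everything except the helper column
    positions = []
    # sort=False keeps groups in order of first appearance
    for _, group in work.groupby(value_columns, sort=False):
        longest_label = group.index.to_series().str.len().idxmax()
        positions.append(int(group.loc[longest_label, "_original_position"]))
    # Positional selection picks exactly one row per group
    return df.iloc[positions]


# Toy abundance table: the first two rows are identical across samples
toy = pd.DataFrame(
    {"sample_1": [1.0, 1.0, 2.0], "sample_2": [0.5, 0.5, 0.1]},
    index=["Fuc", "Fuc(a1-2)Gal", "Neu5Ac"],
)
result = keep_longest_label(toy)
print(result)
```

Unlike the function in the diff, which assigns `_original_position` on the frame it receives, this sketch copies the input first; that is a design choice of the example, not a description of glycowork's behavior.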
2 changes: 1 addition & 1 deletion glycowork/motif/mz_to_composition.csv
@@ -25,4 +25,4 @@ Trifluoroacetic acid,112.9850391,113.0160096,112.9850391,113.0160096,112.9850391
PCho,165.0555,165.0555,,,,
red_end,18.0105546,18.0105546,46.0419,46.0419,102.0317,102.0317
PEtN,123.0628,123.0628,,,,
-Methyl,14.3,14.3,14.3,14.3,,
+Methyl,14.01565,14.0266,14.01565,14.0266,14.01565,14.0266
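
The two new Methyl values are consistent with the mass of a net CH2 addition (methylation replaces one H with CH3), computed once from monoisotopic and once from average atomic masses. The check below is a sketch under that assumption; the CSV header is not shown in this diff, so reading the alternating columns as monoisotopic/average pairs is also an assumption.

```python
# Atomic masses in unified atomic mass units (u)
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503}  # 12C and 1H
AVERAGE = {"C": 12.0107, "H": 1.00794}       # natural-abundance averages


def ch2_mass(atomic_masses: dict) -> float:
    """Mass of a CH2 unit: one carbon plus two hydrogens."""
    return atomic_masses["C"] + 2 * atomic_masses["H"]


mono_ch2 = ch2_mass(MONOISOTOPIC)
avg_ch2 = ch2_mass(AVERAGE)

# Compare against the values introduced in mz_to_composition.csv
assert abs(mono_ch2 - 14.01565) < 1e-5
assert abs(avg_ch2 - 14.0266) < 1e-4
```

The previous value of 14.3 matches neither figure, which fits the commit's description of it as a correction.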
