Code for computing the metrics #5

violet-sto · 2024-06-01T13:23:48Z

Hi

Thanks for your excellent work! I'm having trouble computing the metrics: I don't know how to get the inputs for get_batch_descriptors. Could you please provide a script for evaluating the model?

djberenberg · 2024-06-03T10:21:51Z

Hi @violet-sto, thank you for your interest. The easiest way to get the inputs for get_batch_descriptors is the following:

Generating the reference distribution

First, you will need a reference dataset, for example the training/validation set of the model (i.e., AHO-aligned sequences from paired OAS), loaded into a DataFrame. Suppose the heavy and light chain sequences are stored in the columns fv_heavy_aho and fv_light_aho, respectively.

Using the dataframe, you could run the following code:

import pandas as pd
from tqdm import tqdm

from walkjump.metrics import LargeMoleculeDescriptors

def get_descriptors_as_dict(sequence: str) -> dict:
    return {k: v for k, v in LargeMoleculeDescriptors.from_sequence(sequence).asdict().items() if k in set(LargeMoleculeDescriptors.descriptor_names)}

def rename_df(df: pd.DataFrame, prefix: str) -> None:
    # Prefix every column name in place, e.g. "x" -> "fv_heavy_x"
    df.rename({c: f"{prefix}_{c}" for c in df.columns}, inplace=True, axis=1)

df = LOAD_DF(...) # load your csv of paired sequences

tqdm.pandas(desc="heavy")
descriptor_df_heavy = pd.DataFrame.from_records(df.fv_heavy_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for heavy chains
descriptor_df_light = pd.DataFrame.from_records(df.fv_light_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for light chains

rename_df(descriptor_df_heavy, "fv_heavy")
rename_df(descriptor_df_light, "fv_light")

ref_feats = pd.concat([descriptor_df_heavy, descriptor_df_light, df], axis=1)
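
To make the resulting table layout concrete, here is a minimal, self-contained sketch of the same prefix-and-concatenate pattern on toy data. It does not use walkjump: toy_descriptors is a hypothetical stand-in for get_descriptors_as_dict, and the descriptor names "length" and "n_cys" are made up for illustration. The sketch also keeps the input index on the descriptor frames, since pd.concat(..., axis=1) aligns rows by index; the snippet above implicitly assumes df has a default RangeIndex.

```python
import pandas as pd


def toy_descriptors(sequence: str) -> dict:
    # Hypothetical stand-in for get_descriptors_as_dict; not walkjump's descriptors.
    return {"length": len(sequence), "n_cys": sequence.count("C")}


def prefixed_descriptors(aligned: pd.Series, prefix: str) -> pd.DataFrame:
    # Strip alignment gaps, compute one descriptor dict per sequence,
    # and prefix the resulting columns with the chain name.
    ungapped = aligned.str.replace("-", "", regex=False)
    records = ungapped.apply(toy_descriptors).tolist()
    # Reuse the input index so that the later column-wise concat lines up row by row.
    return pd.DataFrame(records, index=aligned.index).add_prefix(f"{prefix}_")


toy_df = pd.DataFrame(
    {
        "fv_heavy_aho": ["EVQ-LC", "QVQL--"],
        "fv_light_aho": ["DIQ-C-", "EIVC-C"],
    }
)

toy_ref_feats = pd.concat(
    [
        prefixed_descriptors(toy_df.fv_heavy_aho, "fv_heavy"),
        prefixed_descriptors(toy_df.fv_light_aho, "fv_light"),
        toy_df,
    ],
    axis=1,
)
print(toy_ref_feats.columns.tolist())
```

Each descriptor ends up in a column named after its chain (for example fv_heavy_length), next to the original sequence columns, which is the shape the later steps in this comment work with.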

Generating samples

Next, generate samples using walkjump.sampling.walkjump() and compute their descriptors in the same way.

from walkjump.sampling import walkjump

sample_df = walkjump(seed_sequences, **params)

samp_descriptor_df_heavy = pd.DataFrame.from_records(sample_df.fv_heavy_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for heavy chains
samp_descriptor_df_light = pd.DataFrame.from_records(sample_df.fv_light_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for light chains

rename_df(samp_descriptor_df_heavy, "fv_heavy")
rename_df(samp_descriptor_df_light, "fv_light")


sample_df_with_descriptors = pd.concat([sample_df, samp_descriptor_df_heavy, samp_descriptor_df_light], axis=1)

Getting batch descriptors

Lastly, you can now get the batch descriptors with:

from walkjump.metrics import get_batch_descriptors

description_heavy = get_batch_descriptors(sample_df_with_descriptors, ref_feats, "fv_heavy")
description_light = get_batch_descriptors(sample_df_with_descriptors, ref_feats, "fv_light")
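
For intuition about what a sample-versus-reference comparison of descriptor columns can look like, here is a minimal pure-Python sketch of a per-property 1-D Wasserstein distance and its average. This is an assumption about the kind of quantity being summarized (the W_property and W_average names used later in this thread suggest Wasserstein distances); it is not walkjump's implementation, and the column names fv_heavy_a and fv_heavy_b are hypothetical.

```python
from bisect import bisect_right
from statistics import fmean


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1-D distributions."""
    a, b = sorted(a), sorted(b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # Integrate |CDF_a - CDF_b| over the merged support, interval by interval.
    grid = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def per_property_distances(sample, reference, properties):
    # sample / reference: mappings from column name to a list of values.
    return {p: wasserstein_1d(sample[p], reference[p]) for p in properties}


# Toy data with hypothetical descriptor columns.
toy_sample = {"fv_heavy_a": [1.0, 2.0, 3.0], "fv_heavy_b": [0.0, 0.0]}
toy_reference = {"fv_heavy_a": [2.0, 3.0, 4.0], "fv_heavy_b": [1.0, 1.0]}

distances = per_property_distances(
    toy_sample, toy_reference, ["fv_heavy_a", "fv_heavy_b"]
)
average_distance = fmean(distances.values())
print(distances, average_distance)
```

In this toy case each sample column is the reference column shifted by 1, so every per-property distance, and therefore the average, is 1 up to floating-point rounding.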

Hope this helps!
Dan

violet-sto · 2024-06-13T07:45:17Z

Hi @djberenberg, thanks for your reply!

I have successfully computed W_average with your code. However, I'm still unsure how to reproduce the metrics reported in Table 2, since some details seem to be missing from the paper. 1) How do you define the reference set? I notice that all sequences are split into train/val/test sets. This question also arises when I compute DCS. 2) How many sequences does each model generate? 3) Is the reported W_property computed over heavy chains only, or averaged over heavy and light chains?
