Code for computing the metrics #5

Open
violet-sto opened this issue Jun 1, 2024 · 2 comments


@violet-sto

Hi

Thanks for your excellent work! I'm having trouble computing the metrics: I don't know how to get the inputs for get_batch_descriptors. Could you please provide a script for evaluating the model?

@djberenberg

djberenberg commented Jun 3, 2024

Hi @violet-sto, thank you for the interest. The easiest way to get batch descriptors would be the following:

Generating the reference distribution

First, you will need a reference dataset, for example the training/validation set of the model (i.e., AHO-aligned sequences from paired OAS), loaded into a DataFrame. Suppose the heavy and light sequences are stored in the columns fv_heavy_aho and fv_light_aho, respectively.

Using the dataframe, you could run the following code:

import pandas as pd
from tqdm import tqdm

from walkjump.metrics import LargeMoleculeDescriptors

def get_descriptors_as_dict(sequence: str) -> dict:
    # Keep only the fields listed in descriptor_names
    names = set(LargeMoleculeDescriptors.descriptor_names)
    descriptors = LargeMoleculeDescriptors.from_sequence(sequence).asdict()
    return {k: v for k, v in descriptors.items() if k in names}

def rename_df(df: pd.DataFrame, prefix: str) -> None:
    # Prefix every column in place, e.g. "x" -> "fv_heavy_x"
    df.rename({c: f"{prefix}_{c}" for c in df.columns}, inplace=True, axis=1)

df = LOAD_DF(...) # load your csv of paired sequences

tqdm.pandas(desc="heavy")  # registers progress_apply on pandas objects
descriptor_df_heavy = pd.DataFrame.from_records(df.fv_heavy_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for heavy chains
tqdm.pandas(desc="light")  # relabel the progress bar for the light chains
descriptor_df_light = pd.DataFrame.from_records(df.fv_light_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for light chains

rename_df(descriptor_df_heavy, "fv_heavy")
rename_df(descriptor_df_light, "fv_light")

ref_feats = pd.concat([descriptor_df_heavy, descriptor_df_light, df], axis=1)
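
To make the resulting column layout concrete, here is a minimal pure-Python sketch of what the snippet above does for a single row: strip the alignment gap characters, compute descriptors on the ungapped sequence, and prefix each descriptor name with the chain. Note that `toy_descriptors` is a hypothetical stand-in for `LargeMoleculeDescriptors.from_sequence` (it only returns a length and a crude charge count), and the two sequences are made-up fragments, so the values are illustrative only.

```python
def strip_gaps(aligned_sequence: str) -> str:
    """Remove alignment gap characters ("-") from an aligned sequence."""
    return aligned_sequence.replace("-", "")


def toy_descriptors(sequence: str) -> dict:
    """Hypothetical stand-in for the real descriptor computation."""
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return {"length": len(sequence), "net_charge": positive - negative}


def prefixed_descriptors(aligned_sequence: str, prefix: str) -> dict:
    """Compute descriptors on the ungapped sequence and prefix every key."""
    descriptors = toy_descriptors(strip_gaps(aligned_sequence))
    return {f"{prefix}_{name}": value for name, value in descriptors.items()}


# One made-up row with the two column names used in this thread.
row = {"fv_heavy_aho": "EVQ--LVE-SGK", "fv_light_aho": "DIQ-MTQ--SPR"}
features = {
    **prefixed_descriptors(row["fv_heavy_aho"], "fv_heavy"),
    **prefixed_descriptors(row["fv_light_aho"], "fv_light"),
}
print(features)
```

Every key in `features` starts with `fv_heavy_` or `fv_light_`, which matches the prefix string passed to get_batch_descriptors later in this reply.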

Generating samples

Next, generate samples using walkjump.sampling.walkjump() and compute descriptors for them in the same way as for the reference set.

from walkjump.sampling import walkjump

sample_df = walkjump(seed_sequences, **params)

samp_descriptor_df_heavy = pd.DataFrame.from_records(sample_df.fv_heavy_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for heavy chains
samp_descriptor_df_light = pd.DataFrame.from_records(sample_df.fv_light_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for light chains

rename_df(samp_descriptor_df_heavy, "fv_heavy")
rename_df(samp_descriptor_df_light, "fv_light")


sample_df_with_descriptors = pd.concat([sample_df, samp_descriptor_df_heavy, samp_descriptor_df_light], axis=1)

Getting batch descriptors

Finally, compute the batch descriptors for each chain:

from walkjump.metrics import get_batch_descriptors

description_heavy = get_batch_descriptors(sample_df_with_descriptors, ref_feats, "fv_heavy")
description_light = get_batch_descriptors(sample_df_with_descriptors, ref_feats, "fv_light")
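
For intuition about what a distributional comparison between samples and a reference set involves, here is a minimal sketch of a 1D Wasserstein distance between two sets of descriptor values, followed by an average over descriptors. This is an assumption about the kind of metric involved (the follow-up in this thread refers to W_average), not the walkjump implementation. `wasserstein_1d` and `average_wasserstein` are hypothetical helper names, the numbers are made up, and any normalization the library may apply is omitted.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two empirical 1D samples.

    Computed as the integral of |F_u(x) - F_v(x)| over x, where F_u and
    F_v are the empirical CDFs of the two samples.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    if not u_sorted or not v_sorted:
        raise ValueError("both samples must be non-empty")
    points = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at left.
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def average_wasserstein(samples, reference, names):
    """Unweighted mean of the per-descriptor distances (an assumption)."""
    distances = [wasserstein_1d(samples[n], reference[n]) for n in names]
    return sum(distances) / len(distances)


# Made-up descriptor columns for generated samples and the reference set.
samples = {"fv_heavy_length": [118, 120, 121], "fv_heavy_net_charge": [0, 1, 2]}
reference = {"fv_heavy_length": [119, 120, 122], "fv_heavy_net_charge": [0, 1, 2]}
print(average_wasserstein(samples, reference, list(samples)))
```

Because the distance is taken per descriptor, its value depends on which reference set is used and on how many samples are drawn, so both should be fixed when comparing models.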

Hope this helps!
Dan

@violet-sto
Author

Hi @djberenberg, thanks for your reply!

I have successfully computed W_average using your code. However, I'm still unsure how to reproduce the metrics reported in Table 2, since some details seem to be missing from the paper. 1) How do you define the reference set? I notice that all sequences are split into train/val/test sets. The same question arises when I compute DCS. 2) How many sequences does each model generate? 3) Is the reported W_property computed over heavy chains only, or averaged over heavy and light chains?
