Code for computing the metrics #5
Hi @violet-sto, thank you for your interest. The easiest way to get batch descriptors is the following.

Generating the reference distribution

First, you will need a reference dataset, for example the training/validation set of the model (i.e., AHO-aligned sequences from paired OAS), loaded into a DataFrame. Suppose the heavy and light sequences are stored in the columns fv_heavy_aho and fv_light_aho. Using the DataFrame, you could run the following code:

import pandas as pd
from walkjump.metrics import LargeMoleculeDescriptors
from tqdm import tqdm

def get_descriptors_as_dict(sequence: str) -> dict:
    # Keep only the fields listed in descriptor_names.
    names = set(LargeMoleculeDescriptors.descriptor_names)
    return {k: v for k, v in LargeMoleculeDescriptors.from_sequence(sequence).asdict().items() if k in names}

def rename_df(df: pd.DataFrame, prefix: str) -> None:
    # Prefix every column name in place, e.g. "x" becomes "fv_heavy_x".
    df.rename({c: f"{prefix}_{c}" for c in df.columns}, inplace=True, axis=1)

df = LOAD_DF(...) # load your csv of paired sequences
tqdm.pandas(desc="heavy")
descriptor_df_heavy = pd.DataFrame.from_records(df.fv_heavy_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for heavy chains
descriptor_df_light = pd.DataFrame.from_records(df.fv_light_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for light chains
rename_df(descriptor_df_heavy, "fv_heavy")
rename_df(descriptor_df_light, "fv_light")
ref_feats = pd.concat([descriptor_df_heavy, descriptor_df_light, df], axis=1)

Generating samples

Next, generate samples with walkjump and compute their descriptors in the same way:

from walkjump.sampling import walkjump
sample_df = walkjump(seed_sequences, **params)
samp_descriptor_df_heavy = pd.DataFrame.from_records(sample_df.fv_heavy_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for heavy chains
samp_descriptor_df_light = pd.DataFrame.from_records(sample_df.fv_light_aho.str.replace("-", "").progress_apply(get_descriptors_as_dict).values) # make descriptors for light chains
rename_df(samp_descriptor_df_heavy, "fv_heavy")
rename_df(samp_descriptor_df_light, "fv_light")
sample_df_with_descriptors = pd.concat([sample_df, samp_descriptor_df_heavy, samp_descriptor_df_light], axis=1)

Getting batch descriptors

Lastly, you can now get the batch descriptors with:

from walkjump.metrics import get_batch_descriptors
description_heavy = get_batch_descriptors(sample_df_with_descriptors, ref_feats, "fv_heavy")
description_light = get_batch_descriptors(sample_df_with_descriptors, ref_feats, "fv_light")

Hope this helps!
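
The per-sequence steps in the recipe above (strip the alignment gaps, keep only the named descriptors, prefix the keys per chain) can be tried without walkjump installed. The sketch below uses a hypothetical ToyDescriptors class as a stand-in for LargeMoleculeDescriptors. Its two fields (sequence length and cysteine count) and the helper prefixed_row are invented for illustration. They are not the descriptors or functions that walkjump provides.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyDescriptors:
    """Hypothetical stand-in for LargeMoleculeDescriptors (not the walkjump API)."""

    sequence: str
    length: int
    num_cysteines: int

    # Unannotated class attribute, so it is not a dataclass field.
    # It plays the role of descriptor_names in the recipe above.
    descriptor_names = ("length", "num_cysteines")

    @classmethod
    def from_sequence(cls, sequence: str) -> "ToyDescriptors":
        return cls(sequence, len(sequence), sequence.count("C"))


def get_descriptors_as_dict(sequence: str) -> dict:
    # Keep only the fields listed in descriptor_names (drops "sequence").
    names = set(ToyDescriptors.descriptor_names)
    return {k: v for k, v in asdict(ToyDescriptors.from_sequence(sequence)).items() if k in names}


def prefixed_row(aligned_sequence: str, prefix: str) -> dict:
    # Remove AHO gap characters, then prefix each descriptor name with the chain.
    ungapped = aligned_sequence.replace("-", "")
    return {f"{prefix}_{k}": v for k, v in get_descriptors_as_dict(ungapped).items()}


# One toy paired sequence: heavy and light descriptors end up side by side.
row = {**prefixed_row("QVQ--LC-", "fv_heavy"), **prefixed_row("DI-QC-C", "fv_light")}
print(row)
```

Each row here corresponds to one record of the concatenated descriptor DataFrame. The chain prefix is what keeps heavy and light descriptors apart once they share a table.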
Hi @djberenberg, thanks for your reply! I have successfully computed W_average with your code. However, I'm still unsure how to reproduce the metrics reported in Table 2, since some details seem to be missing from the paper.
1) How do you define the reference set? I notice that all sequences are split into train/val/test sets. This question also arises when I compute DCS.
2) How many sequences does each model generate?
3) Is the reported W_property computed over heavy chains only, or averaged over heavy and light chains?
Hi
Thanks for your excellent work! I am having trouble computing the metrics: I don't know how to get the inputs for get_batch_descriptors. Could you please provide a script for easily evaluating the model?