Structure tokens #43

Merged: 11 commits from structure_tokens into main on Mar 10, 2025
Conversation

@ncfrey (Collaborator) commented on Mar 7, 2025:

No description provided.

@ncfrey (Collaborator, Author) left a review comment on the following snippet:
def _make_latent_generator_tokenizer() -> PreTrainedTokenizerFast:
    """Create a `PreTrainedTokenizerFast` object for tokenizing protein structure latent generator sequences."""
@ncfrey (Collaborator, Author):
Should we make this more general and say "tokenization of 3D coordinates"?

Collaborator:
I was going to add the same comment.

Collaborator:
As in, rename it to something like `_make_3d_coordinates_tokenizer`, or just change the description?

Collaborator:
I think renaming the whole tokenizer class and related functions to something that includes "3D coordinates" makes sense, since otherwise it's not immediately obvious what the tokenizer is for.


from ._make_pretrained_tokenizer_fast import make_pretrained_tokenizer_fast

LG_VOCAB = {'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, '<mask>': 4, '.': 5, 'a': 6, 'b': 7, 'c': 8,
Collaborator:
Not a strong preference, but this could be a .txt file? Either way works, though.

Collaborator:
Can we keep it as a dictionary? The vocab.txt format just ends up adding more operations to get back to the dictionary, and I like the simplicity of being able to use the dictionary elsewhere.
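The trade-off being discussed can be sketched as follows. This is a minimal illustration, not code from the PR: it assumes a vocab.txt convention of one token per line with the line index as the token id, and the helpers `vocab_to_txt_lines` / `vocab_from_txt_lines` are hypothetical names.

```python
# A truncated stand-in for the PR's LG_VOCAB dictionary (hypothetical subset).
LG_VOCAB = {'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, '<mask>': 4,
            '.': 5, 'a': 6, 'b': 7, 'c': 8}

def vocab_to_txt_lines(vocab):
    """Serialize a vocab dict to vocab.txt lines: one token per line, ordered by id."""
    return [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

def vocab_from_txt_lines(lines):
    """The extra step the .txt format requires: rebuild the dict from line order."""
    return {tok: i for i, tok in enumerate(lines)}

# Round-trip: the dictionary survives, but only via an extra parse step that
# the in-module dict avoids entirely.
assert vocab_from_txt_lines(vocab_to_txt_lines(LG_VOCAB)) == LG_VOCAB
```

Keeping the dict in the module makes it directly importable elsewhere; the .txt file shortens the module but adds the serialize/parse round-trip above.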

@karinazad marked this pull request as ready for review on March 10, 2025, 14:03.

from ._make_pretrained_tokenizer_fast import make_pretrained_tokenizer_fast

LG_VOCAB = {
Collaborator:
I think the file is a bit too long when these are defined as a dictionary; it might be better just to use the saved .txt file instead.

@Sidney-Lisanza merged commit 35c691e into main on Mar 10, 2025 (5 checks passed).
@Sidney-Lisanza deleted the structure_tokens branch on March 10, 2025, 17:35.
3 participants