-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Structure tokens #43
Structure tokens #43
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
def _make_latent_generator_tokenizer() -> PreTrainedTokenizerFast: | ||
"""Create a `PreTrainedTokenizerFast` object for tokenization of protein structure latent generator sequences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make this more general and say "tokenization of 3D coordinates"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to add the same comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in rename it to something like: _make_3d_coordinates_tokenizer or just change the description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think renaming the whole tokenizer class and related functions to something that includes 3D coordinates makes sense since it's not immediately obvious what the tokenizer is for otherwise
|
||
from ._make_pretrained_tokenizer_fast import make_pretrained_tokenizer_fast | ||
|
||
LG_VOCAB = {'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, '<mask>': 4, '.': 5, 'a': 6, 'b': 7, 'c': 8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a strong preference but this could be a txt file? either way works though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we keep it as a dictionary the vocab.txt format just ends up adding more operations to get back to the dictionary, and i liek the simplicity of being able to use the dictionary elsewhere
|
||
from ._make_pretrained_tokenizer_fast import make_pretrained_tokenizer_fast | ||
|
||
LG_VOCAB = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the file is a bit too long when these are defined as a dictionary. Might be better just to use the saved txt file instead
No description provided.