Structure tokens #43

Merged: 11 commits from structure_tokens into main on Mar 10, 2025
Conversation

@ncfrey (Collaborator) commented on Mar 7, 2025:

No description provided.

@ncfrey (Collaborator, Author) left a review comment on the following snippet:
def _make_latent_generator_tokenizer() -> PreTrainedTokenizerFast:
    """Create a `PreTrainedTokenizerFast` object for tokenizing protein structure latent generator sequences."""
@ncfrey (Collaborator, Author):
Should we make this more general and say "tokenization of 3D coordinates"?

Collaborator:
I was going to add the same comment.

Collaborator:
As in, rename it to something like `_make_3d_coordinates_tokenizer`, or just change the description?

Collaborator:
I think renaming the whole tokenizer class and related functions to something that includes "3D coordinates" makes sense, since otherwise it's not immediately obvious what the tokenizer is for.


from ._make_pretrained_tokenizer_fast import make_pretrained_tokenizer_fast

LG_VOCAB = {'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, '<mask>': 4, '.': 5, 'a': 6, 'b': 7, 'c': 8,
Collaborator:
Not a strong preference, but this could be a .txt file? Either way works, though.

Collaborator:
Can we keep it as a dictionary? The vocab.txt format just ends up adding more operations to get back to the dictionary, and I like the simplicity of being able to use the dictionary elsewhere.
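The trade-off being discussed can be sketched as follows. This is a minimal illustration, not code from the PR: it assumes a vocab.txt convention of one token per line with the line index as the token id, and the helpers `vocab_to_txt_lines` / `vocab_from_txt_lines` are hypothetical names.

```python
# A truncated stand-in for the PR's LG_VOCAB dictionary (hypothetical subset).
LG_VOCAB = {'<cls>': 0, '<pad>': 1, '<eos>': 2, '<unk>': 3, '<mask>': 4,
            '.': 5, 'a': 6, 'b': 7, 'c': 8}

def vocab_to_txt_lines(vocab):
    """Serialize a vocab dict to vocab.txt lines: one token per line, ordered by id."""
    return [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

def vocab_from_txt_lines(lines):
    """The extra step the .txt format requires: rebuild the dict from line order."""
    return {tok: i for i, tok in enumerate(lines)}

# Round-trip: the dictionary survives, but only via an extra parse step that
# the in-module dict avoids entirely.
assert vocab_from_txt_lines(vocab_to_txt_lines(LG_VOCAB)) == LG_VOCAB
```

Keeping the dict in the module makes it directly importable elsewhere; the .txt file shortens the module but adds the serialize/parse round-trip above.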

@karinazad marked this pull request as ready for review on March 10, 2025, 14:03.

from ._make_pretrained_tokenizer_fast import make_pretrained_tokenizer_fast

LG_VOCAB = {
Collaborator:
I think the file is a bit too long when these are defined as a dictionary; it might be better just to use the saved .txt file instead.

@Sidney-Lisanza merged commit 35c691e into main on Mar 10, 2025 (5 checks passed).
@Sidney-Lisanza deleted the structure_tokens branch on March 10, 2025, 17:35.
3 participants