-
Notifications
You must be signed in to change notification settings - Fork 639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Xet download workflow #2875
base: xet-integration
Are you sure you want to change the base?
Xet download workflow #2875
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
…hub into xet-download-workflow
…hub into xet-download-workflow
…hub into xet-download-workflow
…hub into xet-download-workflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments but overall looks in pretty good shape!
@@ -487,6 +493,118 @@ def http_get( | |||
) | |||
|
|||
|
|||
def xet_get( | |||
incomplete_path: Path, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incomplete_path: Path, | |
*, | |
incomplete_path: Path, |
(nit) let's force keyword argument, easier to change things in the future
1. Creates a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks | ||
2. Downloads files in parallel: | ||
2.1. Prepares to write the file to disk | ||
2.2. Asks the server "how is this file split into chunks?" using the file's unique hash | ||
The server responds with: | ||
- Which chunks make up the complete file | ||
- Where each chunk can be downloaded from | ||
2.3. For each needed chunk: | ||
- Checks if we already have it in our local cache | ||
- If not, downloads it from cloud storage (S3) | ||
- Saves it to cache for future use | ||
- Assembles the chunks in order to recreate the original file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Creates a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks | |
2. Downloads files in parallel: | |
2.1. Prepares to write the file to disk | |
2.2. Asks the server "how is this file split into chunks?" using the file's unique hash | |
The server responds with: | |
- Which chunks make up the complete file | |
- Where each chunk can be downloaded from | |
2.3. For each needed chunk: | |
- Checks if we already have it in our local cache | |
- If not, downloads it from cloud storage (S3) | |
- Saves it to cache for future use | |
- Assembles the chunks in order to recreate the original file | |
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks | |
2. Download files in parallel: | |
2.1. Prepare to write the file to disk | |
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash | |
The server responds with: | |
- Which chunks make up the complete file | |
- Where each chunk can be downloaded from | |
2.3. For each needed chunk: | |
- Check if we already have it in our local cache | |
- If not, download it from cloud storage (S3) | |
- Save it to cache for future use | |
- Assemble the chunks in order to recreate the original file |
(personal preference for "instructions")
# Stream file to buffer | ||
progress_cm: tqdm = ( | ||
tqdm( # type: ignore[assignment] | ||
unit="B", | ||
unit_scale=True, | ||
total=expected_size, | ||
initial=0, | ||
desc=displayed_filename, | ||
disable=True if (logger.getEffectiveLevel() == logging.NOTSET) else None, | ||
# ^ set `disable=None` rather than `disable=False` by default to disable progress bar when no TTY attached | ||
# see https://github.com/huggingface/huggingface_hub/pull/2000 | ||
name="huggingface_hub.xet_get", | ||
) | ||
if _tqdm_bar is None | ||
else contextlib.nullcontext(_tqdm_bar) | ||
# ^ `contextlib.nullcontext` mimics a context manager that does nothing | ||
# Makes it easier to use the same code path for both cases but in the later | ||
# case, the progress bar is not closed when exiting the context manager. | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be abstracted in an utility somewhere (same as http_get
)
headers=headers, | ||
expected_size=expected_size, | ||
) | ||
if xet_metadata is not None and xet_metadata.file_hash is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should either:
- make
hf_xet
a default dependency - or use the xet-path only if
hf_xet
is installed.
For now, I'd go with 2. so it's an opt-in process. The problem with current workflow (AFAIU) is that if hf_xet
is not installed and a repo becomes "xet-enabled" then users won't be able to download things, even though they haven't changed anything to their setup.
Feel free to ignore if I misunderstood it 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for using the xet-path only if hf_xet
is installed.
If we ensure that it's installed and the repo becomes "xet-enabled", the user can still download the content even without hf_xet
installed. We have backwards compatibility built into our service to support these cases. The service streams the entire file for download if you hit that endpoint.
from ._xet import ( | ||
XetMetadata, | ||
parse_xet_headers, | ||
refresh_xet_metadata, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from ._xet import ( | |
XetMetadata, | |
parse_xet_headers, | |
refresh_xet_metadata, | |
) | |
from ._xet import XetMetadata, parse_xet_headers, refresh_xet_metadata |
(nit)
constants.HUGGINGFACE_HEADER_X_XET_ENDPOINT: "https://xet.example.com", | ||
constants.HUGGINGFACE_HEADER_X_XET_ACCESS_TOKEN: "xet_token_abc", | ||
constants.HUGGINGFACE_HEADER_X_XET_EXPIRATION: "1234567890", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not use constants in this test module to make the expected headers more readable (and also detect issue if constant value is updated without a good reason)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice self-contained tests!
assert xet_metadata.file_hash is not None | ||
assert xet_metadata.refresh_route is not None | ||
|
||
def test_basic_download(self, tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to test file has been downloaded using Xet? With something like
with patch("huggingface_hub.file_download.get_hf_file_metadata", side_effect=huggingface_hub.file_download.get_hf_file_metadata)
this way should should have the best of both worlds: test that xet is used (as done above but without mocked values) + test that download worked
displayed_filename=filename, | ||
) | ||
|
||
# TODO: xetpoc - the http_get path is building this out, so we're replicating that logic here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we could either address this TODO or remove it if the logic fits in the greater download action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. The test coverage is fantastic.
The condition checking for hf_xet
installation is the main blocker.
Partially resolves #2713.
This PR adds the Xet download workflow implemented in xetpoc_huggingface_hub (internal). The upload one will be integrated in a separate PR
The main branch for xet storage integration is xet-integration.
Main changes:
hf_xet
available as an optional dependency viapip install huggingface_hub[hf_xet]
Note: since it's a common part for download and upload, this has been pushed directly into xet-integration branch.
to try it in from this branch:
cc @bpronan @assafvayner @rajatarya