Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xet download workflow #2875

Draft
wants to merge 19 commits into
base: xet-integration
Choose a base branch
from
Draft

Conversation

hanouticelina
Copy link
Contributor

@hanouticelina hanouticelina commented Feb 18, 2025

Partially resolves #2713.

This PR adds the Xet download workflow implemented in xetpoc_huggingface_hub (internal). The upload one will be integrated in a separate PR

The main branch for xet storage integration is xet-integration.

Main changes:

  • Make hf_xet available as an optional dependency via pip install huggingface_hub[hf_xet]
    Note: since it's a common part for download and upload, this has been pushed directly into xet-integration branch.
  • Integrate changes from the xet poc for the download workflow only.
  • Add tests.
  • Add documentation.

to try it in from this branch:

pip install -e ".[dev,hf_xet]"
export HF_DEBUG=1 #  if you want to set huggingface_hub logger to debug level
huggingface-cli download huggingface/distilbert-base-uncased-xet

cc @bpronan @assafvayner @rajatarya

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments but overall looks in pretty good shape!

@@ -487,6 +493,118 @@ def http_get(
)


def xet_get(
incomplete_path: Path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
incomplete_path: Path,
*,
incomplete_path: Path,

(nit) let's force keyword argument, easier to change things in the future

Comment on lines +535 to +546
1. Creates a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Downloads files in parallel:
2.1. Prepares to write the file to disk
2.2. Asks the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, downloads it from cloud storage (S3)
- Saves it to cache for future use
- Assembles the chunks in order to recreate the original file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Creates a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Downloads files in parallel:
2.1. Prepares to write the file to disk
2.2. Asks the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Checks if we already have it in our local cache
- If not, downloads it from cloud storage (S3)
- Saves it to cache for future use
- Assembles the chunks in order to recreate the original file
1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
2. Download files in parallel:
2.1. Prepare to write the file to disk
2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
The server responds with:
- Which chunks make up the complete file
- Where each chunk can be downloaded from
2.3. For each needed chunk:
- Check if we already have it in our local cache
- If not, download it from cloud storage (S3)
- Save it to cache for future use
- Assemble the chunks in order to recreate the original file

(personal preference for "instructions")

Comment on lines +574 to +592
# Stream file to buffer
progress_cm: tqdm = (
tqdm( # type: ignore[assignment]
unit="B",
unit_scale=True,
total=expected_size,
initial=0,
desc=displayed_filename,
disable=True if (logger.getEffectiveLevel() == logging.NOTSET) else None,
# ^ set `disable=None` rather than `disable=False` by default to disable progress bar when no TTY attached
# see https://github.com/huggingface/huggingface_hub/pull/2000
name="huggingface_hub.xet_get",
)
if _tqdm_bar is None
else contextlib.nullcontext(_tqdm_bar)
# ^ `contextlib.nullcontext` mimics a context manager that does nothing
# Makes it easier to use the same code path for both cases but in the later
# case, the progress bar is not closed when exiting the context manager.
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be abstracted in an utility somewhere (same as http_get)

headers=headers,
expected_size=expected_size,
)
if xet_metadata is not None and xet_metadata.file_hash is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should either:

  1. make hf_xet a default dependency
  2. or use the xet-path only if hf_xet is installed.

For now, I'd go with 2. so it's an opt-in process. The problem with current workflow (AFAIU) is that if hf_xet is not installed and a repo becomes "xet-enabled" then users won't be able to download things, even though they haven't changed anything to their setup.

Feel free to ignore if I misunderstood it 😬

Copy link
Collaborator

@bpronan bpronan Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using the xet-path only if hf_xet is installed.

If we ensure that it's installed and the repo becomes "xet-enabled", the user can still download the content even without hf_xet installed. We have backwards compatibility built into our service to support these cases. The service streams the entire file for download if you hit that endpoint.

Comment on lines +110 to +114
from ._xet import (
XetMetadata,
parse_xet_headers,
refresh_xet_metadata,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from ._xet import (
XetMetadata,
parse_xet_headers,
refresh_xet_metadata,
)
from ._xet import XetMetadata, parse_xet_headers, refresh_xet_metadata

(nit)

Comment on lines +16 to +18
constants.HUGGINGFACE_HEADER_X_XET_ENDPOINT: "https://xet.example.com",
constants.HUGGINGFACE_HEADER_X_XET_ACCESS_TOKEN: "xet_token_abc",
constants.HUGGINGFACE_HEADER_X_XET_EXPIRATION: "1234567890",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use constants in this test module to make the expected headers more readable (and also detect issue if constant value is updated without a good reason)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice self-contained tests!

assert xet_metadata.file_hash is not None
assert xet_metadata.refresh_route is not None

def test_basic_download(self, tmp_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to test file has been downloaded using Xet? With something like

with patch("huggingface_hub.file_download.get_hf_file_metadata", side_effect=huggingface_hub.file_download.get_hf_file_metadata)

this way should should have the best of both worlds: test that xet is used (as done above but without mocked values) + test that download worked

displayed_filename=filename,
)

# TODO: xetpoc - the http_get path is building this out, so we're replicating that logic here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we could either address this TODO or remove it if the logic fits in the greater download action.

Copy link
Collaborator

@bpronan bpronan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. The test coverage is fantastic.

The condition checking for hf_xet installation is the main blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants