Xet download workflow #2875

hanouticelina · 2025-02-18T17:35:57Z

Partially resolves #2713.

This PR adds the Xet download workflow implemented in xetpoc_huggingface_hub (internal). The upload workflow will be integrated in a separate PR.

The main branch for xet storage integration is xet-integration.

Main changes:

- Make hf_xet available as an optional dependency via pip install huggingface_hub[hf_xet].
  Note: since this part is common to download and upload, it has been pushed directly to the xet-integration branch.
- Integrate changes from the Xet PoC for the download workflow only.
- Add tests.
- Add documentation.

To try it from this branch:

pip install -e ".[dev,hf_xet]"
export HF_DEBUG=1 #  if you want to set huggingface_hub logger to debug level
huggingface-cli download huggingface/distilbert-base-uncased-xet

cc @bpronan @assafvayner @rajatarya

HuggingFaceDocBuilderDev · 2025-02-18T17:39:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wauplin

Left some comments but overall looks in pretty good shape!

Wauplin · 2025-02-28T17:15:28Z

src/huggingface_hub/file_download.py

@@ -487,6 +493,118 @@ def http_get(
        )


+def xet_get(
+    incomplete_path: Path,


Suggested change

-    incomplete_path: Path,
+    *,
+    incomplete_path: Path,

(nit) Let's force keyword arguments; it makes things easier to change in the future.
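To illustrate the suggestion: a bare `*` in a Python signature makes every parameter after it keyword-only. The sketch below uses a hypothetical `fetch_file` function, not the actual `xet_get` signature (only part of which is visible in this hunk).

```python
from pathlib import Path


def fetch_file(*, incomplete_path: Path, expected_size: int = 0) -> str:
    """Hypothetical function: every parameter after the bare `*` is keyword-only."""
    return f"{incomplete_path.name}:{expected_size}"


# Callers must name each argument, so parameters can later be reordered or
# added without silently breaking positional call sites.
summary = fetch_file(incomplete_path=Path("model.bin.incomplete"), expected_size=42)

try:
    fetch_file(Path("model.bin.incomplete"))  # positional call is rejected
except TypeError as err:
    positional_error = str(err)
```

Because every call site has to spell out the parameter names, later signature changes fail loudly instead of shifting arguments into the wrong slot.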

Wauplin · 2025-02-28T17:19:21Z

src/huggingface_hub/file_download.py

+        1. Creates a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
+        2. Downloads files in parallel:
+            2.1. Prepares to write the file to disk
+            2.2. Asks the server "how is this file split into chunks?" using the file's unique hash
+                The server responds with:
+                - Which chunks make up the complete file
+                - Where each chunk can be downloaded from
+            2.3. For each needed chunk:
+                - Checks if we already have it in our local cache
+                - If not, downloads it from cloud storage (S3)
+                - Saves it to cache for future use
+                - Assembles the chunks in order to recreate the original file


Suggested change

-        1. Creates a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
-        2. Downloads files in parallel:
-            2.1. Prepares to write the file to disk
-            2.2. Asks the server "how is this file split into chunks?" using the file's unique hash
-                The server responds with:
-                - Which chunks make up the complete file
-                - Where each chunk can be downloaded from
-            2.3. For each needed chunk:
-                - Checks if we already have it in our local cache
-                - If not, downloads it from cloud storage (S3)
-                - Saves it to cache for future use
-                - Assembles the chunks in order to recreate the original file
+        1. Create a local cache folder at `~/.cache/huggingface/xet/chunk-cache` to store reusable file chunks
+        2. Download files in parallel:
+            2.1. Prepare to write the file to disk
+            2.2. Ask the server "how is this file split into chunks?" using the file's unique hash
+                The server responds with:
+                - Which chunks make up the complete file
+                - Where each chunk can be downloaded from
+            2.3. For each needed chunk:
+                - Check if we already have it in our local cache
+                - If not, download it from cloud storage (S3)
+                - Save it to cache for future use
+                - Assemble the chunks in order to recreate the original file

(personal preference for "instructions")
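For readers unfamiliar with chunk-based downloads, the steps in this docstring can be sketched roughly as below. This is a toy model under stated assumptions, not the hf_xet implementation: `CHUNK_STORE`, `FILE_INDEX`, `publish` and `download` are hypothetical names, the "server" and "cloud storage" are in-memory dictionaries, chunks are fixed-size for brevity, and files are processed sequentially rather than in parallel.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical in-memory stand-ins for the remote services (not the real Xet API).
CHUNK_STORE = {}     # chunk hash -> chunk bytes (plays the role of cloud storage)
FILE_INDEX = {}      # file hash -> ordered list of chunk hashes (plays the server)
remote_fetches = []  # records every chunk that had to be fetched remotely


def publish(data: bytes, chunk_size: int = 4) -> str:
    """Split data into fixed-size chunks, register them, and return the file hash."""
    chunk_hashes = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        CHUNK_STORE[digest] = chunk
        chunk_hashes.append(digest)
    file_hash = hashlib.sha256(data).hexdigest()
    FILE_INDEX[file_hash] = chunk_hashes
    return file_hash


def download(file_hash: str, destination: Path, cache_dir: Path) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)   # 1. create the chunk cache
    with destination.open("wb") as out:            # 2.1 prepare to write the file
        for digest in FILE_INDEX[file_hash]:       # 2.2 ask which chunks are needed
            cached = cache_dir / digest
            if not cached.exists():                # 2.3 fetch only missing chunks
                remote_fetches.append(digest)
                cached.write_bytes(CHUNK_STORE[digest])
            out.write(cached.read_bytes())         # assemble chunks in order


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    file_hash = publish(b"abcdabcdefgh")  # chunks: "abcd", "abcd", "efgh"
    download(file_hash, root / "file.bin", root / "chunk-cache")
    restored = (root / "file.bin").read_bytes()
    first_run_fetches = len(remote_fetches)
    download(file_hash, root / "copy.bin", root / "chunk-cache")
    second_run_fetches = len(remote_fetches) - first_run_fetches
```

In this toy run the repeated "abcd" chunk is fetched only once, and the second download is served entirely from the local cache, which is the reuse the docstring describes.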

Wauplin · 2025-02-28T17:21:31Z

src/huggingface_hub/file_download.py

+    # Stream file to buffer
+    progress_cm: tqdm = (
+        tqdm(  # type: ignore[assignment]
+            unit="B",
+            unit_scale=True,
+            total=expected_size,
+            initial=0,
+            desc=displayed_filename,
+            disable=True if (logger.getEffectiveLevel() == logging.NOTSET) else None,
+            # ^ set `disable=None` rather than `disable=False` by default to disable progress bar when no TTY attached
+            # see https://github.com/huggingface/huggingface_hub/pull/2000
+            name="huggingface_hub.xet_get",
+        )
+        if _tqdm_bar is None
+        else contextlib.nullcontext(_tqdm_bar)
+        # ^ `contextlib.nullcontext` mimics a context manager that does nothing
+        #   Makes it easier to use the same code path for both cases, but in the latter
+        #   case, the progress bar is not closed when exiting the context manager.
+    )


This could be abstracted into a utility somewhere (same as in http_get).
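One possible shape for such a utility is sketched below. The helper name `progress_bar_context` and the `FakeBar` class are hypothetical and only stand in for tqdm so the example has no third-party dependency; the real helper would build the `tqdm(...)` instance shown in the hunk.

```python
import contextlib
from typing import Any, Callable, ContextManager, Optional


class FakeBar:
    """Minimal stand-in for a tqdm progress bar (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.n = 0
        self.closed = False

    def update(self, amount: int) -> None:
        self.n += amount

    def __enter__(self) -> "FakeBar":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.closed = True


def progress_bar_context(
    make_bar: Callable[[], Any], existing_bar: Optional[Any] = None
) -> ContextManager[Any]:
    """Reuse `existing_bar` when one is passed in, otherwise create a new bar."""
    if existing_bar is not None:
        # nullcontext yields the bar unchanged and does not close it on exit.
        return contextlib.nullcontext(existing_bar)
    return make_bar()


# Case 1: no bar supplied, so a new one is created and closed on exit.
with progress_bar_context(FakeBar) as new_bar:
    new_bar.update(3)

# Case 2: a caller-owned bar is reused and left open for the caller to close.
shared_bar = FakeBar()
with progress_bar_context(FakeBar, existing_bar=shared_bar) as bar:
    bar.update(5)
```

Both `http_get` and `xet_get` could then share the same code path and differ only in the factory they pass.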

Wauplin · 2025-02-28T17:25:37Z

src/huggingface_hub/file_download.py

-            headers=headers,
-            expected_size=expected_size,
-        )
+        if xet_metadata is not None and xet_metadata.file_hash is not None:


I think we should either:

1. make hf_xet a default dependency, or
2. use the xet path only if hf_xet is installed.

For now, I'd go with 2. so it's an opt-in process. The problem with the current workflow (AFAIU) is that if hf_xet is not installed and a repo becomes "xet-enabled", then users won't be able to download things, even though they haven't changed anything in their setup.

Feel free to ignore if I misunderstood it 😬

+1 for using the xet-path only if hf_xet is installed.

If we check that it's installed before taking the xet path, then when a repo becomes "xet-enabled" the user can still download the content even without hf_xet installed. We have backwards compatibility built into our service to support these cases: the service streams the entire file for download if you hit that endpoint.
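A minimal sketch of the opt-in check discussed above, assuming the decision depends only on whether a Xet file hash is present and whether the `hf_xet` package can be imported. The function names are hypothetical and not part of huggingface_hub; `importlib.util.find_spec` is the standard-library way to probe for a package without importing it.

```python
import importlib.util
from typing import Callable, Optional


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_download_path(
    file_hash: Optional[str],
    is_available: Callable[[str], bool] = is_package_available,
) -> str:
    """Hypothetical helper: take the xet path only when it is both possible and opted into."""
    if file_hash is not None and is_available("hf_xet"):
        return "xet"
    # Fall back to the regular HTTP download, which the service keeps supporting.
    return "http"


# The availability check is injectable so both branches can be exercised in tests.
without_hf_xet = choose_download_path("abc123", is_available=lambda name: False)
with_hf_xet = choose_download_path("abc123", is_available=lambda name: True)
not_xet_enabled = choose_download_path(None, is_available=lambda name: True)
```

With this shape, a repo becoming "xet-enabled" changes nothing for users who have not installed hf_xet.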

Wauplin · 2025-02-28T17:27:02Z

src/huggingface_hub/utils/__init__.py

+from ._xet import (
+    XetMetadata,
+    parse_xet_headers,
+    refresh_xet_metadata,
+)


Suggested change

-from ._xet import (
-    XetMetadata,
-    parse_xet_headers,
-    refresh_xet_metadata,
-)
+from ._xet import XetMetadata, parse_xet_headers, refresh_xet_metadata

(nit)

Wauplin · 2025-02-28T17:30:41Z

tests/test_xet_utils.py

+        constants.HUGGINGFACE_HEADER_X_XET_ENDPOINT: "https://xet.example.com",
+        constants.HUGGINGFACE_HEADER_X_XET_ACCESS_TOKEN: "xet_token_abc",
+        constants.HUGGINGFACE_HEADER_X_XET_EXPIRATION: "1234567890",


Let's not use constants in this test module: literal values make the expected headers more readable, and the test will also catch the case where a constant's value is updated without a good reason.
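A sketch of what this could look like. Everything here is a placeholder: the header names, the `ParsedXetHeaders` dataclass and the `parse_headers` function are hypothetical stand-ins, not the actual constant values or the real `parse_xet_headers` implementation.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical constants; the real header names may differ.
HEADER_ENDPOINT = "X-Xet-Endpoint"
HEADER_ACCESS_TOKEN = "X-Xet-Access-Token"
HEADER_EXPIRATION = "X-Xet-Expiration"


@dataclass
class ParsedXetHeaders:
    endpoint: str
    access_token: str
    expiration_unix_epoch: int


def parse_headers(headers: Mapping[str, str]) -> Optional[ParsedXetHeaders]:
    """Hypothetical parser: returns None if a header is missing or malformed."""
    try:
        return ParsedXetHeaders(
            endpoint=headers[HEADER_ENDPOINT],
            access_token=headers[HEADER_ACCESS_TOKEN],
            expiration_unix_epoch=int(headers[HEADER_EXPIRATION]),
        )
    except (KeyError, ValueError):
        return None


def test_parse_valid_headers() -> None:
    # Literal header names: easy to read, and the test fails if a constant is renamed.
    headers = {
        "X-Xet-Endpoint": "https://xet.example.com",
        "X-Xet-Access-Token": "xet_token_abc",
        "X-Xet-Expiration": "1234567890",
    }
    expected = ParsedXetHeaders("https://xet.example.com", "xet_token_abc", 1234567890)
    assert parse_headers(headers) == expected


test_parse_valid_headers()
```

If someone later changes one of the constants, the literal keys in the test no longer match and the assertion fails, which is the safety net the comment asks for.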

Wauplin · 2025-02-28T17:31:40Z

tests/test_xet_utils.py

Nice self-contained tests!

Wauplin · 2025-02-28T17:36:08Z

tests/test_xet_download.py

+        assert xet_metadata.file_hash is not None
+        assert xet_metadata.refresh_route is not None
+
+    def test_basic_download(self, tmp_path):


Is it possible to test that the file has been downloaded using Xet? With something like:

with patch("huggingface_hub.file_download.get_hf_file_metadata", side_effect=huggingface_hub.file_download.get_hf_file_metadata):

This way we should have the best of both worlds: test that Xet is used (as done above, but without mocked values) and test that the download worked.
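The idea is to patch a function with a mock whose `side_effect` is the real function, so calls are recorded while the real behavior is preserved. A self-contained sketch follows; `fake_module`, `get_metadata` and `download` are hypothetical stand-ins for `huggingface_hub.file_download` and its functions.

```python
import types
from unittest.mock import patch

# Hypothetical stand-in for a module exposing a metadata lookup and a download function.
fake_module = types.SimpleNamespace()


def get_metadata(url: str) -> dict:
    return {"url": url, "xet_hash": "abc123"}


def download(url: str) -> str:
    # Looks the function up on the module at call time, so patching takes effect.
    metadata = fake_module.get_metadata(url)
    return f"downloaded via xet:{metadata['xet_hash']}"


fake_module.get_metadata = get_metadata
fake_module.download = download

# Capture the real function before patching, then let the mock delegate to it.
real_get_metadata = fake_module.get_metadata
with patch.object(fake_module, "get_metadata", side_effect=real_get_metadata) as spy:
    result = fake_module.download("https://example.com/file")

# The spy recorded the call, and the real (unmocked) return value was used.
spy.assert_called_once_with("https://example.com/file")
```

The test can then assert both on the recorded call (Xet metadata was requested) and on the actual downloaded content.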

bpronan · 2025-02-28T19:24:01Z

src/huggingface_hub/file_download.py

+                displayed_filename=filename,
+            )
+
+            # TODO: xetpoc - the http_get path is building this out, so we're replicating that logic here


I would prefer if we could either address this TODO or remove it if the logic fits in the greater download action.

bpronan

Looks great. The test coverage is fantastic.

The condition checking for hf_xet installation is the main blocker.

hanouticelina added 7 commits February 18, 2025 18:14

first draft

d189bec

Merge branch 'xet-integration' into xet-download-workflow

1fcc513

remove comment

3b606d7

hf_xet instead of xet

dc577a4

update docstring

1b581fd

fix

cf46ff0

update docstring

2df7249

hanouticelina added 8 commits February 18, 2025 18:40

simplify typing

5b2546a

quality

a917bf0

add logging

5168d50

fix tests

f50ee33

Merge branch 'xet-integration' of github.com:huggingface/huggingface_…

bff08cc

…hub into xet-download-workflow

Merge branch 'xet-integration' into xet-download-workflow

9c23350

add unit tests for xet utilities

54b40aa

first draft of download testing

6eac826

Wauplin reviewed Feb 27, 2025


hanouticelina added 4 commits February 27, 2025 14:27

Merge branch 'xet-integration' of github.com:huggingface/huggingface_…

9ac3838

…hub into xet-download-workflow

more tests

0813a96

Merge branch 'xet-integration' of github.com:huggingface/huggingface_…

9da9591

…hub into xet-download-workflow

Merge branch 'xet-integration' of github.com:huggingface/huggingface_…

36166ce

…hub into xet-download-workflow

Wauplin reviewed Feb 28, 2025


bpronan reviewed Feb 28, 2025
