
Create Vectors Dataset and Upload Index Tasks #13

Merged (1 commit) on Feb 28, 2025

Conversation

@rchitale7 (Member) commented Feb 25, 2025

Description

Draft PR for #12.

This PR implements the create_vectors_dataset and upload_index tasks. These two tasks, along with the build_index task, are used to construct a knn index on hardware accelerators for an OpenSearch cluster. At a high level, create_vectors_dataset downloads vectors from the Remote Store repository, while upload_index uploads a knn index to the Remote Store repository. Both tasks use an Object Store Client to interact with the Remote Store repository; only the s3 implementation of the Object Store Client is covered in this PR.

Note that the s3 read_blob function uses download_fileobj to read a blob directly into an in-memory buffer. This avoids an unnecessary write to disk, since the build_index task already expects the vector and doc id blobs as arrays in memory. np.frombuffer is then used to convert the buffer into a numpy array in place, avoiding the creation of an extra copy.
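
A minimal sketch of this in-memory path, assuming an s3 client; the bucket, key, and float32 dtype below are illustrative placeholders rather than the PR's actual parameters:

```python
from io import BytesIO

import boto3
import numpy as np

s3_client = boto3.client("s3", region_name="us-west-2")

vector_buffer = BytesIO()
# download_fileobj streams the object straight into the in-memory buffer,
# so nothing is written to local disk.
s3_client.download_fileobj("my-remote-store-bucket", "vectors/blob.knnvec", vector_buffer)

# np.frombuffer wraps the buffer's contents as a numpy array view rather than
# allocating a second copy (the exact zero-copy handoff is discussed in the
# review thread below).
vectors = np.frombuffer(vector_buffer.getbuffer(), dtype=np.float32)
```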

Out of scope for this PR:

  • Remote Store object checksum validation (will come in a separate PR, but part of the same issue)
  • build_index task (will come in a separate PR and issue)
  • Support for additional repositories beyond s3, such as Azure Blob Storage (will come as a separate PR and issue)

Checklist

  • First pass review
  • Unit tests
  • Documentation
  • Benchmarking with 30 GB vector dataset

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rchitale7 added the enhancement (New feature or request) label on Feb 25, 2025
@rchitale7 self-assigned this on Feb 25, 2025
@rchitale7 marked this pull request as ready for review on February 25, 2025 18:01
@rchitale7 marked this pull request as draft on February 25, 2025 18:06
@rchitale7 force-pushed the object_store branch 3 times, most recently from b359159 to e9c1dbb, on February 27, 2025 05:37
@rchitale7 marked this pull request as ready for review on February 27, 2025 18:10
@rchitale7 (Member Author) commented:

Currently benchmarking each task with a vector dataset; will post results here once completed.

@navneet1v (Collaborator) commented Feb 27, 2025

> This helps avoid an unnecessary write to disk, as the build_index task already expects the vector and doc id blobs as arrays in memory. np.frombuffer is used to convert the buffer into a numpy array in place, avoiding the creation of an extra copy.

How have we validated that no extra memory is used during the in-place conversion?

@rchitale7 (Member Author) commented Feb 27, 2025

@navneet1v I'll check this during benchmarking. According to the numpy docs, np.frombuffer creates a view into the original object, so it should not use extra memory, but I will confirm.

docs: https://numpy.org/doc/2.1/reference/generated/numpy.frombuffer.html

@navneet1v (Collaborator) commented:

> @navneet1v I'll check this during benchmarking. According to the numpy docs, np.frombuffer creates a view into the original object, so it should not use extra memory, but I will confirm.
>
> docs: https://numpy.org/doc/2.1/reference/generated/numpy.frombuffer.html

Can you test with a small file, like 1 GB, and verify this?

@rchitale7 (Member Author) commented:

@navneet1v thanks for the callout. Although np.frombuffer creates a view of the object, calling read() on a BytesIO object performs a full copy beforehand, so np.frombuffer ends up creating a view of the copy, doubling memory usage. I modified the implementation to call getbuffer() on the BytesIO object instead, which fixed the issue: getbuffer() first creates a view, which can then be passed to np.frombuffer. Tested with a 307 GB file and verified there was no additional memory usage.
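
A condensed sketch of the before/after behavior described here (illustrative only, not the PR's exact code):

```python
from io import BytesIO

import numpy as np

buffer = BytesIO(bytes(4 * 1024))  # stand-in for a downloaded vector blob

# Before: read() materializes a full copy of the buffer's contents, so the
# np.frombuffer view sits on top of duplicated data (~2x peak memory).
copied_view = np.frombuffer(buffer.read(), dtype=np.float32)

# After: getbuffer() returns a memoryview over the BytesIO's internal storage,
# and np.frombuffer wraps that view without any additional allocation.
zero_copy_view = np.frombuffer(buffer.getbuffer(), dtype=np.float32)
```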

@jed326 (Collaborator) left a comment:

Thanks @rchitale7, this PR is looking pretty good. Mostly some minor comments and questions remaining from me.

Comment on lines +79 to +83
object_store_config (Dict[str, Any]): Configuration dictionary containing:
- retries (int): Maximum number of retry attempts (default: 3)
- region (str): AWS region name (default: 'us-west-2')
- transfer_config (Dict[str, Any]): s3 TransferConfig parameters
- debug: Turns on debug mode (default: False)
Collaborator:

Should object_store_config also be its own class, like IndexBuildParameters? Or is the point that it's super generic and a consumer can put whatever parameters they need into it? I'm also not sure if I like debug being used this way, but I think it's fine for now.

@rchitale7 (Member Author):

I thought it should be super generic, because different object store implementations may need different configuration parameters.
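
For illustration, the kind of dictionary a consumer might pass for the s3 implementation; the keys mirror the documented defaults above, and the TransferConfig values are placeholders:

```python
object_store_config = {
    "retries": 3,            # maximum number of retry attempts
    "region": "us-west-2",   # AWS region name
    "transfer_config": {     # passed through to boto3's s3 TransferConfig
        "multipart_chunksize": 16 * 1024 * 1024,
        "max_concurrency": 8,
    },
    "debug": False,          # turns on debug mode
}
```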

Comment on lines +8 to +18
"""
core.tasks
~~~~~~~~~~~~~~~~~

This module contains the tasks necessary to build an index on GPUs.
These tasks must be run in the following sequence, for a given build request:
1. create_vectors_dataset
2. build_index
3. upload_index

"""
Collaborator:

Is there going to be a "parent" method that calls all of these tasks in sequence, or is that left up to a consumer?

@rchitale7 (Member Author):

Yeah, I will write a 'main.py' that simply calls each of the tasks in sequence. But I wanted to keep the tasks in their own files, in case a consumer wants to import them individually and do something different. For example, they may want to stream the VectorDataset from buffer to disk for persistence, and then stream from disk back to buffer before calling build_index.
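
A rough sketch of what such an orchestrator could look like; the function signature and argument names here are assumptions for illustration, not the PR's actual API:

```python
from core.tasks import create_vectors_dataset

def run_build(index_build_params, object_store_config):
    # 1. Pull the vector and doc id blobs from the Remote Store into memory.
    vectors_dataset = create_vectors_dataset(index_build_params, object_store_config)

    # 2. build_index (separate PR/issue) would construct the knn index on the
    #    accelerator from vectors_dataset and produce a local index file.
    # 3. upload_index would then push the finished index back to the Remote Store.
    return vectors_dataset
```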

Comment on lines +83 to +84
object_store.read_blob(index_build_params.vector_path, vector_bytes_buffer)
object_store.read_blob(index_build_params.doc_id_path, doc_id_bytes_buffer)
Collaborator:

We can probably take this as a future improvement, but I think we should be able to parallelize these downloads, as well as the np.frombuffer calls, with respect to each other. For example, there is no need to wait for the vector blob download to complete before creating the doc id numpy array.
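
A hypothetical sketch of that parallelization using a thread pool, so the doc id download does not wait on the vector download; object_store and index_build_params are assumed to exist as in the snippet above, and the dtypes are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import numpy as np

def fetch_as_array(object_store, path, dtype):
    buffer = BytesIO()
    object_store.read_blob(path, buffer)  # interface from this PR
    return np.frombuffer(buffer.getbuffer(), dtype=dtype)

with ThreadPoolExecutor(max_workers=2) as pool:
    vectors_future = pool.submit(
        fetch_as_array, object_store, index_build_params.vector_path, np.float32
    )
    doc_ids_future = pool.submit(
        fetch_as_array, object_store, index_build_params.doc_id_path, np.int32
    )
    vectors = vectors_future.result()
    doc_ids = doc_ids_future.result()
```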

Signed-off-by: Rohan Chitale <rchital@amazon.com>
@rchitale7 (Member Author) commented:

Benchmarking results for a 28.6 GB vector dataset on a g5.4xlarge machine: 117 seconds total time (for vector download and conversion to a numpy array).

I also tested with download_file and np.fromfile instead of download_fileobj and np.frombuffer for the same dataset, and got 119 seconds total time. download_file writes to disk, so in theory it should be slower, but in my test there was actually little difference between the two methods. It's possible that there's a difference for even larger datasets, but I will need to conduct more tests. For now, I'll merge this PR and revisit deeper benchmarking in the future.
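
For reference, the disk-backed variant compared against here would look roughly like this; the bucket, key, local path, and dtype are placeholders:

```python
import boto3
import numpy as np

s3_client = boto3.client("s3", region_name="us-west-2")

# download_file writes the object to local disk first...
s3_client.download_file("my-remote-store-bucket", "vectors/blob.knnvec", "/tmp/vectors.bin")

# ...and np.fromfile then reads it back into a numpy array.
vectors = np.fromfile("/tmp/vectors.bin", dtype=np.float32)
```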

@rchitale7 rchitale7 merged commit 01ac863 into opensearch-project:main Feb 28, 2025
4 checks passed