
Create Vectors Dataset and Upload Index Tasks #13

Merged (1 commit) on Feb 28, 2025

Conversation

@rchitale7 (Member) commented Feb 25, 2025

Description

Draft PR for #12.

This PR implements the create_vectors_dataset and upload_index tasks. These two tasks, along with the build_index task, are used to construct a knn index on hardware accelerators for an OpenSearch cluster. At a high level, create_vectors_dataset downloads vectors from the Remote Store repository, while upload_index uploads a knn index to the Remote Store repository. Both tasks use an Object Store Client to interact with the Remote Store repository; only the s3 implementation of the Object Store Client is covered in this PR.

Note that the s3 read_blob function uses download_fileobj to read a blob directly into an in-memory buffer. This avoids an unnecessary write to disk, since the build_index task already expects the vector and doc id blobs as arrays in memory. np.frombuffer is then used to convert the buffer into a numpy array in place, avoiding the creation of an extra copy.
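
A minimal sketch of this in-memory path, assuming an s3 client; the bucket, key, and float32 dtype below are illustrative placeholders rather than the PR's actual parameters:

```python
from io import BytesIO

import boto3
import numpy as np

s3_client = boto3.client("s3", region_name="us-west-2")

vector_buffer = BytesIO()
# download_fileobj streams the object straight into the in-memory buffer,
# so nothing is written to local disk.
s3_client.download_fileobj("my-remote-store-bucket", "vectors/blob.knnvec", vector_buffer)

# np.frombuffer wraps the buffer's contents as a numpy array view rather than
# allocating a second copy (the exact zero-copy handoff is discussed in the
# review thread below).
vectors = np.frombuffer(vector_buffer.getbuffer(), dtype=np.float32)
```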

Out of scope for this PR:

  • Remote Store object checksum validation (will come in a separate PR, but part of the same issue)
  • build_index task (will come in a separate PR and issue)
  • Support for additional repositories beyond s3, such as Azure Blob Storage (will come as a separate PR and issue)

Checklist

  • First pass review
  • Unit tests
  • Documentation
  • Benchmarking with 30 GB vector dataset

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rchitale7 added the enhancement (New feature or request) label on Feb 25, 2025
@rchitale7 self-assigned this on Feb 25, 2025
@rchitale7 marked this pull request as ready for review on February 25, 2025 18:01
@rchitale7 marked this pull request as draft on February 25, 2025 18:06
@rchitale7 force-pushed the object_store branch 3 times, most recently from b359159 to e9c1dbb, on February 27, 2025 05:37
@rchitale7 marked this pull request as ready for review on February 27, 2025 18:10
@rchitale7 (Member Author) commented:

Currently benchmarking each task with a vector dataset; will post results here once completed.

@navneet1v (Collaborator) commented Feb 27, 2025

> This helps avoid an unnecessary write to disk, as the build_index task already expects the vector and doc id blobs as arrays in memory. np.frombuffer is used to convert the buffer into a numpy array in place, avoiding the creation of an extra copy.

How have we validated that no extra memory is used during the in-place conversion?

@rchitale7 (Member Author) commented Feb 27, 2025

@navneet1v I'll check this during benchmarking. According to the numpy docs, np.frombuffer creates a view into the original object, so it should not use extra memory, but I will confirm.

docs: https://numpy.org/doc/2.1/reference/generated/numpy.frombuffer.html

@navneet1v (Collaborator) commented:

> @navneet1v I'll check this during benchmarking. According to the numpy docs, np.frombuffer creates a view into the original object, so it should not use extra memory, but I will confirm.
>
> docs: https://numpy.org/doc/2.1/reference/generated/numpy.frombuffer.html

Can you test with a small file, like 1 GB, and verify this?

@rchitale7 (Member Author) commented:

@navneet1v thanks for the callout. Although np.frombuffer creates a view of the object, calling read() on a BytesIO object performs a full copy beforehand, so np.frombuffer ends up creating a view of the copy, doubling memory usage. I modified the implementation to call getbuffer() on the BytesIO object instead, which fixed the issue: getbuffer() first creates a view, which can then be passed to np.frombuffer. Tested with a 307 GB file and verified there was no additional memory usage.
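
A condensed sketch of the before/after behavior described here (illustrative only, not the PR's exact code):

```python
from io import BytesIO

import numpy as np

buffer = BytesIO(bytes(4 * 1024))  # stand-in for a downloaded vector blob

# Before: read() materializes a full copy of the buffer's contents, so the
# np.frombuffer view sits on top of duplicated data (~2x peak memory).
copied_view = np.frombuffer(buffer.read(), dtype=np.float32)

# After: getbuffer() returns a memoryview over the BytesIO's internal storage,
# and np.frombuffer wraps that view without any additional allocation.
zero_copy_view = np.frombuffer(buffer.getbuffer(), dtype=np.float32)
```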

@jed326 (Collaborator) left a comment:

Thanks @rchitale7, this PR is looking pretty good. Mostly some minor comments and questions remaining from me.

Comment on lines +79 to +83
object_store_config (Dict[str, Any]): Configuration dictionary containing:
- retries (int): Maximum number of retry attempts (default: 3)
- region (str): AWS region name (default: 'us-west-2')
- transfer_config (Dict[str, Any]): s3 TransferConfig parameters
- debug: Turns on debug mode (default: False)
Collaborator:

Should object_store_config also be its own class, like IndexBuildParameters? Or is the point that it's super generic and a consumer can put whatever parameters they need into it? I'm also not sure if I like debug being used this way, but I think it's fine for now.

@rchitale7 (Member Author):

I thought it should be super generic, because different object store implementations may need different configuration parameters.
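
For illustration, the kind of dictionary a consumer might pass for the s3 implementation; the keys mirror the documented defaults above, and the TransferConfig values are placeholders:

```python
object_store_config = {
    "retries": 3,            # maximum number of retry attempts
    "region": "us-west-2",   # AWS region name
    "transfer_config": {     # passed through to boto3's s3 TransferConfig
        "multipart_chunksize": 16 * 1024 * 1024,
        "max_concurrency": 8,
    },
    "debug": False,          # turns on debug mode
}
```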

Comment on lines +8 to +18
"""
core.tasks
~~~~~~~~~~~~~~~~~

This module contains the tasks necessary to build an index on GPUs.
These tasks must be run in the following sequence, for a given build request:
1. create_vectors_dataset
2. build_index
3. upload_index

"""
Collaborator:

Is there going to be a "parent" method that calls all of these tasks in sequence, or is that left up to a consumer?

@rchitale7 (Member Author):

Yeah, I will write a 'main.py' that simply calls each of the tasks in sequence. But I wanted to keep the tasks in their own files, in case a consumer wants to import them individually and do something different. For example, they may want to stream the VectorDataset from buffer to disk for persistence, and then stream from disk back to buffer before calling build_index.
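
A rough sketch of what such an orchestrator could look like; the function signature and argument names here are assumptions for illustration, not the PR's actual API:

```python
from core.tasks import create_vectors_dataset

def run_build(index_build_params, object_store_config):
    # 1. Pull the vector and doc id blobs from the Remote Store into memory.
    vectors_dataset = create_vectors_dataset(index_build_params, object_store_config)

    # 2. build_index (separate PR/issue) would construct the knn index on the
    #    accelerator from vectors_dataset and produce a local index file.
    # 3. upload_index would then push the finished index back to the Remote Store.
    return vectors_dataset
```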

Comment on lines +83 to +84
object_store.read_blob(index_build_params.vector_path, vector_bytes_buffer)
object_store.read_blob(index_build_params.doc_id_path, doc_id_bytes_buffer)
Collaborator:

We can probably take this as a future improvement, but I think we should be able to parallelize these downloads, as well as the np.frombuffer calls, with respect to each other. For example, there is no need to wait for the vector blob download to complete before creating the doc id numpy array.
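
A hypothetical sketch of that parallelization using a thread pool, so the doc id download does not wait on the vector download; object_store and index_build_params are assumed to exist as in the snippet above, and the dtypes are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import numpy as np

def fetch_as_array(object_store, path, dtype):
    buffer = BytesIO()
    object_store.read_blob(path, buffer)  # interface from this PR
    return np.frombuffer(buffer.getbuffer(), dtype=dtype)

with ThreadPoolExecutor(max_workers=2) as pool:
    vectors_future = pool.submit(
        fetch_as_array, object_store, index_build_params.vector_path, np.float32
    )
    doc_ids_future = pool.submit(
        fetch_as_array, object_store, index_build_params.doc_id_path, np.int32
    )
    vectors = vectors_future.result()
    doc_ids = doc_ids_future.result()
```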

Signed-off-by: Rohan Chitale <rchital@amazon.com>
@rchitale7 (Member Author) commented:

Benchmarking results for a 28.6 GB vector dataset on a g5.4xlarge machine: 117 seconds total time (for vector download and conversion to a numpy array).

I also tested with download_file and np.fromfile instead of download_fileobj and np.frombuffer for the same dataset, and got 119 seconds total time. download_file writes to disk, so in theory it should be slower, but in my test there was actually little difference between the two methods. It's possible that there's a difference for even larger datasets, but I will need to conduct more tests. For now, I'll merge this PR and revisit deeper benchmarking in the future.
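
For reference, the disk-backed variant compared against here would look roughly like this; the bucket, key, local path, and dtype are placeholders:

```python
import boto3
import numpy as np

s3_client = boto3.client("s3", region_name="us-west-2")

# download_file writes the object to local disk first...
s3_client.download_file("my-remote-store-bucket", "vectors/blob.knnvec", "/tmp/vectors.bin")

# ...and np.fromfile then reads it back into a numpy array.
vectors = np.fromfile("/tmp/vectors.bin", dtype=np.float32)
```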

@rchitale7 rchitale7 merged commit 01ac863 into opensearch-project:main Feb 28, 2025
4 checks passed