Create Vectors Dataset and Upload Index Tasks #13
Conversation
Currently benchmarking each task with a vector dataset; will post results here once completed.
How have we validated that no extra memory is used during the in-place conversion?
@navneet1v I'll check this during benchmarking. According to the numpy docs: https://numpy.org/doc/2.1/reference/generated/numpy.frombuffer.html
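A minimal sketch (not from this PR) of one way to check the no-copy claim: np.frombuffer returns a view over the existing buffer rather than new storage, which np.shares_memory can confirm:

```python
import io
import numpy as np

# Write some bytes into an in-memory buffer, as the download path does.
buf = io.BytesIO(np.arange(6, dtype=np.float32).tobytes())

raw = buf.getbuffer()                       # zero-copy memoryview of the BytesIO contents
arr = np.frombuffer(raw, dtype=np.float32)  # interprets those bytes in place

print(arr)                         # [0. 1. 2. 3. 4. 5.]
print(np.shares_memory(arr, raw))  # True: no second copy was allocated
```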
Can you test with a small file, like 1 GB, and verify this?
@navneet1v thanks for the call out. Although …
Thanks @rchitale7, this PR is looking pretty good. Mostly some minor comments and questions remaining from me.
object_store_config (Dict[str, Any]): Configuration dictionary containing:
    - retries (int): Maximum number of retry attempts (default: 3)
    - region (str): AWS region name (default: 'us-west-2')
    - transfer_config (Dict[str, Any]): s3 TransferConfig parameters
    - debug: Turns on debug mode (default: False)
Should object_store_config also be its own class, like IndexBuildParameters? Or is the point that it's super generic, so a consumer can put whatever parameters they need into it? I'm also not sure if I like debug being used this way, but I think it's fine for now.
I thought it should be super generic, because different object store implementations may need different configuration parameters.
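For illustration only (this example is not from the PR; the keys follow the docstring above and the values are made up), a consumer might assemble the config like this:

```python
# Hypothetical object_store_config; keys follow the documented schema,
# values are illustrative defaults.
object_store_config = {
    "retries": 3,
    "region": "us-west-2",
    "transfer_config": {
        # kwargs that could be forwarded to boto3.s3.transfer.TransferConfig
        "multipart_chunksize": 16 * 1024 * 1024,  # 16 MiB parts
        "max_concurrency": 8,
    },
    "debug": False,
}
```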
""" | ||
core.tasks | ||
~~~~~~~~~~~~~~~~~ | ||
|
||
This module contains the tasks necessary to build an index on GPUs. | ||
These tasks must be run in the following sequence, for a given build request: | ||
1. create_vectors_dataset | ||
2. build_index | ||
3. upload_index | ||
|
||
""" |
Is there going to be a "parent" method that calls all of these tasks in sequence, or is that left up to a consumer?
Yeah, I will write a 'main.py' that will simply call each of the tasks in sequence. But I wanted to separate the tasks into their own files, in case the consumer wants to import them individually and do something different. For example, they may want to stream the VectorDataset from buffer to disk for persistence, and then stream from disk back to buffer before calling build_index.
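A rough sketch of what such a main.py could look like (the function signatures here are assumptions based on the task names in the module docstring, not the actual code in this PR):

```python
# Hypothetical driver; signatures are assumed from the documented task sequence.
from core.tasks import create_vectors_dataset, build_index, upload_index

def main(index_build_params, object_store_config):
    # 1. Download the vector and doc id blobs from the remote store into memory
    vectors_dataset = create_vectors_dataset(index_build_params, object_store_config)
    # 2. Build the knn index on the GPU (out of scope for this PR)
    index_path = build_index(index_build_params, vectors_dataset)
    # 3. Upload the finished index back to the remote store
    upload_index(index_build_params, object_store_config, index_path)
```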
object_store.read_blob(index_build_params.vector_path, vector_bytes_buffer)
object_store.read_blob(index_build_params.doc_id_path, doc_id_bytes_buffer)
We can probably take this as a future improvement, but I think we should be able to parallelize these downloads, as well as the np.frombuffer calls, with respect to each other. For example, there is no need to wait for the vector blob download to complete before creating the doc id numpy array.
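As an illustration of the suggestion (a sketch, not code from this PR; object_store and index_build_params are the names from the snippet above), the two blob downloads could be overlapped with a small thread pool, since they are I/O-bound:

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    vec_future = pool.submit(
        object_store.read_blob, index_build_params.vector_path, vector_bytes_buffer
    )
    doc_future = pool.submit(
        object_store.read_blob, index_build_params.doc_id_path, doc_id_bytes_buffer
    )
    # result() blocks and re-raises any exception from the corresponding download
    vec_future.result()
    doc_future.result()
```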
Benchmarking results for a 28.6 GB vector dataset, on a g5.4xlarge machine: 117 seconds total time (for vector download and conversion to numpy array). I also tested with …
Description
Draft PR for #12.
This PR implements the create_vectors_dataset and upload_index tasks. These two tasks - along with the build_index task - are used to construct a knn index on hardware accelerators, for an OpenSearch cluster. At a high level, create_vectors_dataset downloads vectors from the Remote Store repository, while upload_index uploads a knn index to the Remote Store repository. Both tasks utilize an Object Store Client to interact with the Remote Store repository; only the s3 implementation of the Object Store client is covered in this PR.

Note that the s3 readBlob function uses download_fileobj to directly read a blob into a buffer in memory. This helps avoid an unnecessary write to disk, as the build_index task already expects the vector and doc id blobs as arrays in memory. np.frombuffer is used to convert the buffer into a numpy array in place, avoiding the creation of an extra copy.
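As a rough illustration of that flow (the bucket and key names here are made up, and the PR's actual read path goes through the Object Store Client rather than a raw boto3 client):

```python
import io

import boto3
import numpy as np

s3 = boto3.client("s3", region_name="us-west-2")

# Stream the blob directly into an in-memory buffer; nothing is written to disk.
vector_buffer = io.BytesIO()
s3.download_fileobj("my-bucket", "path/to/vectors.bin", vector_buffer)

# Reinterpret the downloaded bytes as float32 vectors without an extra copy.
vectors = np.frombuffer(vector_buffer.getbuffer(), dtype=np.float32)
```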
Out of scope for this PR:

- build_index task (will come in a separate PR and issue)

Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.