
[LuceneOnFaiss - Part3] Added FaissHNSW graph. #2594

Conversation

0ctopus13prime
Collaborator

Description

In this PR, I've added the FaissHNSW graph and a bridge that conforms to Lucene's HNSW graph interface.
With the bridge, we can now use Lucene's HNSW graph searcher to perform vector search on a FAISS index.
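
For readers unfamiliar with Lucene's graph contract, here is a minimal sketch of what such a bridge can look like. It targets the abstract methods of Lucene 9.x's `org.apache.lucene.util.hnsw.HnswGraph`; `FaissHnswData` and its accessors are hypothetical stand-ins for illustration, not the classes merged in this PR.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.hnsw.HnswGraph;

// Hypothetical bridge exposing a parsed FAISS HNSW graph through Lucene's
// HnswGraph contract, so Lucene's HnswGraphSearcher can traverse it.
final class FaissHnswBridge extends HnswGraph {

  // Hypothetical accessor over the parsed FAISS HNSW arrays (illustrative only).
  interface FaissHnswData {
    int totalVectors();
    int numLevels();
    int entryPoint();
    int[] neighborsAt(int level, int internalVectorId);
    int[] nodesOnLevel(int level);
  }

  private final FaissHnswData faiss;
  private int[] neighbors = new int[0]; // neighbor list positioned by seek()
  private int upto;                     // cursor into `neighbors`

  FaissHnswBridge(FaissHnswData faiss) {
    this.faiss = faiss;
  }

  @Override
  public void seek(int level, int target) throws IOException {
    // Position the iterator at `target`'s neighbor list on `level`.
    neighbors = faiss.neighborsAt(level, target);
    upto = 0;
  }

  @Override
  public int nextNeighbor() {
    // An exhausted list signals the searcher with NO_MORE_DOCS, per the contract.
    return upto < neighbors.length ? neighbors[upto++] : DocIdSetIterator.NO_MORE_DOCS;
  }

  @Override
  public int size() {
    return faiss.totalVectors();
  }

  @Override
  public int numLevels() {
    return faiss.numLevels();
  }

  @Override
  public int entryNode() {
    return faiss.entryPoint();
  }

  @Override
  public NodesIterator getNodesOnLevel(int level) {
    int[] nodes = faiss.nodesOnLevel(level);
    return new ArrayNodesIterator(nodes, nodes.length);
  }
}
```

Once a graph implements this contract, Lucene's searcher is agnostic to whether the underlying adjacency lists came from a Lucene segment or a FAISS index.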

Related Issues

RFC: #2401

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Collaborator

@shatejas shatejas left a comment

Will look at the unit tests soon.

Looks good otherwise, minor comments

Signed-off-by: Dooyong Kim <kdooyong@amazon.com>
@0ctopus13prime 0ctopus13prime force-pushed the lucene-on-faiss-part3 branch from 2e7e464 to 517e2c4 Compare March 12, 2025 00:50
Collaborator Author

0ctopus13prime commented Mar 12, 2025

@jmazanec15
Hi Jack, let me share the upper bound on the memory required for the long[] offsets under the current naive approach, and how much that space can be compressed with monotonic encoding.

Conclusion: with the encoding, the required memory shrinks from 7.45GB to 475KB for 1B vectors.

Current Approach
The number of vectors: 1B
Size of offsets required: 7.45GB (1B offsets × 8 bytes per long ≈ 7.45GiB)

With DirectMonotonicWriter
Meta bytes: 163KB
Data bytes: 312KB
Total: 475KB = (163KB + 312KB)

Will make sure to add this encoding in the next PR.
Thank you!
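
For reference, a minimal, self-contained sketch of how Lucene's `DirectMonotonicWriter`/`DirectMonotonicReader` pair can encode and randomly access such offsets. It is scaled down to 1M values, and the fixed 64-byte-per-vector stride and file names are illustrative assumptions, not this PR's actual layout:

```java
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectMonotonicReader;
import org.apache.lucene.util.packed.DirectMonotonicWriter;

public class OffsetEncodingSketch {
  public static void main(String[] args) throws Exception {
    final long numValues = 1_000_000;  // scaled down from 1B for the demo
    final int blockShift = 16;         // 2^16 values per block
    try (Directory dir = new ByteBuffersDirectory()) {
      // Write phase: feed monotonically increasing offsets; the writer stores
      // each block as a linear function plus small per-value deltas.
      try (IndexOutput meta = dir.createOutput("offsets.meta", IOContext.DEFAULT);
           IndexOutput data = dir.createOutput("offsets.data", IOContext.DEFAULT)) {
        DirectMonotonicWriter writer =
            DirectMonotonicWriter.getInstance(meta, data, numValues, blockShift);
        for (long i = 0; i < numValues; i++) {
          writer.add(i * 64);  // hypothetical fixed 64-byte neighbor list per vector
        }
        writer.finish();
      }
      // Read phase: random access to any offset without a heap-resident long[].
      try (IndexInput meta = dir.openInput("offsets.meta", IOContext.DEFAULT);
           IndexInput data = dir.openInput("offsets.data", IOContext.DEFAULT)) {
        DirectMonotonicReader.Meta loadedMeta =
            DirectMonotonicReader.loadMeta(meta, numValues, blockShift);
        LongValues offsets = DirectMonotonicReader.getInstance(
            loadedMeta, data.randomAccessSlice(0, data.length()));
        System.out.println(offsets.get(12_345));  // prints 790080 (12345 * 64)
      }
    }
  }
}
```

Near-linear offset sequences compress extremely well under this scheme because the per-block deltas stay tiny, which is consistent with the meta/data sizes measured above.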

Collaborator

@shatejas shatejas left a comment

Looks good, need one clarification and it's good to go

Comment on lines +135 to +137
// Note that maxLevel=3 indicates that a vector exists at level-0 (bottom), level-1 and level-2.
if (maxLevel > level) {
++numVectorsAtLevel;
Collaborator

Just want to make sure the vectors are not repeated at each level, so we don't count them more than once.

Collaborator Author

@0ctopus13prime 0ctopus13prime Mar 12, 2025

There will be no duplicate counts: the index into the levels array (e.g. i) represents an internal vector id, and we advance the index sequentially, so each vector is examined exactly once.
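
To make that concrete, a minimal sketch of the counting loop under assumed field shapes (levels[i] as the max level of internal vector id i; not the PR's actual fields):

```java
// Minimal sketch with assumed shapes: levels[i] holds the max level of internal
// vector id i, so scanning i = 0..n-1 visits each vector exactly once and no
// vector can be counted twice for a given level.
static int countVectorsAtLevel(int[] levels, int level) {
  int numVectorsAtLevel = 0;
  for (int i = 0; i < levels.length; i++) {
    // maxLevel=3 means vector i exists at level-0 (bottom), level-1 and level-2.
    if (levels[i] > level) {
      ++numVectorsAtLevel;
    }
  }
  return numVectorsAtLevel;
}
```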

@0ctopus13prime 0ctopus13prime merged commit a6beb8c into opensearch-project:lucene-on-faiss Mar 12, 2025
34 checks passed