-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove integrity checking TODO and leave up to the vendor implementation #2578
Conversation
a2f08fe
to
8d1669c
Compare
8d1669c
to
8928fdf
Compare
.../java/org/opensearch/knn/index/codec/nativeindex/remote/DefaultVectorRepositoryAccessor.java
Outdated
Show resolved
Hide resolved
.../java/org/opensearch/knn/index/codec/nativeindex/remote/DefaultVectorRepositoryAccessor.java
Outdated
Show resolved
Hide resolved
8928fdf
to
fbff2dd
Compare
.../java/org/opensearch/knn/index/codec/nativeindex/remote/DefaultVectorRepositoryAccessor.java
Outdated
Show resolved
Hide resolved
So it turns out the java SDK version used by core currently does not perform the checksum by default for s3. I've opened an issue in core to update this, opensearch-project/OpenSearch#17524, I think we can pause on this PR pending discussion there. Generally speaking I do think it's better to offload the responsibility of checksum computation to the vendor SDK, or at least to the vendor repository implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be computing checksum and writing it, Do we need to checkIntrgrity on downloads or is the underlying implementation (blobContainer.read) makes sure of the integrity when read happens?
Actually, it turns out that core bumped the aws sdk to >2.30 less than 1 day ago: opensearch-project/OpenSearch#17396 After pulling in the latest changes on my local, I can see that even the doc id blob, which we are not manually computing a checksum for, has a checksum computed for it automatically by the aws SDK. With that, I am inclined to not perform our own manual checksum computations which involve reading the vector blob twice. This way, we leave it to the vendor SDK implementation to [1] compute and persist the checksum on upload and [2] verify the checksum on download. Will update this PR accordingly. |
Signed-off-by: Jay Deng <jayd0104@gmail.com>
fbff2dd
to
6b46325
Compare
@@ -41,6 +43,7 @@ public DocIdInputStream(KNNVectorValues<?> knnVectorValues) throws IOException { | |||
|
|||
@Override | |||
public int read() throws IOException { | |||
checkClosed(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a safety check as we are not always able to use this stream in a try-with-resources
manner when we are relinquishing control of it to the repository implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how many other streams are implemented, for example: https://github.com/opensearch-project/OpenSearch/blob/73453718a1c85565a3f7e4309d6fa83bf9a30522/plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3RetryingInputStream.java#L157-L160
Description
In this PR we are making the decision to leave integrity checking up to the individual vendor SDK and repository implementations. For example, in opensearch-project/OpenSearch#17396 the aws s3 SDK was bumped to automatically support integrity checking.
Integrity Checking Overview
We are making the decision to leave integrity checking to the vendor implementation in all cases. This section will give a brief overview of how this is done for S3.
When using the aws java sdk and repository-s3, during object upload a CRC32 checksum is automatically computed by the SDK, and validated + persisted to the object store. During object download, if a checksum is present in the object store, the SDK will automatically validate the downloaded object against the checksum found in the object store.
As a future improvement, we can also manually do this integrity checking if the vendor implementation indicates that integrity checking is not supported. For example, remote store does this like so: https://github.com/opensearch-project/OpenSearch/blob/3ea0a31f6f613ddd5b7bde0053195d5b212c813d/server/src/main/java/org/opensearch/common/blobstore/transfer/RemoteTransferContainer.java#L213-L222.
Tracking the future improvement here: #2579
Other
We have separate testing in
KnnVectorValuesInputStreamTests
that provides coverage on if the InputStream is correctly reading theKNNVectorValues
, and we do not need to separately compute a checksum on this.Related Issues
Relates #2465
Relates #2392
Relates #2391
Check List
- [ ] New functionality has been documented.- [ ] API changes companion pull request created.--signoff
.- [ ] Public documentation issue/PR created.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.