Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove integrity checking TODO and leave up to the vendor implementation #2578

Merged
merged 1 commit into from
Mar 6, 2025

Conversation

jed326
Copy link
Contributor

@jed326 jed326 commented Mar 5, 2025

Description

In this PR we are making the decision to leave integrity checking up to the individual vendor SDK and repository implementations. For example, in opensearch-project/OpenSearch#17396 the aws s3 SDK was bumped to automatically support integrity checking.

Integrity Checking Overview

We are making the decision to leave integrity checking to the vendor implementation in all cases. This section will give a brief overview of how this is done for S3.

When using the aws java sdk and repository-s3, during object upload a CRC32 checksum is automatically computed by the SDK, and validated + persisted to the object store. During object download, if a checksum is present in the object store, the SDK will automatically validate the downloaded object against the checksum found in the object store.

As a future improvement, we can also manually do this integrity checking if the vendor implementation indicates that integrity checking is not supported. For example, remote store does this like so: https://github.com/opensearch-project/OpenSearch/blob/3ea0a31f6f613ddd5b7bde0053195d5b212c813d/server/src/main/java/org/opensearch/common/blobstore/transfer/RemoteTransferContainer.java#L213-L222.

Tracking the future improvement here: #2579

Other

We have separate testing in KnnVectorValuesInputStreamTests that provides coverage on if the InputStream is correctly reading the KNNVectorValues, and we do not need to separately compute a checksum on this.

Related Issues

Relates #2465
Relates #2392
Relates #2391

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jed326 jed326 force-pushed the integrity-checking branch from 8928fdf to fbff2dd Compare March 5, 2025 02:40
@jed326
Copy link
Contributor Author

jed326 commented Mar 5, 2025

So it turns out the java SDK version used by core currently does not perform the checksum by default for s3. I've opened an issue in core to update this, opensearch-project/OpenSearch#17524, I think we can pause on this PR pending discussion there.

Generally speaking I do think it's better to offload the responsibility of checksum computation to the vendor SDK, or at least to the vendor repository implementation.

Copy link
Collaborator

@shatejas shatejas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be computing checksum and writing it, Do we need to checkIntrgrity on downloads or is the underlying implementation (blobContainer.read) makes sure of the integrity when read happens?

@shatejas shatejas closed this Mar 5, 2025
@shatejas shatejas reopened this Mar 5, 2025
@jed326
Copy link
Contributor Author

jed326 commented Mar 5, 2025

Actually, it turns out that core bumped the aws sdk to >2.30 less than 1 day ago: opensearch-project/OpenSearch#17396

After pulling in the latest changes on my local, I can see that even the doc id blob, which we are not manually computing a checksum for, has a checksum computed for it automatically by the aws SDK. With that, I am inclined to not perform our own manual checksum computations which involve reading the vector blob twice. This way, we leave it to the vendor SDK implementation to [1] compute and persist the checksum on upload and [2] verify the checksum on download.

Will update this PR accordingly.

@jed326 jed326 force-pushed the integrity-checking branch from fbff2dd to 6b46325 Compare March 5, 2025 19:51
@jed326 jed326 changed the title Add integrity checking to VectorRepositoryAccessor Remove integrity checking TODO and leave up to the vendor implementation Mar 5, 2025
@@ -41,6 +43,7 @@ public DocIdInputStream(KNNVectorValues<?> knnVectorValues) throws IOException {

@Override
public int read() throws IOException {
checkClosed();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a safety check as we are not always able to use this stream in a try-with-resources manner when we are relinquishing control of it to the repository implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneet1v navneet1v merged commit 8faf388 into opensearch-project:main Mar 6, 2025
38 of 40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants