Remove integrity checking TODO and leave up to the vendor implementation #2578

jed326 · 2025-03-05T00:12:26Z

Description

In this PR we leave integrity checking up to the individual vendor SDK and repository implementations. For example, in opensearch-project/OpenSearch#17396 the AWS S3 SDK was bumped to a version that supports integrity checking automatically.

Integrity Checking Overview

Integrity checking is left to the vendor implementation in all cases. This section gives a brief overview of how that works for S3.

When using the AWS Java SDK with repository-s3, the SDK automatically computes a CRC32 checksum during object upload; the checksum is validated and persisted to the object store. During object download, if a checksum is present in the object store, the SDK automatically validates the downloaded object against it.
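To make the upload/download handshake above concrete, here is a minimal sketch using only `java.util.zip.CRC32` from the JDK. The class and method names (`Crc32RoundTrip`, `checksumOf`, `matchesStoredChecksum`) are hypothetical and are not part of the AWS SDK or repository-s3; the SDK does the equivalent work internally.

```java
import java.util.zip.CRC32;

// Illustrative only: mimics the "compute on upload, verify on download" idea.
// None of these names exist in the AWS SDK; they are assumptions for this sketch.
final class Crc32RoundTrip {

    /** Computes the CRC32 of a payload, as a client would before upload. */
    static long checksumOf(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    /** Returns true when the downloaded bytes match the checksum stored with the object. */
    static boolean matchesStoredChecksum(byte[] downloaded, long storedChecksum) {
        return checksumOf(downloaded) == storedChecksum;
    }
}
```

A caller would store the value returned by `checksumOf` alongside the object and pass it to `matchesStoredChecksum` after reading the object back; a single flipped byte makes the comparison fail.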

As a future improvement, we can also perform this integrity checking manually if the vendor implementation indicates that it does not support it. For example, remote store does this here: https://github.com/opensearch-project/OpenSearch/blob/3ea0a31f6f613ddd5b7bde0053195d5b212c813d/server/src/main/java/org/opensearch/common/blobstore/transfer/RemoteTransferContainer.java#L213-L222.

Tracking the future improvement here: #2579
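As a rough idea of what a manual fallback could look like, the sketch below computes a CRC32 while the bytes are being streamed, so the data is read only once. It uses the JDK's `CheckedInputStream`; the `ManualIntegrityCheck` class and its methods are hypothetical and do not reflect the linked `RemoteTransferContainer` code or any planned design for #2579.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper for illustration; not part of k-NN or OpenSearch core.
final class ManualIntegrityCheck {

    /** Copies source to sink and returns the CRC32 of every byte that was copied. */
    static long copyAndChecksum(InputStream source, OutputStream sink) throws IOException {
        // CheckedInputStream updates the checksum as bytes pass through read().
        CheckedInputStream checked = new CheckedInputStream(source, new CRC32());
        byte[] buffer = new byte[8192];
        int read;
        while ((read = checked.read(buffer, 0, buffer.length)) != -1) {
            sink.write(buffer, 0, read);
        }
        return checked.getChecksum().getValue();
    }

    /** Fails with an IOException when the two checksums differ. */
    static void verify(long expected, long actual) throws IOException {
        if (expected != actual) {
            throw new IOException("Checksum mismatch: expected " + expected + " but was " + actual);
        }
    }
}
```

In this sketch the upload path would keep the value returned by `copyAndChecksum`, and the download path would recompute it the same way and call `verify`.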

Other

We have separate testing in KnnVectorValuesInputStreamTests that covers whether the InputStream reads the KNNVectorValues correctly, so we do not need to separately compute a checksum for this.

Related Issues

Relates #2465
Relates #2392
Relates #2391

Check List

New functionality includes testing.
~~- [ ] New functionality has been documented.~~
~~- [ ] API changes companion pull request created.~~
Commits are signed per the DCO using --signoff.
~~- [ ] Public documentation issue/PR created.~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

.../java/org/opensearch/knn/index/codec/nativeindex/remote/DefaultVectorRepositoryAccessor.java

jed326 · 2025-03-05T19:24:28Z

It turns out that the Java SDK version currently used by core does not perform the checksum by default for S3. I've opened an issue in core to update this: opensearch-project/OpenSearch#17524. I think we can pause this PR pending discussion there.

Generally speaking I do think it's better to offload the responsibility of checksum computation to the vendor SDK, or at least to the vendor repository implementation.

shatejas

This seems to compute the checksum and write it. Do we need to check integrity on downloads, or does the underlying implementation (blobContainer.read) ensure integrity when the read happens?

jed326 · 2025-03-05T19:41:57Z

Actually, it turns out that core bumped the AWS SDK to >2.30 less than 1 day ago: opensearch-project/OpenSearch#17396

After pulling in the latest changes locally, I can see that even the doc id blob, for which we do not manually compute a checksum, has a checksum computed automatically by the AWS SDK. With that, I am inclined not to perform our own manual checksum computation, which involves reading the vector blob twice. This way, we leave it to the vendor SDK implementation to [1] compute and persist the checksum on upload and [2] verify the checksum on download.

Will update this PR accordingly.

Signed-off-by: Jay Deng <jayd0104@gmail.com>

navneet1v · 2025-03-06T18:29:50Z

src/main/java/org/opensearch/knn/index/codec/nativeindex/remote/DocIdInputStream.java

@@ -41,6 +43,7 @@ public DocIdInputStream(KNNVectorValues<?> knnVectorValues) throws IOException {

    @Override
    public int read() throws IOException {
+        checkClosed();


Why do we need this check?

jed326 · 2025-03-06

This is a safety check: because we relinquish control of this stream to the repository implementation, we cannot always use it in a try-with-resources block.

This is how many other streams are implemented, for example: https://github.com/opensearch-project/OpenSearch/blob/73453718a1c85565a3f7e4309d6fa83bf9a30522/plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/S3RetryingInputStream.java#L157-L160
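For readers unfamiliar with the pattern, a minimal sketch of a closed-stream guard follows. `GuardedInputStream` is a hypothetical class written for this example; it is not the plugin's `DocIdInputStream` nor the linked `S3RetryingInputStream`, and the exception message is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stream for illustration: every read first verifies the stream is still open.
final class GuardedInputStream extends InputStream {
    private final byte[] data;
    private int position = 0;
    private boolean closed = false;

    GuardedInputStream(byte[] data) {
        this.data = data;
    }

    @Override
    public int read() throws IOException {
        checkClosed();
        // Return the next byte as an unsigned value, or -1 at end of stream.
        return position < data.length ? (data[position++] & 0xFF) : -1;
    }

    @Override
    public void close() {
        closed = true;
    }

    // Fails fast if a caller keeps reading after the stream was closed elsewhere.
    private void checkClosed() throws IOException {
        if (closed) {
            throw new IOException("Stream closed");
        }
    }
}
```

Because the stream's lifetime is managed by whoever receives it, the guard turns a read after close into an explicit `IOException` rather than silently returning data.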

jed326 force-pushed the integrity-checking branch from a2f08fe to 8d1669c Compare March 5, 2025 00:13

jed326 added the v3.0.0 label Mar 5, 2025

jed326 marked this pull request as ready for review March 5, 2025 00:20

jed326 requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan, luyuncheng, shatejas, 0ctopus13prime and Vikasht34 as code owners March 5, 2025 00:20

This was referenced Mar 5, 2025

[Meta] Remote Vector Index Build Component in OpenSearch Vector Engine #2391

Open

[Remote Vector Index Build] Add integrity checking for vendor implementations that do not support it out of the box #2579

Open

jed326 force-pushed the integrity-checking branch from 8d1669c to 8928fdf Compare March 5, 2025 01:03

owenhalpert approved these changes Mar 5, 2025

Gankris96 reviewed Mar 5, 2025

jed326 force-pushed the integrity-checking branch from 8928fdf to fbff2dd Compare March 5, 2025 02:40

Gankris96 approved these changes Mar 5, 2025

navneet1v reviewed Mar 5, 2025

jed326 mentioned this pull request Mar 5, 2025

[Feature Request] Upgrade aws-sdk-java to >2.30.0 opensearch-project/OpenSearch#17524

Closed

shatejas reviewed Mar 5, 2025

shatejas closed this Mar 5, 2025

shatejas reopened this Mar 5, 2025

Remove integrity checking TODO and leave to the vendor implementation

6b46325

Signed-off-by: Jay Deng <jayd0104@gmail.com>

jed326 force-pushed the integrity-checking branch from fbff2dd to 6b46325 Compare March 5, 2025 19:51

jed326 added the skip-changelog label Mar 5, 2025

jed326 changed the title ~~Add integrity checking to VectorRepositoryAccessor~~ Remove integrity checking TODO and leave up to the vendor implementation Mar 5, 2025

jmazanec15 approved these changes Mar 6, 2025

shatejas approved these changes Mar 6, 2025

navneet1v reviewed Mar 6, 2025

navneet1v approved these changes Mar 6, 2025

navneet1v merged commit 8faf388 into opensearch-project:main Mar 6, 2025
38 of 40 checks passed
