
[Feature Request] Design file integrity verification support for files downloaded from remote store #12294

Open
kotwanikunal opened this issue Feb 12, 2024 · 6 comments
Labels: enhancement, Search:Remote Search

@kotwanikunal
Member

Is your feature request related to a problem? Please describe

Files from the remote store are downloaded using the RemoteStoreFileDownloader instance

The downloader currently uses the directory copy methods to download complete files from the repository in a blocking manner (which already provides checksum support); this will change with the stream-based download support proposed in #11461
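
For context, the blocking whole-file path conceptually looks like the sketch below: Lucene's Directory.copyFrom pulls the complete file, and the footer checksum can then be recomputed on the destination. This is an illustrative sketch only (everything other than the Lucene APIs is made up), not the actual RemoteStoreFileDownloader code.

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Illustrative sketch of a blocking, whole-file download with a footer checksum check.
// Not the actual RemoteStoreFileDownloader implementation.
final class BlockingDownloadSketch {
    static void download(Directory remoteDirectory, Directory localDirectory, String fileName) throws IOException {
        // Blocking copy of the complete file from the repository-backed directory.
        localDirectory.copyFrom(remoteDirectory, fileName, fileName, IOContext.DEFAULT);

        // Recompute the Lucene checksum over the copied file and compare it to the footer;
        // throws CorruptIndexException on mismatch.
        try (IndexInput input = localDirectory.openInput(fileName, IOContext.READONCE)) {
            CodecUtil.checksumEntireFile(input);
        }
    }
}
```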

Describe the solution you'd like

As a part of #11461, we need to determine how the checksum will be calculated for the new stream-based download mechanism.

Some questions to answer:
1. How will we calculate the checksum? Parts will complete out of order, so a sequential checksum isn't an option.
2. Will the checksum be stored as part of the file (blob) metadata when uploaded?
3. Do we rely on a different checksum algorithm instead of the Lucene checksum?

Related component

Search:Remote Search

Describe alternatives you've considered

No response

Additional context

No response

@vikasvb90
Contributor

Have we looked into whether we really need this? Checksum computation is a CPU-heavy operation, much heavier than encryption/decryption itself. If the S3 SDK client is already ensuring integrity, then we don't need it.
We needed this in the upload path because we wanted to identify whether there was any corruption on disk before the upload even started.

@vikasvb90
Contributor

cc: @Bukhtawar @sachinpkale

@Bukhtawar
Collaborator

Thanks Kunal, let's evaluate the current set of checksums we have today. For instance, once all parts are completely downloaded but before we open the file, we should have Lucene checksums to validate the end-to-end integrity of the file. The direction we should think about is how to add checksums to other primitives, such as block-level downloads or any part of the file that can be independently loaded by the file system, since those (in my understanding) don't have integrity checks.
I concur with Vikas on skipping checksums at multiple levels given that they are computationally heavy, unless of course we find optimal ways to accomplish the same.

@kotwanikunal
Member Author

kotwanikunal commented Feb 20, 2024

@Bukhtawar / @vikasvb90
If my understanding is correct, we do part-based checksum verification for segments on upload to the repository, which ensures that there is no data loss or corruption with respect to uploads to the remote store.

On the download path, the current checksum mechanism is limited to the footer-based Lucene checksum check, which is performed once the entire file has been downloaded.

This behavior does not truly verify the checksum by recalculating it; it is limited to footer-based value checks and length verification, which is more of a sanity check.
With this approach we will only discover corrupt blocks/failures at query time (to verify: possibly leading to the corrupt index / shard corruption code flow).
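
To illustrate the two levels of verification (a hedged sketch using plain Lucene CodecUtil calls, not the OpenSearch code paths): retrieving the footer checksum only validates the footer structure and returns the stored value, whereas recomputing the checksum has to read the whole file.

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Sketch of the two levels of verification Lucene offers.
final class ChecksumLevelsSketch {
    // Cheap "sanity check": validates the footer structure and returns the checksum
    // value stored in the footer, without recomputing it over the file body.
    static long storedFooterChecksum(Directory dir, String file) throws IOException {
        try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
            return CodecUtil.retrieveChecksum(in);
        }
    }

    // Full verification: recomputes the checksum over the entire file and compares it
    // against the footer value, throwing CorruptIndexException on mismatch.
    static long verifiedChecksum(Directory dir, String file) throws IOException {
        try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
            return CodecUtil.checksumEntireFile(in);
        }
    }
}
```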

In the case of streams, this can possibly lead to problems, since we will download and stitch parts. We would rely on the repository implementations to only provide us with streams, and we would want to ensure the segment file matches the file on the remote store. For that, we have options like:

  • Use the part-based checksum stored on the repository objects to verify that the downloaded parts have the same checksum
  • Read the entire stitched file to calculate and match the Lucene checksum in the footer

In the case of other repository/transport interactions, we instead rely on the complete checksum mechanism, which fails fast by utilizing the following block:
https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/store/Store.java#L630-L644
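
That block is essentially a fail-fast comparison of the length and checksum recorded in the store metadata against the file that was actually written. A simplified sketch of the idea (not the exact Store.java code; the method signature and the base-36 rendering of the checksum are assumptions here):

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Simplified sketch of a fail-fast integrity check: compare the length and checksum
// recorded in the store metadata against the file on disk, and throw immediately on mismatch.
final class FailFastCheckSketch {
    static void verify(Directory dir, String fileName, long expectedLength, String expectedChecksum) throws IOException {
        try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
            if (in.length() != expectedLength) {
                throw new CorruptIndexException("length mismatch for " + fileName
                        + ": expected " + expectedLength + " but was " + in.length(), in);
            }
            // Render the footer checksum the same way the expected metadata string is assumed
            // to be rendered (base-36 here, purely for illustration).
            String actualChecksum = Long.toString(CodecUtil.retrieveChecksum(in), Character.MAX_RADIX);
            if (actualChecksum.equals(expectedChecksum) == false) {
                throw new CorruptIndexException("checksum mismatch for " + fileName
                        + ": expected " + expectedChecksum + " but was " + actualChecksum, in);
            }
        }
    }
}
```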

There is a gap in the current design with respect to downloads: the failure occurs late, at query time, instead of during the download process.

@vikasvb90
Contributor

@kotwanikunal We don't have any part-level checksum verification in the upload path. We only verify file integrity after the upload, and on successful verification we complete the upload.
Also, we don't have any part-level checksum stored anywhere, so we won't have any checksum to match a part against. But before we decide on building checksum verification into the download path, the questions we need to ask are:

  1. If we don't have any part-level checksum and we rely on the S3 client to ensure data integrity during transfer, do we need to build it?
  2. If we build it, what actions do we want to take? If there's a corrupt file in the remote store, that is a catastrophic situation.
  3. Since we are questioning the data integrity of objects retained in S3 after upload, we can talk to S3 and find out if they already ensure it, and if not, what others are doing to ensure it.

Also, I expect the S3 client already performs a data integrity check for network transfers (we can check this). If yes, then we would be building this to ensure the integrity of objects while they are stored on disk on the remote.

@Poojita-Raj
Contributor

A network-based integrity approach is our proposed design for checksum-based file integrity verification:

  1. We will be using the CRC-32 checksum (the checksum used by Lucene); a short JDK sketch follows this list.
  2. We will utilize the S3 Java SDK v2 to perform part range fetches of the object (1 part = 1 stream).
  3. We will perform checksum verification on the downloaded encrypted objects pre-decryption. This mirrors the upload path, where checksum verification takes place post-encryption.
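
As a minimal illustration of item 1 (plain JDK code, not tied to any OpenSearch class), CRC-32 is available in java.util.zip and is the same algorithm Lucene uses for its file footer checksums:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal CRC-32 example: the same algorithm Lucene uses for its file footer checksums.
final class Crc32Sketch {
    static long crc32Of(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(crc32Of("segment bytes".getBytes(StandardCharsets.UTF_8)));
    }
}
```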

The underlying assumption of this approach is that verifying the integrity of each part is equivalent to verifying the integrity of the entire file.


Approach

  1. Our first call to get the object attributes will tell us about the existing parts of the object from when it was uploaded. We will do a part range fetch with checksum mode enabled on byte ranges that align with these part boundaries (a sketch of this flow follows the list).
  2. The checksum for each part, stored on S3 and calculated by S3 with the checksum algorithm we prescribe, is now our expected checksum.
  3. The expected checksum is created the moment the object is uploaded and is preserved throughout the lifespan of the object. The same checksum is validated when the object is downloaded, providing end-to-end data integrity.
  4. We receive each part (using byte range fetches that correspond to a part boundary) with its checksum; the S3 SDK calculates the checksum on the response bytes we fetch and then verifies it against the checksum it receives.
  5. We can retry each part on mismatch instead of waiting for the entire file to download first.
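
A hedged sketch of this flow using the AWS SDK for Java v2 (the bucket/key parameters, fallback behavior, and the way the verified stream is consumed are illustrative assumptions; the real integration would go through the repository plugin interfaces):

```java
import java.io.IOException;
import java.io.OutputStream;

import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ChecksumMode;
import software.amazon.awssdk.services.s3.model.GetObjectAttributesRequest;
import software.amazon.awssdk.services.s3.model.GetObjectAttributesResponse;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.ObjectAttributes;
import software.amazon.awssdk.services.s3.model.ObjectPart;

// Sketch: discover the multipart layout of an object, then fetch each part with
// checksum mode enabled so the stored part checksum can be validated on receipt.
final class PartRangeDownloadSketch {
    static void downloadParts(S3Client s3, String bucket, String key) {
        // 1. Ask S3 for the object's part boundaries and stored checksums.
        GetObjectAttributesResponse attributes = s3.getObjectAttributes(GetObjectAttributesRequest.builder()
                .bucket(bucket)
                .key(key)
                .objectAttributes(ObjectAttributes.OBJECT_PARTS, ObjectAttributes.CHECKSUM)
                .build());

        if (attributes.objectParts() == null || !attributes.objectParts().hasParts()) {
            // Object was not uploaded via multipart upload: fall back to a single
            // whole-object fetch (one "part" of size = object size).
            fetchPart(s3, bucket, key, null);
            return;
        }

        // 2. Fetch each part individually; out-of-order completion is fine because
        //    every part carries its own expected checksum.
        for (ObjectPart part : attributes.objectParts().parts()) {
            fetchPart(s3, bucket, key, part.partNumber());
        }
    }

    private static void fetchPart(S3Client s3, String bucket, String key, Integer partNumber) {
        GetObjectRequest.Builder request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .checksumMode(ChecksumMode.ENABLED); // ask S3 to return the stored checksum for validation
        if (partNumber != null) {
            request.partNumber(partNumber); // byte range implied by the original part boundary
        }
        // If the SDK detects a checksum mismatch while the body is consumed, the call fails
        // and the caller could retry just this part instead of the whole file.
        try (ResponseInputStream<GetObjectResponse> stream = s3.getObject(request.build())) {
            // Hand the stream to the (hypothetical) writer that stitches the local segment file together.
            stream.transferTo(OutputStream.nullOutputStream());
        } catch (IOException e) {
            throw new RuntimeException("failed to read part " + partNumber + " of " + key, e);
        }
    }
}
```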

Pros

  • Higher performance from part-level, stream-based downloads.
  • Lightweight retry mechanism: we verify each part's checksum as we receive it, so a mismatch only requires re-fetching that part rather than the full file.
  • Utilizes the existing SDK to take care of checksum verification: a separate thread pool and the verification logic are abstracted away, leaving less implementation on our side.

Cons

  • Doesn’t support readBlob with random offsets. Provided offsets must align with the part boundaries for us to get the part checksums needed to verify each part's integrity.
  • This introduces tight coupling between upload and download for the speedup: if a file is not uploaded using a multipart upload, we cannot perform part-based checksum verification. Instead, we will treat the entire object as a single part range fetch of size = object size.

Cost Analysis

We will be introducing additional checksum calculations on download that do not take place right now. The current implementation only performs a nominal footer checksum check, not a checksum over the entire file contents.

This would introduce higher CPU utilization. The exact increase would need to be measured on benchmarking clusters to perform a thorough cost analysis.

Backwards compatibility

If we consider decryption, which is out of scope for this design:

  1. We would require the cryptoHandler to decrypt given an encrypted part size in bytes and an encrypted part. Currently, it requires the pre-encryption size, which we will not have access to.
  2. We will not be supporting the readBlob operation with a random byte offset:

public InputStream readBlob(String blobName, long position, long length)

@getsaurabh02 moved this from 🆕 New to Later (6 months plus) in the Search Project Board on Aug 15, 2024