
[Feature Request] Design file integrity verification support for files downloaded from remote store #12294

Open
kotwanikunal opened this issue Feb 12, 2024 · 6 comments
Labels: enhancement, Search:Remote Search

@kotwanikunal
Member

Is your feature request related to a problem? Please describe

Files from the remote store are downloaded using the RemoteStoreFileDownloader instance

The downloader currently uses the directory copy methods to download complete files from the repository in a blocking manner (which already provides checksum support); this will change with the stream-based download support proposed in #11461
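
For context, the blocking whole-file path conceptually looks like the sketch below: Lucene's Directory.copyFrom pulls the complete file, and the footer checksum can then be recomputed on the destination. This is an illustrative sketch only (everything other than the Lucene APIs is made up), not the actual RemoteStoreFileDownloader code.

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Illustrative sketch of a blocking, whole-file download with a footer checksum check.
// Not the actual RemoteStoreFileDownloader implementation.
final class BlockingDownloadSketch {
    static void download(Directory remoteDirectory, Directory localDirectory, String fileName) throws IOException {
        // Blocking copy of the complete file from the repository-backed directory.
        localDirectory.copyFrom(remoteDirectory, fileName, fileName, IOContext.DEFAULT);

        // Recompute the Lucene checksum over the copied file and compare it to the footer;
        // throws CorruptIndexException on mismatch.
        try (IndexInput input = localDirectory.openInput(fileName, IOContext.READONCE)) {
            CodecUtil.checksumEntireFile(input);
        }
    }
}
```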

Describe the solution you'd like

As a part of #11461, we need to determine how the checksum will be calculated for the new stream-based download mechanism.

Some questions to answer:
1. How will we calculate the checksum? Parts will complete out of order, so a sequential checksum isn't an option.
2. Will the checksum be stored as part of the file (blob) metadata when uploaded?
3. Do we rely on a different checksum algorithm instead of the Lucene checksum?

Related component

Search:Remote Search

Describe alternatives you've considered

No response

Additional context

No response

@vikasvb90
Contributor

Have we looked into whether we really need this? Checksum computation is a CPU-heavy operation, much heavier than encryption/decryption itself. If the S3 SDK client is already ensuring integrity, then we don't need it.
We needed this in the upload path because we wanted to identify whether there was any corruption on disk before the upload even started.

@vikasvb90
Contributor

cc: @Bukhtawar @sachinpkale

@Bukhtawar
Collaborator

Thanks Kunal, let's evaluate the current set of checksums we have today. For instance, once all parts are completely downloaded but before we open the file, we should have Lucene checksums to validate the end-to-end integrity of the file. The direction we should think about is how to add checksums to other primitives, such as block-level downloads or any part of the file that can be independently loaded by the file system, since those (in my understanding) don't have integrity checks.
I concur with Vikas on skipping checksums at multiple levels given that they are computationally heavy, unless of course we find optimal ways to accomplish the same.

@kotwanikunal
Member Author

kotwanikunal commented Feb 20, 2024

@Bukhtawar / @vikasvb90
If my understanding is correct, we do part-based checksum verification for segments on upload to the repository, which ensures that there is no data loss or corruption with respect to uploads to the remote store.

On the download path, the current checksum mechanism is limited to the footer-based Lucene checksum check, which is performed once the entire file has been downloaded.

This behavior does not truly verify the checksum by recalculating it; it is limited to footer-based value checks and length verification, which is more of a sanity check.
With this approach we will only discover corrupt blocks/failures at query time (to verify: possibly leading to the corrupt index / shard corruption code flow).
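
To illustrate the two levels of verification (a hedged sketch using plain Lucene CodecUtil calls, not the OpenSearch code paths): retrieving the footer checksum only validates the footer structure and returns the stored value, whereas recomputing the checksum has to read the whole file.

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Sketch of the two levels of verification Lucene offers.
final class ChecksumLevelsSketch {
    // Cheap "sanity check": validates the footer structure and returns the checksum
    // value stored in the footer, without recomputing it over the file body.
    static long storedFooterChecksum(Directory dir, String file) throws IOException {
        try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
            return CodecUtil.retrieveChecksum(in);
        }
    }

    // Full verification: recomputes the checksum over the entire file and compares it
    // against the footer value, throwing CorruptIndexException on mismatch.
    static long verifiedChecksum(Directory dir, String file) throws IOException {
        try (IndexInput in = dir.openInput(file, IOContext.READONCE)) {
            return CodecUtil.checksumEntireFile(in);
        }
    }
}
```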

In the case of streams, this can possibly lead to problems, since we will download and stitch parts. We would rely on the repository implementations to only provide us with streams, and we would want to ensure the segment file matches the file on the remote store. For that, we have options like:

  • Use the part-based checksum stored on the repository objects to verify that the downloaded parts have the same checksum
  • Read the entire stitched file to calculate and match the Lucene checksum in the footer

In the case of other repository/transport interactions, we instead rely on the complete checksum mechanism, which fails fast by utilizing the following block:
https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/store/Store.java#L630-L644
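
That block is essentially a fail-fast comparison of the length and checksum recorded in the store metadata against the file that was actually written. A simplified sketch of the idea (not the exact Store.java code; the method signature and the base-36 rendering of the checksum are assumptions here):

```java
import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Simplified sketch of a fail-fast integrity check: compare the length and checksum
// recorded in the store metadata against the file on disk, and throw immediately on mismatch.
final class FailFastCheckSketch {
    static void verify(Directory dir, String fileName, long expectedLength, String expectedChecksum) throws IOException {
        try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
            if (in.length() != expectedLength) {
                throw new CorruptIndexException("length mismatch for " + fileName
                        + ": expected " + expectedLength + " but was " + in.length(), in);
            }
            // Render the footer checksum the same way the expected metadata string is assumed
            // to be rendered (base-36 here, purely for illustration).
            String actualChecksum = Long.toString(CodecUtil.retrieveChecksum(in), Character.MAX_RADIX);
            if (actualChecksum.equals(expectedChecksum) == false) {
                throw new CorruptIndexException("checksum mismatch for " + fileName
                        + ": expected " + expectedChecksum + " but was " + actualChecksum, in);
            }
        }
    }
}
```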

There is a gap in the current design with respect to downloads: the failure occurs late, at query time, instead of during the download process.

@vikasvb90
Contributor

@kotwanikunal We don't have any part-level checksum verification in the upload path. We only verify file integrity after the upload, and on successful verification we complete the upload.
Also, we don't have any part-level checksum stored anywhere, so we won't have any checksum to match a part against. But before we decide on building checksum verification into the download path, the questions we need to ask are:

  1. If we don't have any part-level checksum and we rely on the S3 client to ensure data integrity during transfer, do we need to build it?
  2. If we build it, what actions do we want to take? If there's a corrupt file in the remote store, that is a catastrophic situation.
  3. Since we are questioning the data integrity of objects retained in S3 after upload, we can talk to S3 and find out if they already ensure it, and if not, what others are doing to ensure it.

Also, I expect the S3 client already performs a data integrity check for network transfers (we can check this). If yes, then we would be building this to ensure the integrity of objects while they are stored on disk on the remote.

@Poojita-Raj
Contributor

A network-based integrity approach is our proposed design for checksum-based file integrity verification:

  1. We will be using the CRC-32 checksum (the checksum used by Lucene); a short JDK sketch follows this list.
  2. We will utilize the S3 Java SDK v2 to perform part range fetches of the object (1 part = 1 stream).
  3. We will perform checksum verification on the downloaded encrypted objects pre-decryption. This mirrors the upload path, where checksum verification takes place post-encryption.
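
As a minimal illustration of item 1 (plain JDK code, not tied to any OpenSearch class), CRC-32 is available in java.util.zip and is the same algorithm Lucene uses for its file footer checksums:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal CRC-32 example: the same algorithm Lucene uses for its file footer checksums.
final class Crc32Sketch {
    static long crc32Of(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.println(crc32Of("segment bytes".getBytes(StandardCharsets.UTF_8)));
    }
}
```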

The underlying assumption of this approach is that verifying the integrity of each part is equivalent to verifying the integrity of the entire file.


Approach

  1. Our first call to get the object attributes will tell us about the existing parts of the object from when it was uploaded. We will do a part range fetch with checksum mode enabled on byte ranges that align with these part boundaries (a sketch of this flow follows the list).
  2. The checksum for each part, stored on S3 and calculated by S3 with the checksum algorithm we prescribe, is now our expected checksum.
  3. The expected checksum is created the moment the object is uploaded and is preserved throughout the lifespan of the object. The same checksum is validated when the object is downloaded, providing end-to-end data integrity.
  4. We receive each part (using byte range fetches that correspond to a part boundary) with its checksum; the S3 SDK calculates the checksum on the response bytes we fetch and then verifies it against the checksum it receives.
  5. We can retry each part on mismatch instead of waiting for the entire file to download first.
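
A hedged sketch of this flow using the AWS SDK for Java v2 (the bucket/key parameters, fallback behavior, and the way the verified stream is consumed are illustrative assumptions; the real integration would go through the repository plugin interfaces):

```java
import java.io.IOException;
import java.io.OutputStream;

import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ChecksumMode;
import software.amazon.awssdk.services.s3.model.GetObjectAttributesRequest;
import software.amazon.awssdk.services.s3.model.GetObjectAttributesResponse;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.ObjectAttributes;
import software.amazon.awssdk.services.s3.model.ObjectPart;

// Sketch: discover the multipart layout of an object, then fetch each part with
// checksum mode enabled so the stored part checksum can be validated on receipt.
final class PartRangeDownloadSketch {
    static void downloadParts(S3Client s3, String bucket, String key) {
        // 1. Ask S3 for the object's part boundaries and stored checksums.
        GetObjectAttributesResponse attributes = s3.getObjectAttributes(GetObjectAttributesRequest.builder()
                .bucket(bucket)
                .key(key)
                .objectAttributes(ObjectAttributes.OBJECT_PARTS, ObjectAttributes.CHECKSUM)
                .build());

        if (attributes.objectParts() == null || !attributes.objectParts().hasParts()) {
            // Object was not uploaded via multipart upload: fall back to a single
            // whole-object fetch (one "part" of size = object size).
            fetchPart(s3, bucket, key, null);
            return;
        }

        // 2. Fetch each part individually; out-of-order completion is fine because
        //    every part carries its own expected checksum.
        for (ObjectPart part : attributes.objectParts().parts()) {
            fetchPart(s3, bucket, key, part.partNumber());
        }
    }

    private static void fetchPart(S3Client s3, String bucket, String key, Integer partNumber) {
        GetObjectRequest.Builder request = GetObjectRequest.builder()
                .bucket(bucket)
                .key(key)
                .checksumMode(ChecksumMode.ENABLED); // ask S3 to return the stored checksum for validation
        if (partNumber != null) {
            request.partNumber(partNumber); // byte range implied by the original part boundary
        }
        // If the SDK detects a checksum mismatch while the body is consumed, the call fails
        // and the caller could retry just this part instead of the whole file.
        try (ResponseInputStream<GetObjectResponse> stream = s3.getObject(request.build())) {
            // Hand the stream to the (hypothetical) writer that stitches the local segment file together.
            stream.transferTo(OutputStream.nullOutputStream());
        } catch (IOException e) {
            throw new RuntimeException("failed to read part " + partNumber + " of " + key, e);
        }
    }
}
```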

Pros

  • Higher performance from part-level, stream-based downloads.
  • Lightweight retry mechanism: we verify each part's checksum as we receive it, so a mismatch only requires re-fetching that part rather than the full file.
  • Utilizes the existing SDK to take care of checksum verification: a separate thread pool and the verification logic are abstracted away, leaving less implementation on our side.

Cons

  • Doesn’t support readBlob with random offsets. Provided offsets must align with the part boundaries for us to get the part checksums needed to verify each part's integrity.
  • This introduces tight coupling between upload and download for the speedup: if a file is not uploaded using a multipart upload, we cannot perform part-based checksum verification. Instead, we will treat the entire object as a single part range fetch of size = object size.

Cost Analysis

We will be introducing additional checksum calculations on download that do not take place right now. The current implementation only performs a nominal footer checksum check, not a checksum over the entire file contents.

This would introduce higher CPU utilization. The exact increase would need to be measured on benchmarking clusters to perform a thorough cost analysis.

Backwards compatibility

If we consider decryption, which is out of scope for this design:

  1. We would require the cryptoHandler to decrypt given an encrypted part size in bytes and an encrypted part. Currently, it requires the pre-encryption size, which we will not have access to.
  2. We will not be supporting the readBlob operation with a random byte offset:

public InputStream readBlob(String blobName, long position, long length)

@getsaurabh02 moved this from 🆕 New to Later (6 months plus) in the Search Project Board on Aug 15, 2024