[Feature Request] Design file integrity verification support for files downloaded from remote store #12294
Comments
Have we looked into whether we really need it or not? Checksum computation is a heavy CPU operation, much heavier than encryption/decryption itself. If the S3 SDK client is already making sure of integrity, then we don't need it.
Thanks Kunal, let's evaluate the current set of checksums we have today. For instance, once all parts are completely downloaded but before we open the file, we should have Lucene checksums to validate the E2E integrity of the file. The direction we should think in is how to add checksums to other primitives, like block-level downloads or any part of the file that can be independently loaded by the file system, since those, in my understanding, don't have integrity checks.
@Bukhtawar / @vikasvb90 On the download path, the current checksum mechanism is limited to the footer-based Lucene checksum, which is applied once the entire file is downloaded. This does not truly verify the checksum by recalculating it; it is limited to footer-based value checks and length verification, which is more of a sanity check. In the case of streams, this can possibly lead to problems since we will download and stitch parts. We would rely on the repository implementations to only provide us with streams and would want to ensure the segment file matches the file on the remote store, and for that we have a few options.
In the case of other repository/transport interactions, we instead rely on the complete checksum mechanism, which fails fast by verifying the checksum of the entire file. There is a gap in the current design w.r.t. downloads, where the failure occurs late in the query cycle instead of during the download process.
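For reference, a minimal sketch of the difference between the footer-only sanity check and the full fail-fast verification, using Lucene's `CodecUtil` APIs. This is an illustration, not the actual OpenSearch code path; the directory and file name are placeholders.

```java
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

import java.io.IOException;

final class ChecksumChecks {

    // Footer-only check: reads and validates the codec footer (magic, algorithm id,
    // stored checksum value) but does NOT recompute the checksum over the file body.
    static long footerSanityCheck(Directory dir, String fileName) throws IOException {
        try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
            return CodecUtil.retrieveChecksum(in);
        }
    }

    // Full verification: recomputes the checksum over the entire file and compares
    // it against the footer value, throwing CorruptIndexException on mismatch.
    // This is the fail-fast behavior referenced above.
    static long verifyEntireFile(Directory dir, String fileName) throws IOException {
        try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
            return CodecUtil.checksumEntireFile(in);
        }
    }
}
```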
@kotwanikunal We don't have any part-level checksum verification in the upload path. We only verify file integrity after the upload, and on successful verification we complete the upload.
Also, I guess the S3 client must already be performing a data integrity check for network transfers (we can check this). If yes, then we would be building this to ensure the integrity of objects while they were stored on disk on the remote.
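One way to check that premise, assuming the AWS SDK for Java v2: response checksum validation can be requested explicitly on downloads via `ChecksumMode`. The bucket and key below are placeholders, and the SDK can only validate a checksum if one was recorded at upload time.

```java
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ChecksumMode;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public final class S3ChecksumProbe {
    public static void main(String[] args) throws Exception {
        try (S3Client s3 = S3Client.create()) {
            GetObjectRequest request = GetObjectRequest.builder()
                .bucket("my-remote-store-bucket")   // placeholder bucket
                .key("indices/segment_file")        // placeholder key
                .checksumMode(ChecksumMode.ENABLED) // ask the SDK to validate the stored checksum
                .build();
            try (ResponseInputStream<GetObjectResponse> object = s3.getObject(request)) {
                // With ChecksumMode.ENABLED, the SDK validates the object's checksum
                // (if present) as the stream is fully consumed.
                object.readAllBytes();
            }
        }
    }
}
```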
Is your feature request related to a problem? Please describe
Files from the remote store are downloaded using the
RemoteStoreFileDownloader
instance. This currently uses directory copy methods to download complete files from the repository in a blocking manner (which already provides checksum support); this will change with the support for #11461.
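For context, a hedged sketch of the blocking, whole-file copy described above, using Lucene's `Directory.copyFrom`. This is illustrative only, not the actual `RemoteStoreFileDownloader` code; the `Directory` instances are placeholders.

```java
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;

import java.io.IOException;

final class BlockingDownload {
    // Blocking, whole-file copy from a repository-backed directory to the local
    // directory; footer-based checksum verification can then run on the result.
    static void downloadFile(Directory remote, Directory local, String fileName) throws IOException {
        local.copyFrom(remote, fileName, fileName, IOContext.DEFAULT);
    }
}
```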
Describe the solution you'd like
As a part of #11461, we need to determine how the checksum will be calculated for the new stream-based download mechanism.
Some questions to answer are -
1. How will we calculate the checksum? Parts will complete out of order, so a sequential checksum isn't an option (one possible approach is sketched after this list)
2. Will it be stored as a part of the file (Blob) metadata when uploaded?
3. Do we rely on a different checksum algorithm instead of Lucene checksum?
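
A hedged sketch of one possible answer to question 1: keep an independent checksum per part rather than a single sequential checksum, so parts can be verified in whatever order they finish downloading. The class and method names are illustrative, not from the OpenSearch codebase, and the expected per-part values are assumed to come from blob metadata written at upload time (question 2).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.CRC32;

public final class PartChecksumVerifier {
    // Expected CRC32 per part number, e.g. parsed from blob metadata.
    private final Map<Integer, Long> expected;
    private final Map<Integer, Long> verified = new ConcurrentHashMap<>();

    public PartChecksumVerifier(Map<Integer, Long> expected) {
        this.expected = expected;
    }

    // Called from any download thread as soon as a part completes,
    // regardless of part ordering.
    public void onPartDownloaded(int partNumber, byte[] partBytes) {
        CRC32 crc = new CRC32();
        crc.update(partBytes, 0, partBytes.length);
        long value = crc.getValue();
        Long want = expected.get(partNumber);
        if (want == null || want != value) {
            throw new IllegalStateException(
                "Checksum mismatch for part " + partNumber + ": expected " + want + ", got " + value);
        }
        verified.put(partNumber, value);
    }

    // True once every expected part has been independently verified.
    public boolean allPartsVerified() {
        return verified.keySet().equals(expected.keySet());
    }
}
```

Storing per-part checksums as blob metadata would also let verification fail fast on the first corrupt part, rather than late in the query cycle as described above.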
Related component
Search:Remote Search
Describe alternatives you've considered
No response
Additional context
No response