artifact store: fix correctness bug; remove copy read timeout #7866
+224
−181
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This fixes the two main problems identified in #7796:
In the previous implementation, we always wrote temporary files to
tmp/{sha256}
, and returned an error if that file was already present. Due to the Drop behavior of Utf8TempFile, what I think was my attempting to simplify the code resulted in deleting another task's temporary file when we return an "already in progress" error. (This PR adds a regression test for this issue, which I've verified fails as expected on the current implementation.)This replaces the temporary file persisting logic with the atomicwrites crate as suggested by @sunshowers, which is resistant to trivial mistakes like this. AtomicFile::write takes a function which must perform the entire write operation; if the function returns an error, the file is not renamed to the final path. To make it work in an asynchronous context, AtomicFile::write is placed on a blocking task and bytes are sent over an mpsc channel. The write task also creates a checksum of the data, returning an error to prevent persisting the file if the checksum is invalid.
atomicwrites also syncs the file as well as the parent directories, meaning we can remove that code in our implementation.
Since the logic for detecting multiple in-progress transfers relied on the predictable naming of temporary files, I removed that. The original reasoning for checking this was to avoid doing unnecessary work. The current implementation allows writing an artifact that already exists though, so long as someone else isn't trying to write it at the same time. (I intentionally allowed overwriting an existing artifact to allow Nexus to work around an incorrectly-written artifact in the store, if it noticed such a thing happening; it doesn't currently.) I don't know which is better; in theory allowing multiple writers means that if one fails spuriously the other that's running could still succeed. It would be simple enough to add the proposed change in #7860 to this PR as well.
The reason the 15 second read timeout was in place for copy requests was to avoid it entirely holding up any other attempts to replicate the artifact. Since we would no longer stop these attempts the read timeout can be removed.