Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

artifact store: fix correctness bug; remove copy read timeout #7866

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

iliana
Copy link
Contributor

@iliana iliana commented Mar 25, 2025

This fixes the two main problems identified in #7796:

  • In some situations a truncated artifact can be persisted to the artifact store. (yikes!)
  • Uploading a real TUF repo to an a4x2 cluster results in lots of read timeouts.

In the previous implementation, we always wrote temporary files to tmp/{sha256}, and returned an error if that file was already present. Due to the Drop behavior of Utf8TempFile, what I think was my attempting to simplify the code resulted in deleting another task's temporary file when we return an "already in progress" error. (This PR adds a regression test for this issue, which I've verified fails as expected on the current implementation.)

This replaces the temporary file persisting logic with the atomicwrites crate as suggested by @sunshowers, which is resistant to trivial mistakes like this. AtomicFile::write takes a function which must perform the entire write operation; if the function returns an error, the file is not renamed to the final path. To make it work in an asynchronous context, AtomicFile::write is placed on a blocking task and bytes are sent over an mpsc channel. The write task also creates a checksum of the data, returning an error to prevent persisting the file if the checksum is invalid.

atomicwrites also syncs the file as well as the parent directories, meaning we can remove that code in our implementation.

Since the logic for detecting multiple in-progress transfers relied on the predictable naming of temporary files, I removed that. The original reasoning for checking this was to avoid doing unnecessary work. The current implementation allows writing an artifact that already exists though, so long as someone else isn't trying to write it at the same time. (I intentionally allowed overwriting an existing artifact to allow Nexus to work around an incorrectly-written artifact in the store, if it noticed such a thing happening; it doesn't currently.) I don't know which is better; in theory allowing multiple writers means that if one fails spuriously the other that's running could still succeed. It would be simple enough to add the proposed change in #7860 to this PR as well.

The reason the 15 second read timeout was in place for copy requests was to avoid it entirely holding up any other attempts to replicate the artifact. Since we would no longer stop these attempts the read timeout can be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TUF artifact distribution errors in sled agent log
1 participant