TUF artifact distribution errors in sled agent log #7796
I later found out that this system ran out of disk space. It's conceivable that that could cause a bunch of these issues:
The failure to rename the temporary file is weird. We create the files here: omicron/sled-agent/src/artifact_store.rs, lines 281 to 320 in 592e8b0. Of course, we only get back a handle to the temporary file (omicron/sled-agent/src/artifact_store.rs, lines 485 to 513 in 592e8b0), so any errors in writing to the file will drop that handle (omicron/sled-agent/src/artifact_store.rs, lines 529 to 577 in 592e8b0).
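To illustrate the pattern being described, here's a minimal sketch using the tempfile crate's NamedTempFile as a stand-in for whatever temp-file type artifact_store.rs actually uses: dropping the guard deletes the file on disk, so a later rename of that path fails with "No such file or directory"; only persisting it keeps the file.

```rust
// Sketch only: NamedTempFile standing in for the artifact store's temp-file
// type. The point is the drop behavior: if the guard is dropped (e.g. because
// a write failed), the file on disk is removed, and a later rename of that
// path fails with ENOENT.
use std::io::Write;
use std::path::Path;

use tempfile::NamedTempFile;

fn write_then_persist(data: &[u8], final_path: &Path) -> std::io::Result<()> {
    let dir = final_path.parent().expect("final path has a parent");
    let mut temp = NamedTempFile::new_in(dir)?;

    // If this write fails and we return early, `temp` is dropped here and the
    // temporary file is deleted from disk.
    temp.write_all(data)?;

    // Renaming into place consumes the guard, so the file is no longer
    // deleted on drop.
    temp.persist(final_path).map_err(|e| e.error)?;
    Ok(())
}
```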
I have since torn down this a4x2 and started another one and uploaded a TUF repo. This one has not run out of disk space. I'm seeing these errors:
They're all for that one artifact, d7cd26474af78aeef326f294ad0bb9238d07598262929022168ed8324488984f, and that one isn't in /pool/int/*/update. So it seems to persistently have trouble with that one artifact. Only trouble is: I don't see that shasum in my TUF repo at all. There's no file in "targets" with that name and no file has that shasum. So what is it?
Knowing that, I found it inside my TUF repo: inside targets/2cd3225eff353a38355b46967d7c974bb4c71e2ee0c2bd5309b0ce040dbea65f.host-host-13.0.0-0.ci+git3802ae4736a.tar.gz, it's the shasum of image/zfs.img. That does appear to be the largest individual artifact:
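(For reference, a sketch of how one could do this search programmatically — a hypothetical standalone helper, not something from omicron, assuming the tar, flate2, sha2, and hex crates — which hashes every entry inside one of the targets/*.tar.gz files:)

```rust
// Hypothetical helper (not part of omicron): print the SHA-256 of every entry
// inside a gzipped tarball from the TUF repo's targets/ directory, so a
// mystery digest can be matched to a file like image/zfs.img.
// Assumed dependencies: tar, flate2, sha2, hex.
use std::{env, fs::File, io};

use flate2::read::GzDecoder;
use sha2::{Digest, Sha256};

fn main() -> io::Result<()> {
    let path = env::args().nth(1).expect("usage: hash-entries <targets/....tar.gz>");
    let mut archive = tar::Archive::new(GzDecoder::new(File::open(path)?));
    for entry in archive.entries()? {
        let mut entry = entry?;
        let name = entry.path()?.into_owned();
        let mut hasher = Sha256::new();
        // Stream each entry through the hasher without extracting to disk.
        io::copy(&mut entry, &mut hasher)?;
        println!("{}  {}", hex::encode(hasher.finalize()), name.display());
    }
    Ok(())
}
```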
I wonder if we're using a timeout that's not taking into account data transfer activity (e.g., an end-to-end request timeout) and isn't big enough for an artifact this size? (If it's trying and failing repeatedly, that would also explain why my a4x2 filled up in the first place. In oxidecomputer/testbed#100 I had found that the M.2 zpools were much smaller than the files used to back them, which would make sense if something was rewriting them constantly, but I couldn't figure out what was doing that.)
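To make the distinction concrete, here's a sketch of an end-to-end request timeout versus a per-read timeout that only fires when data stops flowing. This is my own illustration, assuming a reqwest/tokio-style client (with reqwest's "stream" feature, plus futures and anyhow), not the actual sled-agent client code.

```rust
// Illustration only, not the sled-agent implementation.
use std::time::Duration;

use futures::StreamExt;

async fn fetch_artifact(url: &str) -> anyhow::Result<Vec<u8>> {
    // An end-to-end timeout: the entire request, body transfer included, must
    // finish within 60 seconds, even if bytes are arriving steadily. A
    // multi-gigabyte artifact over a slow softnpu link can't meet this.
    let _strict_client = reqwest::Client::builder()
        .timeout(Duration::from_secs(60))
        .build()?;

    // A per-read timeout: each chunk must arrive within 15 seconds, but a
    // slow-yet-healthy transfer can run as long as it needs.
    let client = reqwest::Client::new();
    let mut stream = client.get(url).send().await?.error_for_status()?.bytes_stream();
    let mut body = Vec::new();
    while let Some(chunk) = tokio::time::timeout(Duration::from_secs(15), stream.next())
        .await
        .map_err(|_| anyhow::anyhow!("no data received for 15 seconds"))?
        .transpose()?
    {
        body.extend_from_slice(&chunk);
    }
    Ok(body)
}
```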
I spotted Sled Agent writing the temporary file for this artifact and started watching it:
along with what you'd expect in the log around 16:53:03:
From this I think we can infer:
What's surprising then is that the next largest artifact, 311880120 bytes, ever works in 60 seconds. It seems like that should take 22 minutes, too. I wonder if I can see how long it took? That was the clickhouse artifact, which is e9cbe9cd9b4cbb246d64c50f88dd385a58b19a9749d704f57045b10184c95136:
That's on disk here, but much smaller than it should be:
Comparing its creation and modification timestamps:
It looks like it maybe took over 3 minutes (unless something modified it a few minutes after it was transferred). I looked for log messages about this one and did see one of the rename errors, for what it's worth:
along with a bunch of:
I'm not sure what's going on here, in terms of: why we made these requests multiple times, why the renames happened, and, if there's a 60s timeout here as it sort of looked like above, how this file managed to get transferred at all.
I don't see any 60-second timeout but there does appear to be a 15-second read timeout in the artifact store. I wonder if this is bursty and we hit a 15-second gap with no reads? Here's a full cycle:
Here's the delta in each second:
It's definitely bursty, but hard to say if it's enough to be a problem here.
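A sketch of the measurement being described, as a hypothetical standalone watcher (not part of sled-agent): stat the growing temp file once a second and print the per-second delta, to see whether any gap gets anywhere near 15 seconds.

```rust
// Hypothetical watcher, not part of sled-agent: poll the growing temporary
// file once a second and print how many bytes arrived in that second. A run
// of roughly 15 consecutive "+0" seconds would line up with a 15-second read
// timeout firing on the receiving side.
use std::{path::Path, thread, time::Duration};

fn watch_growth(path: &Path) {
    let mut prev: u64 = 0;
    loop {
        let len = std::fs::metadata(path).map(|m| m.len()).unwrap_or(0);
        println!("{len:>12} bytes  (+{} this second)", len.saturating_sub(prev));
        prev = len;
        thread::sleep(Duration::from_secs(1));
    }
}
```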
Hmmmmm. I don't know if we want to make the timeout longer in order to deal with softnpu being slow. I guess I could add a configuration knob for the timeout. The goal of the timeout is to prevent a single broken sled from stopping artifact replication to all the other sleds, since Nexus only sends N requests at a time.
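A sketch of what such a knob could look like (illustrative only; the field name, the serde/humantime-serde parsing, and the 15-second default are my guesses, not omicron's actual config):

```rust
// Illustrative config sketch, not omicron's actual sled-agent config.
// Assumes serde and humantime-serde for parsing something like
// `artifact_read_timeout = "15s"` from a TOML config file.
use std::time::Duration;

use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct ArtifactStoreConfig {
    /// How long a single read from a peer may stall before the transfer is
    /// abandoned. Keeping this bounded prevents one wedged sled from pinning
    /// Nexus's limited number of concurrent replication requests forever.
    #[serde(with = "humantime_serde", default = "default_read_timeout")]
    pub artifact_read_timeout: Duration,
}

fn default_read_timeout() -> Duration {
    Duration::from_secs(15)
}
```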
This could be a race with the filesystem. I attempted to make it so that only one task at a time could have a temporary file for a given artifact (omicron/sled-agent/src/artifact_store.rs, lines 360 to 371 in 25a7855), but maybe there are consistency issues I'm not aware of. It's possible that task A started writing a file, task B started writing again and overwrote it, and then task A finished and persisted task B's file to the real location. Maybe we should have an in-memory list of in-progress files, and not rely on the filesystem as a "lock"?
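A sketch of the "in-memory list of in-progress files" idea (types and names are illustrative, not a proposed patch): a mutex-guarded set keyed by artifact hash, where the guard removes its entry on drop so a failed write doesn't leave the artifact permanently "locked".

```rust
// Illustrative sketch of tracking in-progress artifact writes in memory
// instead of using the filesystem as a lock.
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type ArtifactHash = [u8; 32];

#[derive(Clone, Default)]
struct InProgress(Arc<Mutex<HashSet<ArtifactHash>>>);

struct InProgressGuard {
    set: InProgress,
    hash: ArtifactHash,
}

impl InProgress {
    /// Returns a guard if no other task is writing this artifact, else None.
    fn try_start(&self, hash: ArtifactHash) -> Option<InProgressGuard> {
        let mut set = self.0.lock().unwrap();
        if set.insert(hash) {
            Some(InProgressGuard { set: self.clone(), hash })
        } else {
            None
        }
    }
}

impl Drop for InProgressGuard {
    fn drop(&mut self) {
        // Release the "lock" even if the write failed or panicked.
        self.set.0.lock().unwrap().remove(&self.hash);
    }
}
```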
Yikes, you're right -- not all of these files have the right shasum:
These are wrong:
Yeahhhhh that's a bug alright! I guess the question is whether I'm holding POSIX filesystem semantics wrong or if we're implementing those wrong? I'm guessing it's the former.
In chat today @jgallagher pointed out the likely culprit for why we're getting truncated files in the artifact store: the way we're creating this file (omicron/sled-agent/src/artifact_store.rs, lines 360 to 371 in d931394).
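(To spell out the hazard with plain std APIs — this is an illustration of the failure mode, not the actual code at d931394: an open mode that truncates an existing file lets a second writer empty out a temp file the first writer still has open, whereas create_new turns the collision into an error.)

```rust
// Illustration of the hazard, using only std::fs::OpenOptions.
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

fn open_temp_racy(path: &Path) -> io::Result<File> {
    // Succeeds even if another task is mid-write: the existing file is
    // truncated to zero bytes out from under that task, and whichever writer
    // renames last can persist a truncated artifact.
    OpenOptions::new().write(true).create(true).truncate(true).open(path)
}

fn open_temp_exclusive(path: &Path) -> io::Result<File> {
    // Fails with ErrorKind::AlreadyExists if the temp file is already there,
    // so only one writer at a time can hold it and the caller can back off.
    OpenOptions::new().write(true).create_new(true).open(path)
}
```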
I was able to write a regression test for this to help me figure out the scenario where we'd hit this. It's pretty weird, as it requires four attempts to write an artifact:
Why would there be four writers with only three Nexus instances? The artifact replication task (in theory) does not attempt to write an artifact to a sled more than once; either it is PUT directly by Nexus or it is told to get the artifact from another sled. I suppose that the severely poor network conditions of an a4x2 could result in:
Since I have a regression test and a fix, I'll put a PR up for that. In that PR I'm also getting rid of the …
I set up a4x2 (based on main + #7781), applied the workaround so that Nexus trusts root.json, and uploaded a TUF repo. This mostly worked. But I later found a bunch of errors in the sled agent log:
So far I've only looked into the first one. It's timing out trying to talk to sled g1:
That sled is up. Neither the sled nor its sled agent have restarted since well before this:
grep'ing for that first artifact id in g1's sled agent logs I see lots of requests for it that seem normal. The first one after the time of that error is:
That is from g0, based on the IP. It finished 5 minutes later than the error reported on g0, but from the latency we can see that it was quick. I'm not sure it's the same request at all. If it is, it was queued somewhere for 5 minutes.
I'm not sure what to make of the "Failed to rename temporary file" due to "No such file or directory (os error 2)". I was doing a bunch of other stuff here, but I don't think it was relevant and most of it was after these early errors.