worker agent experienced a fatal error; aborting job
#62
Comments
Yes, I believe that one is related to occasional NVMe stalls that we've seen since moving to the Nitro AWS stuff. I/O just sort of stops, the machine panics after the I/O deadman fires (~16 minutes later), and when it starts up the buildomat agent realises it has been restarted and aborts the job.
Ah. This has reactivated a memory:
If I recall correctly, Linux's default timeout (not that that would apply here) is not long enough to deal with the fact that the underlying storage interface on Nitro hardware might crash or be updated, and it can take longer than 30 seconds to reconcile the in-flight I/O.
That's distressingly long! The deadman is 1000 seconds, though, or 16 of your earth minutes. I'm not sure it's necessarily their fault and not something we are flubbing in the driver so far. The challenge is that it's hard to make a crash dump when your disk won't speak to you!
There is a second potential source of this sort of issue which has started to crop up lately, as seen in #67. In that case it seems to be related to explosive memory pressure during the build, which can apparently keep the agent from pinging the server for 10+ minutes, which is pretty alarming. I need to improve the logging in the agent, and probably add a sort of I/O watchdog, to get more data on the potential causes of this class of failure.
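As a rough illustration of the kind of I/O watchdog I mean, here is a minimal sketch (the probe path, interval, and threshold are all made up for the example, and none of this is actual buildomat agent code): a background thread writes and syncs a tiny probe file on a timer and logs loudly whenever the round trip takes longer than expected.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical I/O watchdog: periodically write-and-sync a small probe file
/// and complain if the round trip is suspiciously slow.  The path, interval,
/// and threshold are placeholders, not real buildomat configuration.
fn spawn_io_watchdog(probe_path: &'static str, warn_after: Duration) {
    thread::spawn(move || loop {
        let start = Instant::now();
        let res = OpenOptions::new()
            .create(true)
            .write(true)
            .open(probe_path)
            .and_then(|mut f| {
                f.write_all(b"watchdog probe\n")?;
                f.sync_all()
            });
        let elapsed = start.elapsed();

        match res {
            Ok(()) if elapsed > warn_after => eprintln!(
                "io watchdog: probe took {:?} (threshold {:?})",
                elapsed, warn_after
            ),
            Ok(()) => {}
            Err(e) => eprintln!("io watchdog: probe failed after {:?}: {}", elapsed, e),
        }

        thread::sleep(Duration::from_secs(10));
    });
}

fn main() {
    spawn_io_watchdog("/tmp/io-watchdog-probe", Duration::from_secs(5));
    // The real agent would carry on with its job here; this sleep stands in for it.
    thread::sleep(Duration::from_secs(60));
}
```

Even if it never catches anything in the act, having those timestamps in the job log would make it much easier to line a stall up with the eventual deadman panic.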
The second source seems to be happening rather consistently on the gimlet-EVT22200007-propolis factory. I've just retried an Omicron job twice and keep hitting it consistently. What are its cores/memory compared to the aws factory? In local builds there's definitely something I'd describe as "explosive memory pressure" when Cargo starts running the linker for any binaries using the omicron-nexus crate, which includes all of the test binaries.
Yes, it's been occurring with increasing frequency. I'm working on flight recorder data to try and characterise it more, but so far it definitely seems like a boatload of linkers (3-4 at least) running all at once, each growing to 4-8GB+ maybe. The core count and RAM size are the same in AWS and under Propolis. My current working theory is that I/O is substantially more available on the Gimlet, which actually improves the efficiency of the build enough that we race towards this condition with too many concurrent linkers, and then we're able to page just fast enough to get us into serious trouble. That trouble lasts for quite some time but eventually finishes, and while it's going on we can't phone in to buildomat for ten minutes for whatever reason, which causes us to abandon the worker. It's actually difficult to even get in with SSH when it's in the hole, as I recall.
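On the phone-home side, a sketch of the sort of heartbeat lateness logging that might help characterise this (the interval, threshold, and `ping_server()` stub are invented for illustration, not the agent's real API): record how far past its deadline each ping actually fires, which would distinguish "the process was starved for ten minutes" from "the network or server was the problem".

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for the agent's real ping to the buildomat server; invented here
/// purely so the sketch compiles.
fn ping_server() -> Result<(), String> {
    Ok(())
}

fn main() {
    let interval = Duration::from_secs(5); // hypothetical heartbeat interval
    let mut next = Instant::now() + interval;

    loop {
        thread::sleep(next.saturating_duration_since(Instant::now()));

        // How far past the intended deadline did we actually wake up?  A large
        // value here, with the network otherwise healthy, points at CPU/memory
        // starvation on the worker rather than anything server-side.
        let late = Instant::now().saturating_duration_since(next);
        if late > Duration::from_secs(30) {
            eprintln!("heartbeat: woke {:?} late (paging? memory pressure?)", late);
        }

        let start = Instant::now();
        if let Err(e) = ping_server() {
            eprintln!("heartbeat: ping failed after {:?}: {}", start.elapsed(), e);
        }

        next += interval;
    }
}
```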
https://buildomat.eng.oxide.computer/wg/0/details/01J6JDMDGGV0TWS2FZ4NNYK5KG/IVBVeOSe2r64WYXoXFus5vZACPV8GDTwMFM49pFiK0KwvToN/01J6JDNAR2GCC48VK3B433REY1#S4121 (https://github.com/oxidecomputer/omicron/runs/29494306519)
I've seen this a few times but apparently have never filed an issue. Are we able to figure out what the fatal error was?