worker agent experienced a fatal error; aborting job #62

Open · iliana opened this issue Aug 30, 2024 · 6 comments

iliana commented Aug 30, 2024

https://buildomat.eng.oxide.computer/wg/0/details/01J6JDMDGGV0TWS2FZ4NNYK5KG/IVBVeOSe2r64WYXoXFus5vZACPV8GDTwMFM49pFiK0KwvToN/01J6JDNAR2GCC48VK3B433REY1#S4121 (https://github.com/oxidecomputer/omicron/runs/29494306519)

I've seen this a few times but apparently have never filed an issue. Are we able to figure out what the fatal error was?

jclulow commented Aug 31, 2024

Yes, I believe that one is related to the occasional NVMe stalls we've seen since moving to the Nitro AWS stuff. I/O just sort of stops, the machine panics once the I/O deadman fires (~16 minutes later), and when it comes back up the buildomat agent realises it has been restarted and aborts the job.

iliana commented Sep 1, 2024

Ah. This has reactivated a memory:

For an experience similar to EBS volumes attached to Xen instances, we recommend setting nvme_core.io_timeout to the highest value possible. For current kernels, the maximum is 4294967295, while for earlier kernels the maximum is 255.

If I recall correctly, Linux's default timeout (not that it would apply here) is not long enough to deal with the fact that the underlying storage interface on Nitro hardware might crash or be updated, and reconciling the in-flight I/O can take longer than 30 seconds.
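
For reference, on a Linux guest that knob is the nvme_core module's io_timeout parameter; a minimal sketch of inspecting it (illustrative only, and as noted above it doesn't apply to the illumos workers in question) might look like:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // On Linux, nvme_core's io_timeout (in seconds) is exposed via sysfs.
    // AWS recommends raising it for EBS on Nitro, e.g. by booting with
    // nvme_core.io_timeout=4294967295 on the kernel command line.
    let current = fs::read_to_string("/sys/module/nvme_core/parameters/io_timeout")?;
    println!("nvme_core.io_timeout = {}", current.trim());
    Ok(())
}
```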

jclulow commented Sep 1, 2024

That's distressingly long! The deadman is 1000 seconds, though, or 16 of your earth minutes. I'm not yet sure whether it's their fault or something we're flubbing in the driver. The challenge is that it's hard to take a crash dump when your disk won't speak to you!

jclulow marked this as a duplicate of #67 on Jan 30, 2025

jclulow commented Jan 30, 2025

There is a second potential source of this sort of issue that has started to crop up lately, as seen in #67. In that case it seems to be related to explosive memory pressure during the build, which can apparently keep the agent from pinging the server for 10+ minutes; that's pretty alarming.

I need to improve the logging in the agent, and probably add a sort of I/O watchdog, to get more data on the potential causes of this class of failure.
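
A minimal sketch of what such an I/O watchdog could look like (hypothetical, not the actual buildomat agent code): a background thread periodically writes and syncs a small probe file and logs loudly if the round trip is slow or fails, so a stalled disk shows up in the agent's own output before the deadman fires. The probe path and thresholds below are made up for illustration.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::thread;
use std::time::{Duration, Instant};

/// Spawn a background thread that probes the local disk every `interval` and
/// logs a warning if a small write+fsync takes longer than `threshold`.
fn spawn_io_watchdog(path: &'static str, interval: Duration, threshold: Duration) {
    thread::spawn(move || loop {
        let start = Instant::now();
        let result = OpenOptions::new()
            .create(true)
            .write(true)
            .open(path)
            .and_then(|mut f| {
                f.write_all(b"ping\n")?;
                f.sync_all()
            });
        let elapsed = start.elapsed();

        match result {
            Ok(()) if elapsed > threshold => {
                eprintln!("io watchdog: probe took {:?} (threshold {:?})", elapsed, threshold);
            }
            Err(e) => eprintln!("io watchdog: probe failed after {:?}: {}", elapsed, e),
            _ => {}
        }

        thread::sleep(interval);
    });
}

fn main() {
    // Hypothetical probe location and cadence.
    spawn_io_watchdog("/var/tmp/agent-io-probe", Duration::from_secs(30), Duration::from_secs(5));
    // ... the rest of the agent would run here ...
    thread::sleep(Duration::from_secs(300));
}
```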

iliana commented Feb 1, 2025

The second source seems to be happening rather consistently on the gimlet-EVT22200007-propolis factory. I've just retried an Omicron job twice and keep hitting it. How do its core count and memory compare to the aws factory?

In local builds there's definitely something I'd describe as "explosive memory pressure" when Cargo starts running the linker for any binaries using the omicron-nexus crate, which includes all of the test binaries.

jclulow commented Feb 1, 2025

Yes, it's been occurring with increasing frequency. I'm working on flight recorder data to try to characterise it more, but so far it definitely looks like a boatload of linkers (at least 3-4) running all at once, each growing to perhaps 4-8 GB or more.

The core count and RAM size are the same in AWS and under Propolis. My current working theory is that I/O is substantially more available on the Gimlet, which improves the efficiency of the build enough that we race towards this condition with too many concurrent linkers, and then we're able to page just fast enough to get ourselves into serious trouble. That trouble lasts for quite some time but eventually resolves; while it's going on, the agent can't phone in to buildomat for ten minutes for whatever reason, which causes us to abandon the worker.

It's actually difficult to even get in with SSH when it's in the hole, as I recall.
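
For the flight-recorder angle, a rough sketch of the kind of periodic sampling that could capture the lead-up to one of these episodes (hypothetical; it assumes an illumos worker with vmstat on the PATH, and the log path is made up):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::process::Command;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn main() -> std::io::Result<()> {
    let mut log = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/var/tmp/flight-recorder.log")?;

    loop {
        // `vmstat 1 2` prints a since-boot line followed by a one-second
        // interval sample; the second line is the interesting one here.
        let out = Command::new("vmstat").args(["1", "2"]).output()?;
        let ts = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
        writeln!(log, "--- {} ---", ts)?;
        log.write_all(&out.stdout)?;
        thread::sleep(Duration::from_secs(10));
    }
}
```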
