worker agent experienced a fatal error; aborting job #62

Open · iliana opened this issue Aug 30, 2024 · 6 comments

iliana commented Aug 30, 2024

https://buildomat.eng.oxide.computer/wg/0/details/01J6JDMDGGV0TWS2FZ4NNYK5KG/IVBVeOSe2r64WYXoXFus5vZACPV8GDTwMFM49pFiK0KwvToN/01J6JDNAR2GCC48VK3B433REY1#S4121 (https://github.com/oxidecomputer/omicron/runs/29494306519)

I've seen this a few times but apparently have never filed an issue. Are we able to figure out what the fatal error was?

jclulow commented Aug 31, 2024

Yes, I believe that one is related to the occasional NVMe stalls we've seen since moving to the Nitro AWS stuff. I/O just sort of stops, the machine panics once the I/O deadman fires (~16 minutes later), and when it comes back up the buildomat agent realises it has been restarted and aborts the job.

iliana commented Sep 1, 2024

Ah. This has reactivated a memory:

For an experience similar to EBS volumes attached to Xen instances, we recommend setting nvme_core.io_timeout to the highest value possible. For current kernels, the maximum is 4294967295, while for earlier kernels the maximum is 255.

If I recall correctly, Linux's default timeout (not that it would apply here) is not long enough to deal with the fact that the underlying storage interface on Nitro hardware might crash or be updated, and reconciling the in-flight I/O can take longer than 30 seconds.
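
For reference, on a Linux guest that knob is the nvme_core module's io_timeout parameter; a minimal sketch of inspecting it (illustrative only, and as noted above it doesn't apply to the illumos workers in question) might look like:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // On Linux, nvme_core's io_timeout (in seconds) is exposed via sysfs.
    // AWS recommends raising it for EBS on Nitro, e.g. by booting with
    // nvme_core.io_timeout=4294967295 on the kernel command line.
    let current = fs::read_to_string("/sys/module/nvme_core/parameters/io_timeout")?;
    println!("nvme_core.io_timeout = {}", current.trim());
    Ok(())
}
```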

jclulow commented Sep 1, 2024

That's distressingly long! The deadman is 1000 seconds, though, or 16 of your earth minutes. I'm not yet sure whether it's their fault or something we're flubbing in the driver. The challenge is that it's hard to take a crash dump when your disk won't speak to you!

jclulow marked this as a duplicate of #67 on Jan 30, 2025

jclulow commented Jan 30, 2025

There is a second potential source of this sort of issue that has started to crop up lately, as seen in #67. In that case it seems to be related to explosive memory pressure during the build, which can apparently keep the agent from pinging the server for 10+ minutes; that's pretty alarming.

I need to improve the logging in the agent, and probably add a sort of I/O watchdog, to get more data on the potential causes of this class of failure.
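
A minimal sketch of what such an I/O watchdog could look like (hypothetical, not the actual buildomat agent code): a background thread periodically writes and syncs a small probe file and logs loudly if the round trip is slow or fails, so a stalled disk shows up in the agent's own output before the deadman fires. The probe path and thresholds below are made up for illustration.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::thread;
use std::time::{Duration, Instant};

/// Spawn a background thread that probes the local disk every `interval` and
/// logs a warning if a small write+fsync takes longer than `threshold`.
fn spawn_io_watchdog(path: &'static str, interval: Duration, threshold: Duration) {
    thread::spawn(move || loop {
        let start = Instant::now();
        let result = OpenOptions::new()
            .create(true)
            .write(true)
            .open(path)
            .and_then(|mut f| {
                f.write_all(b"ping\n")?;
                f.sync_all()
            });
        let elapsed = start.elapsed();

        match result {
            Ok(()) if elapsed > threshold => {
                eprintln!("io watchdog: probe took {:?} (threshold {:?})", elapsed, threshold);
            }
            Err(e) => eprintln!("io watchdog: probe failed after {:?}: {}", elapsed, e),
            _ => {}
        }

        thread::sleep(interval);
    });
}

fn main() {
    // Hypothetical probe location and cadence.
    spawn_io_watchdog("/var/tmp/agent-io-probe", Duration::from_secs(30), Duration::from_secs(5));
    // ... the rest of the agent would run here ...
    thread::sleep(Duration::from_secs(300));
}
```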

iliana commented Feb 1, 2025

The second source seems to be happening rather consistently on the gimlet-EVT22200007-propolis factory. I've just retried an Omicron job twice and keep hitting it. How do its core count and memory compare to the aws factory?

In local builds there's definitely something I'd describe as "explosive memory pressure" when Cargo starts running the linker for any binaries using the omicron-nexus crate, which includes all of the test binaries.

jclulow commented Feb 1, 2025

Yes, it's been occurring with increasing frequency. I'm working on flight recorder data to try to characterise it more, but so far it definitely looks like a boatload of linkers (at least 3-4) running all at once, each growing to perhaps 4-8 GB or more.

The core count and RAM size are the same in AWS and under Propolis. My current working theory is that I/O is substantially more available on the Gimlet, which improves the efficiency of the build enough that we race towards this condition with too many concurrent linkers, and then we're able to page just fast enough to get ourselves into serious trouble. That trouble lasts for quite some time but eventually resolves; while it's going on, the agent can't phone in to buildomat for ten minutes for whatever reason, which causes us to abandon the worker.

It's actually difficult to even get in with SSH when it's in the hole, as I recall.
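
For the flight-recorder angle, a rough sketch of the kind of periodic sampling that could capture the lead-up to one of these episodes (hypothetical; it assumes an illumos worker with vmstat on the PATH, and the log path is made up):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::process::Command;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn main() -> std::io::Result<()> {
    let mut log = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/var/tmp/flight-recorder.log")?;

    loop {
        // `vmstat 1 2` prints a since-boot line followed by a one-second
        // interval sample; the second line is the interesting one here.
        let out = Command::new("vmstat").args(["1", "2"]).output()?;
        let ts = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
        writeln!(log, "--- {} ---", ts)?;
        log.write_all(&out.stdout)?;
        thread::sleep(Duration::from_secs(10));
    }
}
```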
