Agent failed to delete a region #1666

Open
leftwo opened this issue Mar 10, 2025 · 4 comments
@leftwo
Contributor

leftwo commented Mar 10, 2025

Crucible agent had trouble deleting a newly created region:

00:49:47.053Z INFO crucible-agent (dropshot): request completed
    latency_us = 68
    local_addr = [fd00:1122:3344:106::4]:32345
    method = GET
    remote_addr = [fd00:1122:3344:10f::3]:37967
    req_id = 386ab886-b906-4b58-ad9b-09744b267f3a
    response_code = 200
    uri = /crucible/0/regions/623874c5-efc6-4b87-aa1d-cc496e46a12e
00:49:47.395Z INFO crucible-agent (worker): region files created ok
    region = 83ba6068-1df4-49ab-972b-67d36a1aa029
00:49:47.395Z INFO crucible-agent (datafile): region 83ba6068-1df4-49ab-972b-67d36a1aa029 state: Requested -> Created
00:49:47.395Z INFO crucible-agent (worker): applying SMF actions post create...
00:49:47.456Z INFO crucible-agent (worker): disabling downstairs instance: downstairs-623874c5-efc6-4b87-aa1d-cc496e46a12e (instance states: (Some(Online), None))
00:49:47.481Z INFO crucible-agent (worker): creating missing downstairs instance downstairs-83ba6068-1df4-49ab-972b-67d36a1aa029
00:49:47.485Z INFO crucible-agent (worker): ok, have svc:/oxide/crucible/downstairs:downstairs-83ba6068-1df4-49ab-972b-67d36a1aa029
00:49:47.492Z INFO crucible-agent (worker): creating config property group
00:49:47.496Z INFO crucible-agent (worker): reconfiguring svc:/oxide/crucible/downstairs:downstairs-83ba6068-1df4-49ab-972b-67d36a1aa029
00:49:47.497Z INFO crucible-agent (worker): ensure directory SCF_TYPE_ASTRING /data/regions/83ba6068-1df4-49ab-972b-67d36a1aa029
00:49:47.497Z INFO crucible-agent (worker): ensure port SCF_TYPE_COUNT 19002
00:49:47.497Z INFO crucible-agent (worker): ensure address SCF_TYPE_ASTRING fd00:1122:3344:106::4
00:49:47.497Z INFO crucible-agent (worker): commit
00:49:47.499Z INFO crucible-agent (worker): ok!
00:49:47.507Z INFO crucible-agent (worker): SMF ok!
00:49:47.507Z INFO crucible-agent (worker): applying SMF actions before removal...
00:49:47.534Z INFO crucible-agent (worker): disabling downstairs instance: downstairs-623874c5-efc6-4b87-aa1d-cc496e46a12e (instance states: (Some(Online), Some(Disabled)))
00:49:47.538Z INFO crucible-agent (worker): SMF ok!
00:49:47.566Z INFO crucible-agent (worker): deleting zfs dataset "oxp_9b1be26b-6af2-4ec4-8074-9aafadccf73d/crucible/regions/623874c5-efc6-4b87-aa1d-cc496e46a12e"
    region = 623874c5-efc6-4b87-aa1d-cc496e46a12e
00:49:47.610Z INFO crucible-agent (dropshot): request completed
    latency_us = 155
    local_addr = [fd00:1122:3344:106::4]:32345
    method = GET
    remote_addr = [fd00:1122:3344:10e::3]:50908
    req_id = f372ce6d-7510-49e8-90cb-d14814f2d5fb
    response_code = 200
    uri = /crucible/0/regions/623874c5-efc6-4b87-aa1d-cc496e46a12e
00:49:47.623Z ERRO crucible-agent (worker): zfs dataset oxp_9b1be26b-6af2-4ec4-8074-9aafadccf73d/crucible/regions/623874c5-efc6-4b87-aa1d-cc496e46a12e delete attempt 0 failed: out: err:cannot unmount '/data/regions/623874c5-efc6-4b87-aa1d-cc496e46a12e': Device busy
    region = 623874c5-efc6-4b87-aa1d-cc496e46a12e
00:49:50.965Z INFO crucible-agent (dropshot): request completed
    latency_us = 23476
    local_addr = [fd00:1122:3344:106::4]:32345
    method = GET
    remote_addr = [fd00:1122:3344:10f::3]:37967
    req_id = f215d40e-d411-495b-804a-7d4a6e6e1d30
    response_code = 200
    uri = /crucible/0/regions/623874c5-efc6-4b87-aa1d-cc496e46a12e
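The `delete attempt 0 failed` message in the log above suggests that the delete is attempted more than once. Purely as an illustration (this is not the agent's code, which does not appear in this issue), a bounded retry around a command that can fail transiently might be sketched in shell as follows. The names `retry_cmd`, `max_attempts`, and `delay_s` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not taken from crucible-agent: run a command up to
# max_attempts times, sleeping delay_s seconds between failed attempts.
retry_cmd() {
    max_attempts=$1
    delay_s=$2
    shift 2
    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        # Run the remaining arguments as the command; stop on success.
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt failed: $*" >&2
        attempt=$((attempt + 1))
        # Do not sleep after the final failed attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay_s"
        fi
    done
    return 1
}
```

Assuming the dataset is removed with `zfs destroy`, a call could look like `retry_cmd 5 2 zfs destroy <dataset>`. A retry only helps when whatever keeps the mount point busy goes away on its own within the retry window.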
@leftwo
Contributor Author

leftwo commented Mar 10, 2025

The downstairs service ended up in maintenance:

svc:/oxide/crucible/downstairs:downstairs-623874c5-efc6-4b87-aa1d-cc496e46a12e (Oxide Crucible Downstairs)
  Zone: oxz_crucible_511fbb88-39ad-4569-9858-89bdb98333b3
 State: maintenance since Wed Mar  5 00:50:59 2025
Reason: Method failed.
   See: http://illumos.org/msg/SMF-8000-8Q
   See: /pool/ext/9b1be26b-6af2-4ec4-8074-9aafadccf73d/crypt/zone/oxz_crucible_511fbb88-39ad-4569-9858-89bdb98333b3/root/var/svc/log/oxide-crucible-downstairs:downstairs-623874c5-efc6-4b87-aa1d-cc496e46a12e.log
Impact: This service is not running.
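The status text above has the shape printed by illumos `svcs -x`, where one of the `See:` lines names the service log file. As a small sketch, the log path can be pulled out of such text so it can be passed to `tail` or `grep`. The function name `smf_log_paths` is hypothetical, and the sketch assumes that log paths are absolute and contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read service status text on stdin and print every
# "See:" target that is an absolute file path (the service log), skipping
# URL targets such as the illumos.org message link.
smf_log_paths() {
    awk '$1 == "See:" && $2 ~ /^\// { print $2 }'
}
```

For example, `svcs -x <FMRI> | smf_log_paths` would be expected to print the log path for a service in maintenance.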

morlandi7 added this to the 14 milestone on Mar 10, 2025
@mkeeter
Contributor

mkeeter commented Mar 10, 2025

Cross-linking: this is a sub-issue of https://github.com/oxidecomputer/customer-support/issues/311, which includes details about where to find the logs.

@mkeeter
Contributor

mkeeter commented Mar 10, 2025

Quoth Angela:

the log is here: /staff/core/crucible-1666/sled14_oxide-crucible-downstairs:downstairs-623874c5-efc6-4b87-aa1d-cc496e46a12e.log.1741135859

@leftwo
Contributor Author

leftwo commented Mar 10, 2025

Some prior examples (from dogfood sled 9) where we saw Device busy, but not for long enough to make the delete fail:

Match in file /pool/ext/4eb2e4eb-41d8-496c-9a5a-687d7e004aa4/crypt/debug/oxz_crucible_9c5d88c9-8ff1-4f23-9438-7b81322eaf68/oxide-crucible-agent:default.log.1714700194
2024-05-03 01:36:29.488Z ERRO crucible-agent/8668 (worker) on oxz_crucible_9c5d88c9-8ff1-4f23-9438-7b81322eaf68: zfs dataset oxp_aadf48eb-6ff0-40b5-a092-1fdd06c03e11/crucible/regions/924aae1a-51d6-4207-b739-703bffc454d4 delete attempt 0 failed: out: err:cannot unmount '/data/regions/924aae1a-51d6-4207-b739-703bffc454d4': Device busy
    region = 924aae1a-51d6-4207-b739-703bffc454d4
2024-05-03 01:36:31.530Z ERRO crucible-agent/8668 (worker) on oxz_crucible_9c5d88c9-8ff1-4f23-9438-7b81322eaf68: zfs dataset oxp_aadf48eb-6ff0-40b5-a092-1fdd06c03e11/crucible/regions/924aae1a-51d6-4207-b739-703bffc454d4 delete attempt 1 failed: out: err:cannot unmount '/data/regions/924aae1a-51d6-4207-b739-703bffc454d4': Device busy
    region = 924aae1a-51d6-4207-b739-703bffc454d4

Found with:

for zzz in $(zoneadm list | grep crucible | grep -v pantry); do
    echo "zone $zzz"
    for file in $(/opt/oxide/oxlog/oxlog logs --archived "$zzz" | grep crucible-agent); do
        if grep "cannot unmount" "$file" | grep -v 2023 > /dev/null; then
            echo "Match in file $file"
            grep "cannot unmount" "$file" | looker -o long
        fi
    done
done
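To see how long `Device busy` persisted in each case, the matched lines can be reduced to the highest failed attempt number per dataset. The following is a minimal sketch that assumes the message format shown in the matches above (`zfs dataset <name> delete attempt <n> failed`). The function name `max_delete_attempts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read crucible-agent log lines on stdin and print,
# for each dataset, the highest failed delete attempt number seen.
max_delete_attempts() {
    awk '
        /delete attempt [0-9]+ failed/ {
            # The dataset name follows the word "dataset"; the attempt
            # number follows the word "attempt".
            for (i = 1; i <= NF; i++) {
                if ($i == "dataset") ds = $(i + 1)
                if ($i == "attempt") n = $(i + 1) + 0
            }
            if (!(ds in max) || n > max[ds]) max[ds] = n
        }
        END { for (ds in max) print ds, max[ds] }
    ' | sort
}
```

It could be fed from the same search, for example `grep "cannot unmount" "$file" | max_delete_attempts`. Attempts are numbered from 0, so a value of 1 means two failed attempts were logged.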
