
[AN-356] When a cluster fails to start up, don't detach persistent disk #4821

Merged: 2 commits merged into develop on Jan 16, 2025

Conversation


@lucymcnatt lucymcnatt commented Jan 13, 2025

Jira ticket: https://broadworkbench.atlassian.net/browse/AN-356

Summary of changes

What

This PR moves the detach logic so that the disk is detached only when a runtime fails to create and the disk is not in a Creating or Failed state.

Why

When the startup script fails (due to a full disk, etc.), the persistent disk becomes 'detached' in the DB (the disk id is removed from the RUNTIME_CONFIG).

This means that a user with a full disk cannot even try to increase their disk size in the UI, because they will get a 'persistent disk not found for runtime' error.
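The new guard can be sketched as follows. This is an illustrative model, not the actual Leonardo code: the DiskStatus values and the shouldDetach name are hypothetical stand-ins for whatever the runtime monitor actually uses.

```scala
// Hypothetical sketch of the detach rule described above; not the real
// Leonardo implementation.
object DetachRule {
  sealed trait DiskStatus
  case object Creating extends DiskStatus
  case object Failed   extends DiskStatus
  case object Ready    extends DiskStatus

  // Detach only when runtime creation failed AND the disk is neither
  // Creating nor Failed. A disk that merely filled up during startup
  // therefore keeps its runtime association, so the user can still
  // resize it from the UI.
  def shouldDetach(runtimeCreateFailed: Boolean, disk: DiskStatus): Boolean =
    runtimeCreateFailed && (disk != Creating) && (disk != Failed)
}
```

Under this sketch, a Ready disk on a failed create is detached, while a Creating or Failed disk is left attached.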

Testing these changes

What to test

  • Create a Jupyter runtime with a small disk
  • Open the terminal and run df -Th to see how much space is available on sdb
  • fallocate a file to fill up the remaining space
  • Pause the runtime
  • Start the runtime --> it should fail
  • Increase the disk size and start the runtime again

Who tested and where

  • This change is covered by automated tests
    • NB: Rerun automation tests on this PR by commenting jenkins retest or jenkins multi-test.
  • I validated this change
  • Primary reviewer validated this change
  • I validated this change in the dev environment

@lucymcnatt lucymcnatt marked this pull request as ready for review January 13, 2025 21:57
@lucymcnatt lucymcnatt requested a review from a team as a code owner January 13, 2025 21:57

codecov bot commented Jan 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.62%. Comparing base (dce08ef) to head (055d80b).
Report is 2 commits behind head on develop.


@@             Coverage Diff             @@
##           develop    #4821      +/-   ##
===========================================
- Coverage    74.62%   74.62%   -0.01%     
===========================================
  Files          166      166              
  Lines        14692    14690       -2     
  Branches      1135     1158      +23     
===========================================
- Hits         10964    10962       -2     
  Misses        3728     3728              
Files with missing lines                              | Coverage Δ
...nardo/monitor/BaseCloudServiceRuntimeMonitor.scala | 89.35% <100.00%> (-0.09%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -101,6 +101,7 @@ class BaseCloudServiceRuntimeMonitorSpec extends AnyFlatSpec with Matchers with
disk <- makePersistentDisk().save()
start <- IO.realTimeInstant
tid <- traceId.ask[TraceId]
implicit0(ec: ExecutionContext) = scala.concurrent.ExecutionContext.Implicits.global
Collaborator


What does this do? Still wrapping my head around how to use implicits properly 😭

Collaborator Author


...honestly I'm not entirely sure myself, just that I needed the EC implicit to do the disk query
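For context on the exchange above: `implicit0(...)` comes from the better-monadic-for compiler plugin and lets a for-comprehension bind a value as an implicit, so later steps that need an implicit ExecutionContext (such as the disk query here) can resolve it. A minimal plain-Scala illustration of the same implicit resolution, without the plugin; the countDisks name and its body are hypothetical:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ImplicitEcDemo {
  // A query-like method that requires an ExecutionContext, similar to
  // how DB actions need one; the caller supplies it implicitly.
  def countDisks()(implicit ec: ExecutionContext): Future[Int] =
    Future(1)

  def run(): Int = {
    // Marking the value implicit puts it in scope, so countDisks()
    // compiles without passing the ExecutionContext explicitly.
    implicit val ec: ExecutionContext = ExecutionContext.global
    Await.result(countDisks(), Duration.Inf)
  }
}
```

In the test diff above, the `implicit0(ec: ExecutionContext) = ...` line plays the same role as the `implicit val` here, just inside a for-comprehension.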

@@ -303,6 +311,47 @@ class BaseCloudServiceRuntimeMonitorSpec extends AnyFlatSpec with Matchers with
res.unsafeRunSync()(cats.effect.unsafe.IORuntime.global)
}

it should "detach Ready disk on failed runtime create" in isolatedDbTest {
Collaborator


I am thinking: should we also detach the PD when the runtime is in Deleting status, not just Deleted?

Collaborator Author


That's done as part of the deletedRuntime function in the GceRuntimeMonitor (completed deletion detaches the disk).

@lucymcnatt lucymcnatt merged commit 3fd8cc2 into develop Jan 16, 2025
23 checks passed
@lucymcnatt lucymcnatt deleted the AN-356-disk-detachment branch January 16, 2025 14:47