
[walrus] make reads/writes more resilient during epoch changes #67

Merged: 9 commits into main, Feb 20, 2025

Conversation

@williamrobertson13 (Contributor) commented on Feb 16, 2025

Description

This PR makes the read flow more resilient to epoch changes and adds some basic logic for retrying when we detect that an epoch change is in progress. The changes:

  • Reads should be served from the previous committee during an epoch change if the blob was certified before the current epoch. This is because nodes in the current committee might still be receiving shards from the previous committee, so the previous committee has the most up-to-date state (see the first sketch after this list)

  • To get the epoch at which a blob was certified, we need to retrieve the verified blob status from the storage nodes. There's quite a bit of complicated logic here, but the gist is: you need a quorum of blob status responses so you can be sure that at least one honest node reported the correct blob status. After reaching quorum, you aggregate the unique status responses by total weight, and once a single status accumulates f + 1 weight you know you have the verified status. Storage nodes can be inconsistent about a blob's status, so you additionally sort the statuses from latest to earliest in the blob lifecycle, since the latest reported statuses are the more correct ones (see the second sketch after this list)

  • This introduces a retryOnPossibleEpochChange util on the WalrusClient that'll nuke the object cache and retry a function once if the function failed in a way that suggests an epoch change (see the third sketch after this list). Since the client is going to be long-lived in most cases, we should probably add a caching layer (or modify the underlying cache map in the data loader) in a follow-up to invalidate the Sui object cache leading up to epoch changes.
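
A minimal sketch of the read-path committee selection from the first bullet. The `Committee` shape, field names, and `selectReadCommittee` are illustrative assumptions, not the actual client API:

```ts
interface Committee {
	epoch: number;
	nodes: string[]; // storage node URLs
}

// Hypothetical helper: pick which committee should serve a read.
function selectReadCommittee(
	current: Committee,
	previous: Committee | null,
	certifiedEpoch: number,
	epochChangeInProgress: boolean,
): Committee {
	// During an epoch change, nodes in the new committee may still be
	// receiving shards, so blobs certified in an earlier epoch are read from
	// the previous committee, which still holds the complete state.
	if (epochChangeInProgress && previous !== null && certifiedEpoch < current.epoch) {
		return previous;
	}
	return current;
}
```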
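
A sketch of the quorum-then-aggregate logic from the second bullet. The status names, lifecycle ordering, and shard-weight fields are illustrative assumptions, and responses are assumed to have already been collected from a quorum of storage nodes:

```ts
type BlobStatus = 'nonexistent' | 'deletable' | 'permanent' | 'invalid';

// Later lifecycle states are "more correct" when nodes disagree.
// (The exact ordering here is illustrative.)
const LIFECYCLE_RANK: Record<BlobStatus, number> = {
	nonexistent: 0,
	deletable: 1,
	permanent: 2,
	invalid: 3,
};

function aggregateVerifiedStatus(
	responses: { status: BlobStatus; weight: number }[],
	f: number, // maximum total weight of faulty shards
): BlobStatus | null {
	// Sum the shard weight backing each unique status.
	const weightByStatus = new Map<BlobStatus, number>();
	for (const { status, weight } of responses) {
		weightByStatus.set(status, (weightByStatus.get(status) ?? 0) + weight);
	}

	// Check statuses from latest to earliest in the blob lifecycle; the first
	// one backed by at least f + 1 weight must include an honest node, so it
	// is taken as the verified status.
	const ordered = [...weightByStatus.entries()].sort(
		(a, b) => LIFECYCLE_RANK[b[0]] - LIFECYCLE_RANK[a[0]],
	);
	for (const [status, weight] of ordered) {
		if (weight >= f + 1) return status;
	}
	return null;
}
```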
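
And a minimal sketch of the retryOnPossibleEpochChange shape from the third bullet, with the cache-clearing and error-classification hooks passed in as parameters; the real util lives on the WalrusClient and its signature may differ:

```ts
async function retryOnPossibleEpochChange<T>(
	fn: () => Promise<T>,
	clearCaches: () => void,
	isPossibleEpochChangeError: (error: unknown) => boolean,
): Promise<T> {
	try {
		return await fn();
	} catch (error) {
		if (!isPossibleEpochChangeError(error)) throw error;
		// The failure may be due to a stale committee: nuke the Sui object
		// cache and retry exactly once with fresh state.
		clearCaches();
		return await fn();
	}
}
```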

Test plan

How did you test the new or updated feature?

  • I tested the general logic locally with 30s epoch changes, but I haven't found a way to simulate changing committees or some of the edge cases that warrant all the complicated blob-status retrieval logic. I'll ask the Walrus core folks for help with this

@williamrobertson13 self-assigned this Feb 16, 2025
@williamrobertson13 requested a review from a team as a code owner February 16, 2025 21:01

@williamrobertson13 changed the title wip → [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the title [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes → [wip-ish][walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the title [wip-ish][walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes → [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the title [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes → [walrus] make reads/writes more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the base branch from main to wrobertson/get_blob_status February 18, 2025 21:13
@williamrobertson13 changed the title [wip][walrus] make reads/writes more resilient during epoch changes → [walrus] make reads/writes more resilient during epoch changes Feb 19, 2025
@hayes-mysten (Contributor) left a comment:

This looks pretty good, but it looks like we only ever reset state when reading blobs. Do we want to do something similar during writes, so that we can get an updated committee without reading?

if (
    error instanceof NotEnoughSliversReceivedError ||
    error instanceof NotEnoughBlobConfirmationsError
) {
    // Possible epoch change: clear the cached Sui objects before retrying.
    this.#objectLoader.clearAll();
I think we should reset #nodes too (and any other cached data); we should probably just have a method that resets all the client state
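
A hypothetical shape for that reset method; the field names mirror the ones mentioned in this thread, but the real client will differ:

```ts
class WalrusClientSketch {
	#objectLoader = { clearAll() {} };
	#nodes: unknown[] | undefined;

	/** Drop all cached client state so the next request re-derives committees. */
	reset(): void {
		this.#objectLoader.clearAll();
		this.#nodes = undefined;
	}
}
```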

@williamrobertson13 changed the base branch from wrobertson/get_blob_status to main February 19, 2025 21:00
@williamrobertson13 (Contributor, Author) commented:

> This looks pretty good, but it looks like we only ever reset state when reading blobs. Do we want to do something similar during writes, so that we can get an updated committee without reading?

Yep, so the only retry case for blob writes AFAIK is if sliver writes fail. I'll double-check afterward to see if there's anything we should do if the metadata writes fail as well
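
For illustration, a sketch of what wrapping the sliver-write step in the same one-shot retry could look like. `writeSlivers` and `resetClientState` are hypothetical hooks, and the error classes are stubbed stand-ins for the ones in the diff above:

```ts
// Illustrative stubs for the error classes shown in the diff.
class NotEnoughSliversReceivedError extends Error {}
class NotEnoughBlobConfirmationsError extends Error {}

// A sliver-write failure during an epoch change clears cached state and
// retries exactly once against a fresh committee.
async function writeSliversWithRetry(
	writeSlivers: () => Promise<void>,
	resetClientState: () => void,
): Promise<void> {
	try {
		await writeSlivers();
	} catch (error) {
		if (
			error instanceof NotEnoughSliversReceivedError ||
			error instanceof NotEnoughBlobConfirmationsError
		) {
			resetClientState(); // drop cached committee/objects before the retry
			await writeSlivers();
		} else {
			throw error;
		}
	}
}
```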

@williamrobertson13 merged commit 69844ae into main Feb 20, 2025
6 checks passed