
[walrus] make reads/writes more resilient during epoch changes #67

Merged: 9 commits into main, Feb 20, 2025

Conversation

@williamrobertson13 (Contributor) commented on Feb 16, 2025

Description

This PR makes the read flow more resilient to epoch changes and adds some basic logic for retrying when we detect that an epoch change is in progress. The changes:

  • Reads should be served from the previous committee during an epoch change if the blob was certified before the current epoch. This is because nodes in the current committee might still be receiving shards from the previous committee, so the previous committee has the most up-to-date state (see the first sketch after this list)

  • To get the epoch at which a blob was certified, we need to retrieve the verified blob status from the storage nodes. There's quite a bit of complicated logic here, but the gist is: you need a quorum of blob status responses so you can be sure that at least one honest node reported the correct blob status. After reaching quorum, you aggregate the unique status responses by total weight, and once a single status accumulates f + 1 weight you know you have the verified status. Storage nodes can be inconsistent about a blob's status, so you additionally sort the statuses from latest to earliest in the blob lifecycle, since the latest reported statuses are the more correct ones (see the second sketch after this list)

  • This introduces a retryOnPossibleEpochChange util on the WalrusClient that'll nuke the object cache and retry a function once if the function failed in a way that suggests an epoch change (see the third sketch after this list). Since the client is going to be long-lived in most cases, we should probably add a caching layer (or modify the underlying cache map in the data loader) in a follow-up to invalidate the Sui object cache leading up to epoch changes.
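
A minimal sketch of the read-path committee selection from the first bullet. The `Committee` shape, field names, and `selectReadCommittee` are illustrative assumptions, not the actual client API:

```ts
interface Committee {
	epoch: number;
	nodes: string[]; // storage node URLs
}

// Hypothetical helper: pick which committee should serve a read.
function selectReadCommittee(
	current: Committee,
	previous: Committee | null,
	certifiedEpoch: number,
	epochChangeInProgress: boolean,
): Committee {
	// During an epoch change, nodes in the new committee may still be
	// receiving shards, so blobs certified in an earlier epoch are read from
	// the previous committee, which still holds the complete state.
	if (epochChangeInProgress && previous !== null && certifiedEpoch < current.epoch) {
		return previous;
	}
	return current;
}
```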
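
A sketch of the quorum-then-aggregate logic from the second bullet. The status names, lifecycle ordering, and shard-weight fields are illustrative assumptions, and responses are assumed to have already been collected from a quorum of storage nodes:

```ts
type BlobStatus = 'nonexistent' | 'deletable' | 'permanent' | 'invalid';

// Later lifecycle states are "more correct" when nodes disagree.
// (The exact ordering here is illustrative.)
const LIFECYCLE_RANK: Record<BlobStatus, number> = {
	nonexistent: 0,
	deletable: 1,
	permanent: 2,
	invalid: 3,
};

function aggregateVerifiedStatus(
	responses: { status: BlobStatus; weight: number }[],
	f: number, // maximum total weight of faulty shards
): BlobStatus | null {
	// Sum the shard weight backing each unique status.
	const weightByStatus = new Map<BlobStatus, number>();
	for (const { status, weight } of responses) {
		weightByStatus.set(status, (weightByStatus.get(status) ?? 0) + weight);
	}

	// Check statuses from latest to earliest in the blob lifecycle; the first
	// one backed by at least f + 1 weight must include an honest node, so it
	// is taken as the verified status.
	const ordered = [...weightByStatus.entries()].sort(
		(a, b) => LIFECYCLE_RANK[b[0]] - LIFECYCLE_RANK[a[0]],
	);
	for (const [status, weight] of ordered) {
		if (weight >= f + 1) return status;
	}
	return null;
}
```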
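
And a minimal sketch of the retryOnPossibleEpochChange shape from the third bullet, with the cache-clearing and error-classification hooks passed in as parameters; the real util lives on the WalrusClient and its signature may differ:

```ts
async function retryOnPossibleEpochChange<T>(
	fn: () => Promise<T>,
	clearCaches: () => void,
	isPossibleEpochChangeError: (error: unknown) => boolean,
): Promise<T> {
	try {
		return await fn();
	} catch (error) {
		if (!isPossibleEpochChangeError(error)) throw error;
		// The failure may be due to a stale committee: nuke the Sui object
		// cache and retry exactly once with fresh state.
		clearCaches();
		return await fn();
	}
}
```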

Test plan

How did you test the new or updated feature?

  • I tested the general logic locally with 30s epoch changes, but I haven't found a way to simulate changing committees or some of the edge cases that warrant all the complicated blob-status retrieval logic. I'll ask the Walrus core folks for help with this

@williamrobertson13 self-assigned this Feb 16, 2025
@williamrobertson13 requested a review from a team as a code owner February 16, 2025 21:01

@williamrobertson13 changed the title wip → [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the title [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes → [wip-ish][walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the title [wip-ish][walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes → [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the title [walrus] add retry logic for possible epoch failures + make reads more resilient during epoch changes → [walrus] make reads/writes more resilient during epoch changes Feb 18, 2025
@williamrobertson13 changed the base branch from main to wrobertson/get_blob_status February 18, 2025 21:13
@williamrobertson13 changed the title [wip][walrus] make reads/writes more resilient during epoch changes → [walrus] make reads/writes more resilient during epoch changes Feb 19, 2025
@hayes-mysten (Contributor) left a comment:

This looks pretty good, but it looks like we only ever reset state when reading blobs. Do we want to do something similar during writes, so that we can get an updated committee without reading?

if (
    error instanceof NotEnoughSliversReceivedError ||
    error instanceof NotEnoughBlobConfirmationsError
) {
    // Possible epoch change: clear the cached Sui objects before retrying.
    this.#objectLoader.clearAll();
I think we should reset #nodes too (and any other cached data); we should probably just have a method that resets all the client state
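
A hypothetical shape for that reset method; the field names mirror the ones mentioned in this thread, but the real client will differ:

```ts
class WalrusClientSketch {
	#objectLoader = { clearAll() {} };
	#nodes: unknown[] | undefined;

	/** Drop all cached client state so the next request re-derives committees. */
	reset(): void {
		this.#objectLoader.clearAll();
		this.#nodes = undefined;
	}
}
```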

@williamrobertson13 changed the base branch from wrobertson/get_blob_status to main February 19, 2025 21:00
@williamrobertson13 (Contributor, Author) commented:

> This looks pretty good, but it looks like we only ever reset state when reading blobs. Do we want to do something similar during writes, so that we can get an updated committee without reading?

Yep, so the only retry case for blob writes AFAIK is if sliver writes fail. I'll double-check afterward to see if there's anything we should do if the metadata writes fail as well
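
For illustration, a sketch of what wrapping the sliver-write step in the same one-shot retry could look like. `writeSlivers` and `resetClientState` are hypothetical hooks, and the error classes are stubbed stand-ins for the ones in the diff above:

```ts
// Illustrative stubs for the error classes shown in the diff.
class NotEnoughSliversReceivedError extends Error {}
class NotEnoughBlobConfirmationsError extends Error {}

// A sliver-write failure during an epoch change clears cached state and
// retries exactly once against a fresh committee.
async function writeSliversWithRetry(
	writeSlivers: () => Promise<void>,
	resetClientState: () => void,
): Promise<void> {
	try {
		await writeSlivers();
	} catch (error) {
		if (
			error instanceof NotEnoughSliversReceivedError ||
			error instanceof NotEnoughBlobConfirmationsError
		) {
			resetClientState(); // drop cached committee/objects before the retry
			await writeSlivers();
		} else {
			throw error;
		}
	}
}
```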

@williamrobertson13 merged commit 69844ae into main Feb 20, 2025
6 checks passed