[walrus] make reads/writes more resilient during epoch changes #67
Conversation
… wrobertson/epoch_changes: e8166be to 9f27a67 (Compare)
This looks pretty good, but it looks like we only ever reset state when reading blobs. Do we want to do something similar during writes so that we can get an updated committee without reading?
packages/walrus/src/client.ts (Outdated)

```ts
  error instanceof NotEnoughSliversReceivedError ||
  error instanceof NotEnoughBlobConfirmationsError
) {
  this.#objectLoader.clearAll();
```
I think we should reset #nodes too (and any other cached data); we should probably just have a method that resets all the client state.
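A minimal sketch of what that reset method could look like (only `#objectLoader.clearAll()` and `#nodes` appear in this thread; the committee field is a placeholder for whatever else the client caches):

```ts
// Hypothetical reset helper on the WalrusClient, not the actual implementation.
#reset() {
  this.#objectLoader.clearAll(); // cached Sui objects
  this.#nodes = undefined; // cached storage node info
  this.#committeeInfo = undefined; // placeholder for any other cached state
}
```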
Yep, the only retry case for blob writes AFAIK is if sliver writes fail. I'll double check afterward to see if there's anything we should do if the metadata writes fail as well.
Description
This PR makes the read flow more resilient to epoch changes and adds some basic retry logic for when we detect that an epoch change is in progress. The changes are as follows.

Reads should be served from the previous committee during an epoch change if the blob was certified earlier than the current epoch. Nodes in the current committee might still be receiving shards from the previous committee, so the previous committee has the most up-to-date state.
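A minimal sketch of that selection rule, assuming the committee/epoch shape below (these names are illustrative, not the real client types):

```ts
// Illustrative types; the actual client tracks committees differently.
interface CommitteeState {
  epoch: number; // current epoch number
  inTransition: boolean; // true while an epoch change is in progress
  current: string[]; // node ids for the current committee
  previous?: string[]; // node ids for the previous committee, if known
}

function committeeForRead(state: CommitteeState, certifiedEpoch: number): string[] {
  // Blobs certified before the current epoch are read from the previous
  // committee while the epoch change is still in progress, since current
  // committee nodes may still be receiving shards.
  if (state.inTransition && certifiedEpoch < state.epoch && state.previous) {
    return state.previous;
  }
  return state.current;
}
```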
To get the epoch that a blob was certified at, we need to retrieve the verified blob status from the storage nodes. There's quite a bit of complicated logic here, but the gist is: you need a quorum of blob status responses so you can be sure that at least one honest node reported the correct blob status. After reaching quorum, you aggregate the unique status responses by weight, and once a status has `f + 1` total weight you know you have the verified status. Storage nodes can report inconsistent statuses for the same blob, so you additionally sort the statuses from latest to earliest in the blob lifecycle, since the latest reported statuses are the more correct ones.
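Roughly, the aggregation step looks like the sketch below (assuming a quorum of responses has already been collected; the status names, lifecycle ordering, and types are illustrative rather than the actual client API):

```ts
// Illustrative status names and ordering; the real lifecycle has its own set.
type BlobStatus = 'nonexistent' | 'deletable' | 'permanent' | 'invalid';

const lifecycleRank: Record<BlobStatus, number> = {
  nonexistent: 0,
  deletable: 1,
  permanent: 2,
  invalid: 3,
};

interface StatusResponse {
  status: BlobStatus;
  weight: number; // shards held by the responding node
}

function verifiedStatus(responses: StatusResponse[], f: number): BlobStatus | null {
  // Sum the weight behind each unique status.
  const weights = new Map<BlobStatus, number>();
  for (const { status, weight } of responses) {
    weights.set(status, (weights.get(status) ?? 0) + weight);
  }

  // Walk statuses from latest to earliest in the blob lifecycle and return
  // the first one with at least f + 1 weight: at most f weight can be
  // faulty, so f + 1 guarantees at least one honest reporter.
  const ordered = [...weights.entries()].sort(
    (a, b) => lifecycleRank[b[0]] - lifecycleRank[a[0]],
  );
  for (const [status, weight] of ordered) {
    if (weight >= f + 1) {
      return status;
    }
  }
  return null; // no status has reached f + 1 weight yet
}
```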
This introduces a `retryOnPossibleEpochChange` util on the WalrusClient that clears the object cache and retries a function once if the function failed in a way that could be caused by an epoch change. Since the client is going to be long-lived in most cases, we should probably add a caching layer (or modify the underlying cache map in the data loader) in a follow-up to invalidate the Sui object cache leading up to epoch changes.
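A minimal sketch of the retry shape (the error classes are the ones from the diff above and are assumed to be imported; `resetState` stands in for whatever cache clearing the client actually does):

```ts
// NotEnoughSliversReceivedError / NotEnoughBlobConfirmationsError are assumed
// to be imported from the walrus package; everything else is illustrative.
async function retryOnPossibleEpochChange<T>(
  fn: () => Promise<T>,
  resetState: () => void,
): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    const possibleEpochChange =
      error instanceof NotEnoughSliversReceivedError ||
      error instanceof NotEnoughBlobConfirmationsError;
    if (!possibleEpochChange) {
      throw error;
    }
    // Drop cached objects/committee data so the retry sees fresh state,
    // then retry exactly once.
    resetState();
    return await fn();
  }
}
```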
Test plan

How did you test the new or updated feature?