Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rollback functionality of Unikernels #111

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

Rollback functionality of Unikernels #111

wants to merge 7 commits into from

Conversation

PizieDust
Copy link
Collaborator

This PR implements Rollback functionality for unikernel updates.
The methodology is that, if we try to update a unikernel, we store some metadata about the running unikernel to disk and then attempt the update. If this fails, we call the rollback function with the previous running build.

@PizieDust PizieDust requested review from hannesm and reynir February 25, 2025 12:41
@hannesm
Copy link
Contributor

hannesm commented Mar 2, 2025

Great to see this PR.

I struggle with the naming "current_build"and "latest_build". Can we have some more descriptive terms?

Why should we decode and encode the config to/from json, if albatross already provides asn1 decoders and encoders for that data?

You now store a list of "unikernel_updates" in the user configuration. Is this list ever getting smaller? I'm not sure what we did for cookies, but I'm a bit scared of ever-growing lists that are stored to persistent storage.

Semantics wise: I'm not sure what is expected to happen when you update twice? I guess I don't understand when a rollback is triggered, and whether in the list of updates stored persistently a unikernel identified by its name can be present twice and if yes why.

@PizieDust
Copy link
Collaborator Author

PizieDust commented Mar 3, 2025

Thank you

Great to see this PR.

I struggle with the naming "current_build"and "latest_build". Can we have some more descriptive terms?

Yes we can change the names But just to explain a bit, by current_build I mean the version of the unikernel currently running and by latest_build I mean the version that we want to upgrade to.

Why should we decode and encode the config to/from json, if albatross already provides asn1 decoders and encoders for that data?

when storing the metadata for a unikernel before updating it, i figured its important to store the config so we can always rollback using the same configurations as before. I'll look more into using the asn1 decoders/encoders and update the pr :)

You now store a list of "unikernel_updates" in the user configuration. Is this list ever getting smaller? I'm not sure what we did for cookies, but I'm a bit scared of ever-growing lists that are stored to persistent storage.

Semantics wise: I'm not sure what is expected to happen when you update twice? I guess I don't understand when a rollback is triggered, and whether in the list of updates stored persistently a unikernel identified by its name can be present twice and if yes why.

For this list, only one instance of a unikernel can be stored. so only one instance of a unikernel can exist in the list. If we update twice, the list will contain data related to the latest update only.

Also a rollback is triggered automatically if an update fails because of an unavoidable error. By this I mean, for instance if the user provides a config which will mismatch the manifest, no rollback will be performed, since the force-create-unikernel was not triggered. The rollback will mostly only be triggered in cases were force-create-unikernel fails.

`OK
| Error (`Msg _err, _status) ->
unikernel_rollback ~unikernel_name albatross store
http_client reqd user)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this here means that if anything in process_unikernel_change fails, we do a rollback. Is this desired?

From reading the code, I have the feeling that process_unikernel_change does what it tells: quite a lot, including "updating the store" (which may fail), a failed HTTP request to builds.robur.coop, manifest devices match, force_create, Albatross_json.res.

Now, only a failed force_create should potentially lead to a rollback, no? Because if there's an issue in updating the store, a failed HTTP request, when the manifest mismatches -- we should present the error to the user -- and since nothing has been done in respect to albatross, we should leave the existing unikernel running. Or am I getting this wrong?

Maybe it makes sense to separate the preconditions for deploying a unikernel from the actual deployment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you have it correctly.
I will separate the function into different purpose-focused functions.

The final rollback should only be triggered if force_create fails.

@hannesm
Copy link
Contributor

hannesm commented Mar 3, 2025

I struggle with the naming "current_build"and "latest_build". Can we have some more descriptive terms?

Yes we can change the names But just to explain a bit, by current_build I mean the version of the unikernel currently running and by latest_build I mean the version that we want to upgrade to.

What about currently_running_unikernel and to_be_updated_unikernel (I'm not too happy with these names).

Why should we decode and encode the config to/from json, if albatross already provides asn1 decoders and encoders for that data?

when storing the metadata for a unikernel before updating it, i figured its important to store the config so we can always rollback using the same configurations as before. I'll look more into using the asn1 decoders/encoders and update the pr :)

We can as well do that in a later PR. My main question is whether there's any reason to use json for everyhing.

You now store a list of "unikernel_updates" in the user configuration. Is this list ever getting smaller? I'm not sure what we did for cookies, but I'm a bit scared of ever-growing lists that are stored to persistent storage.
Semantics wise: I'm not sure what is expected to happen when you update twice? I guess I don't understand when a rollback is triggered, and whether in the list of updates stored persistently a unikernel identified by its name can be present twice and if yes why.

For this list, only one instance of a unikernel can be stored. so only one instance of a unikernel can exist in the list. If we update twice, the list will contain data related to the latest update only.

Ok. But the list will never be cleared when the update was successful.

Also a rollback is triggered automatically if an update fails because of an unavoidable error. By this I mean, for instance if the user provides a config which will mismatch the manifest, no rollback will be performed, since the force-create-unikernel was not triggered. The rollback will mostly only be triggered in cases were force-create-unikernel fails.

I don't find what you say here in the code. I commented on the code where I think that a rollback is triggered even though it shouldn't.

@PizieDust
Copy link
Collaborator Author

What about currently_running_unikernel and to_be_updated_unikernel (I'm not too happy with these names).

These names are fine, much better than what I used :)

We can as well do that in a later PR. My main question is whether there's any reason to use json for everyhing.

No we don't have to use json. I mainly used it since most of the other processes already rely on json so I wanted to reuse as much as I can

Ok. But the list will never be cleared when the update was successful.

I will update the PR to clear the list :)

@hannesm
Copy link
Contributor

hannesm commented Mar 3, 2025

Hmm, and now I struggle with: rollback being triggered when unikernel_force_create fails. But when does it fail, and why should a rollback then be beneficial?

Isn't there a lot checking done before the unikernel_force_create, so that it is unlikely to fail? i.e. manifest, we could as well look into resource usage / policies.

Wasn't the purpose of a rollback that either there's a service not responding or an operator figures shortly after the update that something is fishy? I currently think this "automatic rollback" based on the failure of "unikernel_force_create" is not super-useful.

@hannesm
Copy link
Contributor

hannesm commented Mar 3, 2025

Another thing is that we need to ensure to never rollback while we do a rollback - to avoid (infinite) rollback loops.

@PizieDust
Copy link
Collaborator Author

Hmm, and now I struggle with: rollback being triggered when unikernel_force_create fails. But when does it fail, and why should a rollback then be beneficial?

Isn't there a lot checking done before the unikernel_force_create, so that it is unlikely to fail? i.e. manifest, we could as well look into resource usage / policies.

Wasn't the purpose of a rollback that either there's a service not responding or an operator figures shortly after the update that something is fishy? I currently think this "automatic rollback" based on the failure of "unikernel_force_create" is not super-useful.

To be fair, I haven't had a single update fail that required a rollback. So in this case it's correct that rolling back during unikernel-force-create is not super useful.

Indeed all the checks are done before the upgrade is done.

@hannesm
Copy link
Contributor

hannesm commented Mar 3, 2025

Ok, cool. So what about in a first version to have a button next to "update" that is labelled "rollback" (in case there was an update within the last 15 minutes or so)?

Let's do the automation than subsequently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants