Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to upgrade framework #5383

Draft
wants to merge 11 commits into
base: 3.1
Choose a base branch
from
Draft

Conversation

dlmarion
Copy link
Contributor

@dlmarion dlmarion commented Mar 6, 2025

Modified ZooZap to add a '-prepare-for-upgrade' option which validates that the Manager lock is not held, looks for any existing FATE transactions, creates a new node in ZooKeeper that signals the instance is preparing for an upgrade, then removes all server locks. ZooZap is currently a command that can be run via the CLI and it is run from 'accumulo-cluster'. This provides several user entry points to running this command: via a new accumulo-cluster stop-for-upgrade option, a new admin command, or directly via the existing KeywordExecutable mechanism.

The ZooZap prepare-for-upgrade command is intended to be used when shutting down an older version of Accumulo to prepare for the instance to be upgraded. As such, the ZooZap changes in this commit could be backported to the 2.1 branch.

The AbstractServer constructor was modified such that if the new ZooKeeper upgrade preparation node exists, then the server process will fail. A message is logged instructing the user on how to undo the prepare-for-upgrade state to allow the older version of Accumulo to be used.

A new PreUpgradeCheck utility is included that is intended to be run using the newer version of the Accumulo software before any of the processes are started. If the new ZooKeeper upgrade preparation node does not exist, then it verifies that no FATE transactions exist in ZooKeeper and then it deletes all server locks under known ZooKeeper paths. It can then perform any ZooKeeper modifications that are necessary before starting the newer version of Accumulo software. Finally, it creates the UpgradeProgress node in ZooKeeper and deletes the upgrade preparation node. This utility could be run on it's own or as a step in a higher level admin command.

Finally, this commit changes the UpgradeCoordinator to expect the UpgradeProgress zookeeper node created by the PreUpgradeCheck utility to exist before progressing with the normal Accumulo software upgrade process that runs in the Manager

Modified ZooZap to add a '-prepare-for-upgrade' option which
validates that the Manager lock is not held, looks for any
existing FATE transactions, creates a new node in ZooKeeper
that signals the instance is preparing for an upgrade, then
removes all server locks. ZooZap is currently a command that
can be run via the CLI and it is run from 'accumulo-cluster'.
This provides several user entry points to running this command:
via a new accumulo-cluster stop-for-upgrade option, a new admin
command, or directly via the existing KeywordExecutable mechanism.

The ZooZap prepare-for-upgrade command is intended to be used when
shutting down an older version of Accumulo to prepare for the
instance to be upgraded. As such, the ZooZap changes in this commit
could be backported to the 2.1 branch.

The AbstractServer constructor was modified such that if the new
ZooKeeper upgrade preparation node exists, then the server process
will fail. A message is logged instructing the user on how to undo
the prepare-for-upgrade state to allow the older version of Accumulo
to be used.

A new PreUpgradeCheck utility is included that is intended to be
run using the newer version of the Accumulo software before any
of the processes are started. If the new ZooKeeper upgrade
preparation node does not exist, then it verifies that no FATE
transactions exist in ZooKeeper and then it deletes all server
locks under known ZooKeeper paths. It can then perform any
ZooKeeper modifications that are necessary before starting the
newer version of Accumulo software. Finally, it creates the
UpgradeProgress node in ZooKeeper and deletes the upgrade
preparation node. This utility could be run on it's own or as
a step in a higher level admin command.

Finally, this commit changes the UpgradeCoordinator to expect
the UpgradeProgress zookeeper node created by the PreUpgradeCheck
utility to exist before progressing with the normal Accumulo
software upgrade process that runs in the Manager
@dlmarion dlmarion added this to the 3.1.0 milestone Mar 6, 2025
@dlmarion dlmarion self-assigned this Mar 6, 2025
try {
final String fatePath = zkRoot + Constants.ZFATE;
// Adapted from UpgradeCoordinator.abortIfFateTransactions
final ReadOnlyTStore<PreUpgradeCheck> fate = new ZooStore<>(fatePath, zs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 4.0 will probably need to copy a minimal amount of code from 2.1/3.1 for reading the transaction ids as the persisted format has changed a lot. These changes are fine for 3.1, the 3.1 code should be able to read the 2.1 fate ids.

@Override
public String description() {
return "Utility that prepares an instance to be upgraded to a new version."
+ " This utility must be run before starting any servers.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to squeeze in some bigger picture guidance here.

Suggested change
+ " This utility must be run before starting any servers.";
+ " This utility must be run before starting any servers. This utility fits into a bigger picture of upgrade in which these steps should be followed : in the old version of Accumulo run prep-upgrade, run pre-upgrade (this utility) using the new Accumulo version, and then start the new version of the servers.";

throw new IllegalStateException("Cannot complete upgrade preparation"
+ " because FATE transactions exist. You can start a tserver, but"
+ " not the Manager, then use the shell to delete completed"
+ " transactions and fail pending or in-progress transactions."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This guidance does not seems like it will work. If a fate op is failed the manager would need to be up to execute the rollback. We could advise when restarting the manager to make sure any client process that will initiate table operations like compact or bulk import are stopped. Alternatively this process could create another zookeeper flag that the manager checks on startup that disables creating new fate ops and only allows ones already started to run. Then all the old servers could be started and the fates cleaned up or allowed to complete w/o worrying about some random process in the system starting new fate ops. The prep-upgrade could delete this flag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to point to some other documentation instead of trying to kick everything out in this Exception message? "go read the admin guide or javadoc entry for some upgrader code?"

The current guidance seems correct for all fate types other than failed.

@@ -72,6 +73,21 @@ protected AbstractServer(String appName, ConfigOpts opts, String[] args) {
this.hostname = siteConfig.get(Property.GENERAL_PROCESS_BIND_ADDRESS);
SecurityUtil.serverLogin(siteConfig);
context = new ServerContext(siteConfig);

final String upgradePrepNode = context.getZooKeeperRoot() + Constants.ZPREPARE_FOR_UPGRADE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add an IT that veifies servers will not start when this exists. The IT could stop all servers, create the ZK flag, and then start server processes and verify they exited. Maybe the error message could be verified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test added in 6c7480c. Not sure how to test the error message that's located in the server log file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into that issue when trying to test #4513.
The AccumuloClusterHarness doesn't really expose the information needed for running a specific server type and getting back the exception from the execution process.

I played around a bit yesterday with trying to use some of the MAC cluster control methods to return the process and read the process.getErrorStream() for the exception message but when the exception is thrown on server startup it's very difficult to get the info before the process dies.

Maybe there is a way to get the log file that was recently created and grep that for the Exception message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated UpgradeIT in 6adf8a6 to validate that error type and message that is thrown from AbstractServer when the ZPREPARE_FOR_UPGRADE node exists.

…ap.java

Co-authored-by: Keith Turner <kturner@apache.org>
throw new IllegalStateException("Cannot complete upgrade preparation"
+ " because FATE transactions exist. You can start a tserver, but"
+ " not the Manager, then use the shell to delete completed"
+ " transactions and fail pending or in-progress transactions."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to point to some other documentation instead of trying to kick everything out in this Exception message? "go read the admin guide or javadoc entry for some upgrader code?"

The current guidance seems correct for all fate types other than failed.

dlmarion and others added 7 commits March 11, 2025 08:01
…erver.java

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ap.java

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ade/PreUpgradeCheck.java

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ade/PreUpgradeCheck.java

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ade/PreUpgradeCheck.java

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants