-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to upgrade framework #5383
base: 3.1
Are you sure you want to change the base?
Conversation
Modified ZooZap to add a '-prepare-for-upgrade' option which validates that the Manager lock is not held, looks for any existing FATE transactions, creates a new node in ZooKeeper that signals the instance is preparing for an upgrade, then removes all server locks. ZooZap is currently a command that can be run via the CLI and it is run from 'accumulo-cluster'. This provides several user entry points to running this command: via a new accumulo-cluster stop-for-upgrade option, a new admin command, or directly via the existing KeywordExecutable mechanism. The ZooZap prepare-for-upgrade command is intended to be used when shutting down an older version of Accumulo to prepare for the instance to be upgraded. As such, the ZooZap changes in this commit could be backported to the 2.1 branch. The AbstractServer constructor was modified such that if the new ZooKeeper upgrade preparation node exists, then the server process will fail. A message is logged instructing the user on how to undo the prepare-for-upgrade state to allow the older version of Accumulo to be used. A new PreUpgradeCheck utility is included that is intended to be run using the newer version of the Accumulo software before any of the processes are started. If the new ZooKeeper upgrade preparation node does not exist, then it verifies that no FATE transactions exist in ZooKeeper and then it deletes all server locks under known ZooKeeper paths. It can then perform any ZooKeeper modifications that are necessary before starting the newer version of Accumulo software. Finally, it creates the UpgradeProgress node in ZooKeeper and deletes the upgrade preparation node. This utility could be run on it's own or as a step in a higher level admin command. Finally, this commit changes the UpgradeCoordinator to expect the UpgradeProgress zookeeper node created by the PreUpgradeCheck utility to exist before progressing with the normal Accumulo software upgrade process that runs in the Manager
try { | ||
final String fatePath = zkRoot + Constants.ZFATE; | ||
// Adapted from UpgradeCoordinator.abortIfFateTransactions | ||
final ReadOnlyTStore<PreUpgradeCheck> fate = new ZooStore<>(fatePath, zs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 4.0 will probably need to copy a minimal amount of code from 2.1/3.1 for reading the transaction ids as the persisted format has changed a lot. These changes are fine for 3.1, the 3.1 code should be able to read the 2.1 fate ids.
@Override | ||
public String description() { | ||
return "Utility that prepares an instance to be upgraded to a new version." | ||
+ " This utility must be run before starting any servers."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to squeeze in some bigger picture guidance here.
+ " This utility must be run before starting any servers."; | |
+ " This utility must be run before starting any servers. This utility fits into a bigger picture of upgrade in which these steps should be followed : in the old version of Accumulo run prep-upgrade, run pre-upgrade (this utility) using the new Accumulo version, and then start the new version of the servers."; |
server/base/src/main/java/org/apache/accumulo/server/util/ZooZap.java
Outdated
Show resolved
Hide resolved
throw new IllegalStateException("Cannot complete upgrade preparation" | ||
+ " because FATE transactions exist. You can start a tserver, but" | ||
+ " not the Manager, then use the shell to delete completed" | ||
+ " transactions and fail pending or in-progress transactions." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This guidance does not seems like it will work. If a fate op is failed the manager would need to be up to execute the rollback. We could advise when restarting the manager to make sure any client process that will initiate table operations like compact or bulk import are stopped. Alternatively this process could create another zookeeper flag that the manager checks on startup that disables creating new fate ops and only allows ones already started to run. Then all the old servers could be started and the fates cleaned up or allowed to complete w/o worrying about some random process in the system starting new fate ops. The prep-upgrade could delete this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to point to some other documentation instead of trying to kick everything out in this Exception message? "go read the admin guide or javadoc entry for some upgrader code?"
The current guidance seems correct for all fate types other than failed.
@@ -72,6 +73,21 @@ protected AbstractServer(String appName, ConfigOpts opts, String[] args) { | |||
this.hostname = siteConfig.get(Property.GENERAL_PROCESS_BIND_ADDRESS); | |||
SecurityUtil.serverLogin(siteConfig); | |||
context = new ServerContext(siteConfig); | |||
|
|||
final String upgradePrepNode = context.getZooKeeperRoot() + Constants.ZPREPARE_FOR_UPGRADE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add an IT that veifies servers will not start when this exists. The IT could stop all servers, create the ZK flag, and then start server processes and verify they exited. Maybe the error message could be verified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test added in 6c7480c. Not sure how to test the error message that's located in the server log file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into that issue when trying to test #4513.
The AccumuloClusterHarness doesn't really expose the information needed for running a specific server type and getting back the exception from the execution process.
I played around a bit yesterday with trying to use some of the MAC cluster control methods to return the process and read the process.getErrorStream()
for the exception message but when the exception is thrown on server startup it's very difficult to get the info before the process dies.
Maybe there is a way to get the log file that was recently created and grep that for the Exception message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated UpgradeIT in 6adf8a6 to validate that error type and message that is thrown from AbstractServer when the ZPREPARE_FOR_UPGRADE node exists.
…ap.java Co-authored-by: Keith Turner <kturner@apache.org>
server/base/src/main/java/org/apache/accumulo/server/AbstractServer.java
Outdated
Show resolved
Hide resolved
server/base/src/main/java/org/apache/accumulo/server/util/ZooZap.java
Outdated
Show resolved
Hide resolved
throw new IllegalStateException("Cannot complete upgrade preparation" | ||
+ " because FATE transactions exist. You can start a tserver, but" | ||
+ " not the Manager, then use the shell to delete completed" | ||
+ " transactions and fail pending or in-progress transactions." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to point to some other documentation instead of trying to kick everything out in this Exception message? "go read the admin guide or javadoc entry for some upgrader code?"
The current guidance seems correct for all fate types other than failed.
server/base/src/main/java/org/apache/accumulo/server/util/upgrade/PreUpgradeCheck.java
Outdated
Show resolved
Hide resolved
server/base/src/main/java/org/apache/accumulo/server/util/upgrade/PreUpgradeCheck.java
Outdated
Show resolved
Hide resolved
test/src/main/java/org/apache/accumulo/test/upgrade/UpgradeProgressTrackerIT.java
Show resolved
Hide resolved
test/src/main/java/org/apache/accumulo/test/upgrade/UpgradeProgressTrackerIT.java
Outdated
Show resolved
Hide resolved
test/src/main/java/org/apache/accumulo/test/upgrade/UpgradeProgressTrackerIT.java
Outdated
Show resolved
Hide resolved
test/src/main/java/org/apache/accumulo/test/upgrade/UpgradeProgressTrackerIT.java
Outdated
Show resolved
Hide resolved
…erver.java Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ap.java Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ade/PreUpgradeCheck.java Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ade/PreUpgradeCheck.java Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
…ade/PreUpgradeCheck.java Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
Modified ZooZap to add a '-prepare-for-upgrade' option which validates that the Manager lock is not held, looks for any existing FATE transactions, creates a new node in ZooKeeper that signals the instance is preparing for an upgrade, then removes all server locks. ZooZap is currently a command that can be run via the CLI and it is run from 'accumulo-cluster'. This provides several user entry points to running this command: via a new accumulo-cluster stop-for-upgrade option, a new admin command, or directly via the existing KeywordExecutable mechanism.
The ZooZap prepare-for-upgrade command is intended to be used when shutting down an older version of Accumulo to prepare for the instance to be upgraded. As such, the ZooZap changes in this commit could be backported to the 2.1 branch.
The AbstractServer constructor was modified such that if the new ZooKeeper upgrade preparation node exists, then the server process will fail. A message is logged instructing the user on how to undo the prepare-for-upgrade state to allow the older version of Accumulo to be used.
A new PreUpgradeCheck utility is included that is intended to be run using the newer version of the Accumulo software before any of the processes are started. If the new ZooKeeper upgrade preparation node does not exist, then it verifies that no FATE transactions exist in ZooKeeper and then it deletes all server locks under known ZooKeeper paths. It can then perform any ZooKeeper modifications that are necessary before starting the newer version of Accumulo software. Finally, it creates the UpgradeProgress node in ZooKeeper and deletes the upgrade preparation node. This utility could be run on it's own or as a step in a higher level admin command.
Finally, this commit changes the UpgradeCoordinator to expect the UpgradeProgress zookeeper node created by the PreUpgradeCheck utility to exist before progressing with the normal Accumulo software upgrade process that runs in the Manager