Create an admin command that performs a comprehensive check for problems #4687

keith-turner · 2024-06-20T18:30:48Z

Accumulo has at least two admin command that check for some problems. The accumulo admin checkTablets will check for a few problems related to tablets. The accumulo admin fate command can look for table locks that reference a non existent fate operation. There may be other admin command that look for problem.

Having to know which admin command to run to look for problems is a bit problematic if one does not know what problems may exists. These existing checks could be consolidated under a new command like accumulo admin check that will check as many things as possible in accumulo looking for problems. Then this single command could be run to find any problems that may exists and would be a useful thing to run periodically.

The following is a list of things this new command could check. Some of these checks already exists in different places in the accumulo code base.

Look for malformed data like :
- Zookeeper node that contains an unexpected value (like if a positive integer is expected in the value and it contains something else).
- Tablet metadata column that contains an unexpected value
- Configuration property that contains an unexpected value
Look for extraneous data like :
- Unexpected columns in the metadata table
- Unexpected nodes in zookeeper
- Unexpected files in HDFS (could be unreferenced tablet files or just junk files)
- Unexpected config properties
Look for missing data like :
- Nodes that are expected to exists in zookeeper but are not present
- Columns that are expected to exists in tablets metadata that are not present (there are a few columns every tablet is expected to have like prev row, dir, availability, etc)
- Required configuration properties that are absent, like instance.volumes
Check references like :
- Ensure table locks point to a valid table and fate operation
- Ensure tablets prev row is valid
- Ensure tablets reference files that actually exist
- Ensure tablets reference valid fate operations
Check consistency like :
- A small amount of metadata is written to DFS for each volume in Accumulo, its expected this metadata is consistent across volumes.
- Server processes that have inconsistent essential configuration.

This command could potentially find a lot of problems. Some of these problems may have automated fixes. The command could optionally automatically fix those that can be fixed. The user would need a way to control what is automatically fixed. May not want to automatically fix all problems found.

The text was updated successfully, but these errors were encountered:

keith-turner · 2024-06-20T18:38:51Z

Opened this issue after working on #4686 and trying to determine where that new functionality should go. The functionality added in #4686 could have gone under the existing admin checkTablets or admin fate command.

cshannon · 2024-06-21T11:06:13Z

This looks like it would be a really good feature to add. It would be nice to also optionally be able to configure the checks to run on some schedule and generate a report or something without having to always manually run the command. I'm not sure how you would report problems other than logs (maybe in the monitor?) but that could be useful. Of course another option is someone could just configure the command to run periodically with cron or something like that so we may not need to build in the automated running portion.

I could also see this command expanding with a few flags to make it more powerful such as being able to define which things to check, whether to automatically fix things, and probably plenty of other options that we would think of.

EdColeman · 2024-06-22T12:39:55Z

If run on a schedule, and similar to FATEs, there could be some simple metrics - maybe error count and run-time would be enough. A non-zero error count could then be used to prompt / alert to investigate further. The run time could allow for trending over time as a proxy for a performance measurement of the subsystems that are used in the check(s)

Metrics for timing sub-system checks would could add more fidelity if there are some discrete subsystems that are being used. (time to scan fate table, time to scan metadata table ...)

keith-turner · 2024-06-22T18:52:03Z

I could also see this command expanding with a few flags to make it more powerful such as being able to define which things to check

I suspect that would be needed because some checks will take a bit of time to run. If someone wants to run a specific check then they may not want to wait on everything to run.

kevinrr888 · 2024-07-26T15:19:56Z

This looks like a very interesting ticket. I would be interested in working on this, if no one else has started
@cshannon @keith-turner Were either of you planning or started working on this?

cshannon · 2024-07-26T15:22:11Z

I was not planning to anytime soon as i'm going to likely be working on gRPC stuff for a while so feel free to take a look

keith-turner · 2024-07-26T15:29:22Z

This looks like a very interesting ticket. I would be interested in working on this, if no one else has started
@cshannon @keith-turner Were either of you planning or started working on this?

No I was not, but I had some half baked thoughts about it that I will try to write up in case they are useful.

kevinrr888 · 2024-07-26T15:31:56Z

That would be good, the more info the better

ctubbsii · 2024-07-26T16:14:59Z

@kevinrr888 My suggestions would be:

Don't duplicate other tooling, like metrics, log collection/analysis, etc.
Stay under the Accumulo scope (don't try to identify problems with the whole system environment)
Focus on detection, not automatic solutions: "Observe and Report"
Use consistent terminology, and formatting
Ensure output is machine-readable
Chunk the reports into discrete and organized topics

keith-turner · 2024-07-26T16:29:45Z

@kevinrr888 here are some of the things I was thinking about.

Each check could have a well defined name that allows only that check to be run and that a user could list all possible checks. Some checks may take a long time to run and if a user wants to run a specific check then they would not want to wait on everything.

Was also thinking that internally the checks could all implement a similar interface and declare dependencies on other checks, allowing the dependencies to be run first.

So maybe we could have something like the following to list checks that could be run.

accumulo admin check list
  Check Name        Description                                                                      Depends on

  system_config     Validate the system config stored in zookeeper
  root_metadata     Checks integrity of the root tablet metadata stored in zookeeper                  system_config 
  root_table        Scans all the tablet metadata stored in the root table and checks integrity      root_metadata 
  metadata_table    Scans all the tablet metadata stored in the metadata table and checks integrity  root_table
  system_files      Checks that files in system tablet metadata exists in DFS                        root_table
  user_files        Checks that files in user tablet metadata exists in DFS                          metadata_table

Then to run checks could do something like the following that starts running all checks following the dependency graph and does not run checks if their dependency check failed. Was also thinking the command could print high level status to stdout and details about failures to stderr (so thinking details about the failure would go to check.log in example below). Ideally what ends up in check.log would be some machine readable format like json, but not really sure how this would play out.

accumulo admin check run 2> check.log

  Check Name       Status

  system_config    OK
  root_table       OK
  root_metadata    OK
  metadata_table   FAILED
  system_files     OK
  user_files       SKIPPED_DEPENDENCY_FAILED

Could somehow support a regex pattern for selecting checks to run using their well known names.

accumulo admin check run --name_pattern "*.files"  2>check.log

  Check Name       Status

  system_files     OK
  user_files       OK

In the code set of checks will be static, so we could use an enum to specify the top level set of all checks. Then the list and run commands could just operate on the all the enums. The enum could offer method to get check objects and to get dependencies. Thinking the enum will make specifying the dependcy graph (which is also static) in code really easy.

// Not sure what to name thjs
enum Check {
   SYSTEM_CONFIG,
   ROOT_TABLE,
   ROOT_METADATA,
   METADATA_TABLE,
   SYSTEM_FILES,
   USER_FILES;

   // retunrs the list of other checks the check depends on
    List<Check> getDependicies();

    // returns a well defined interface for running a check
    CheckRunner getCheckRunner();
}

Not sure if this overall structure is workable. If we can get a structure for the command laid down along with a few initial checks, then we can start adding more checks as individual PRs.

kevinrr888 · 2024-07-26T16:56:10Z

Thank you for this info
As you suggested, I can work on getting the structure laid out with the existing checks now done through this new check cmd as an initial PR, then go from there

(Part of apache#4687) - Added new `admin check` command - `admin check list` prints the checks that can be run along with their descriptions and dependencies - `admin check run` runs all the checks or a specified list of checks (explicitly provide each check, provide a regex of the checks to run, or just run all if no further args are provided). The dependencies are run first and if a checks dependency fails, the check is skipped - Current impl of `admin check run` does not do any actual work besides printing that the check is running (to ensure correct run order) and returning an `OK` status - New IT AdminCheckIT - This IT includes a verion of the Checks where no work is actually done (right now, this is equivalent to the actual implementation). This is done so correct run order, correct checks run, etc. can be verified without actually running the checks which may take a long time. More tests can be added later to test the actual check functionality when that is implemented. - No existing functionality was changed

(Part of #4687) - Added new `admin check` command - `admin check list` prints the checks that can be run along with their descriptions and dependencies - `admin check run` runs all the checks or a specified list of checks (explicitly provide each check, provide a regex of the checks to run, or just run all if no further args are provided). The dependencies are run first and if a checks dependency fails, the check is skipped - Current impl of `admin check run` does not do any actual work besides printing that the check is running (to ensure correct run order) and returning an `OK` status - New IT AdminCheckIT - This IT includes a verion of the Checks where no work is actually done (right now, this is equivalent to the actual implementation). This is done so correct run order, correct checks run, etc. can be verified without actually running the checks which may take a long time. More tests can be added later to test the actual check functionality when that is implemented. - No existing functionality was changed

kevinrr888 · 2024-09-16T15:34:15Z

Closing this as completed by #4807.
Essentially just replacing this ticket with #4892 so we don't have two tickets tracking the same thing. That ticket has some added info and links to this ticket as well.

keith-turner added this to Accumulo: Elasticity Jun 20, 2024

keith-turner converted this from a draft issue Jun 20, 2024

ctubbsii added this to the 4.0.0 milestone Jul 12, 2024

kevinrr888 self-assigned this Jul 26, 2024

kevinrr888 mentioned this issue Aug 13, 2024

New admin check command structure implementation #4806

Closed

ctubbsii linked a pull request Aug 14, 2024 that will close this issue

New admin check command structure implementation #4806

Closed

kevinrr888 mentioned this issue Aug 14, 2024

New admin check command structure implementation #4807

Merged

ctubbsii modified the milestones: 4.0.0, 3.1.0 Aug 15, 2024

ctubbsii linked a pull request Aug 15, 2024 that will close this issue

New admin check command structure implementation #4807

Merged

kevinrr888 mentioned this issue Sep 16, 2024

admin check run command functionality #4892

Open

10 tasks

kevinrr888 closed this as completed Sep 16, 2024

github-project-automation bot moved this to ✅ Done in Accumulo: Elasticity Sep 16, 2024

kevinrr888 mentioned this issue Oct 8, 2024

Checks for problems in Accumulo #4957

Merged

kevinrr888 mentioned this issue Dec 10, 2024

AdminCheckIT needs to test failure cases #5163

Closed

ctubbsii modified the milestones: 3.1.0, 4.0.0 Mar 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create an admin command that performs a comprehensive check for problems #4687

Create an admin command that performs a comprehensive check for problems #4687

keith-turner commented Jun 20, 2024 •

edited

Loading

keith-turner commented Jun 20, 2024

cshannon commented Jun 21, 2024

EdColeman commented Jun 22, 2024

keith-turner commented Jun 22, 2024

kevinrr888 commented Jul 26, 2024

cshannon commented Jul 26, 2024

keith-turner commented Jul 26, 2024

kevinrr888 commented Jul 26, 2024

ctubbsii commented Jul 26, 2024

keith-turner commented Jul 26, 2024 •

edited

Loading

kevinrr888 commented Jul 26, 2024

kevinrr888 commented Sep 16, 2024

Create an admin command that performs a comprehensive check for problems #4687

Create an admin command that performs a comprehensive check for problems #4687

Comments

keith-turner commented Jun 20, 2024 • edited Loading

keith-turner commented Jun 20, 2024

cshannon commented Jun 21, 2024

EdColeman commented Jun 22, 2024

keith-turner commented Jun 22, 2024

kevinrr888 commented Jul 26, 2024

cshannon commented Jul 26, 2024

keith-turner commented Jul 26, 2024

kevinrr888 commented Jul 26, 2024

ctubbsii commented Jul 26, 2024

keith-turner commented Jul 26, 2024 • edited Loading

kevinrr888 commented Jul 26, 2024

kevinrr888 commented Sep 16, 2024

keith-turner commented Jun 20, 2024 •

edited

Loading

keith-turner commented Jul 26, 2024 •

edited

Loading