forked from autowarefoundation/autoware.universe
feat: add autoware_node_death_monitor package for monitoring node crashes #1786
Draft: kyoichi-sugahara wants to merge 4 commits into `tier4/main` from `feat/add_node_death_monitor`
Commits:

- 8ef11c7 feat: add autoware_node_death_monitor package for monitoring node cra… (kyoichi-sugahara)
- 572427d feat: change from node_death_monitor from process_alive_monitor (kyoichi-sugahara)
- b0dae02 feat: add unimplemented features section to README for heartbeat moni… (kyoichi-sugahara)
- a847909 fix: correct rclcpp::ok usage in ProcessAliveMonitor loop condition (kyoichi-sugahara)
@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.14)
project(autoware_node_death_monitor)

find_package(autoware_cmake REQUIRED)
autoware_package()

ament_auto_add_library(${PROJECT_NAME} SHARED
  src/autoware_node_death_monitor.cpp
)

rclcpp_components_register_node(${PROJECT_NAME}
  PLUGIN "autoware::node_death_monitor::NodeDeathMonitor"
  EXECUTABLE ${PROJECT_NAME}_node)

ament_auto_package(INSTALL_TO_SHARE
  config
  launch
)
@@ -0,0 +1,73 @@
# autoware_node_death_monitor

This package provides a monitoring node that detects ROS 2 node crashes by analyzing `launch.log` files, rather than subscribing to `/rosout` logs.

Review comment: Is it actually guaranteed that this message is written to `launch.log`? If so, the README should say either that ROS 2 always writes it to `launch.log` by design, or that the XX option must be confirmed to be enabled.

---

## Overview

- **Node name**: `autoware_node_death_monitor`
- **Monitored file**: `launch.log`
- **Detected event**: Looks for lines containing the substring `"process has died"` and extracts the node name and exit code.

When a crash or unexpected shutdown occurs, `ros2 launch` typically outputs a line in `launch.log` such as:

```bash
[ERROR] [node_name-1]: process has died [pid 12345, exit code 139, cmd '...']
```

The `autoware_node_death_monitor` node continuously reads the latest `launch.log` file, detects these messages, and logs a warning or marks the node as "dead."
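
As a rough illustration of the extraction step, the following standalone sketch parses a line in the format shown above. The `ProcessDeath` struct and `parse_death_line` function are hypothetical names, not taken from the PR, and the pattern accepts an optional minus sign on the assumption that exit codes for signal-terminated processes may be logged as negative numbers.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type; the names are illustrative, not from the PR.
struct ProcessDeath
{
  std::string process_name;  // e.g. "node_name-1"
  int exit_code;
};

// Returns the process name and exit code if `line` contains a
// "process has died" entry in the format shown above, std::nullopt otherwise.
std::optional<ProcessDeath> parse_death_line(const std::string & line)
{
  // Assumed format:
  //   [ERROR] [name-N]: process has died [pid P, exit code C, cmd '...']
  // The exit code is allowed to be negative (assumption, see lead-in).
  static const std::regex pattern(
    R"(\[([^\]]+)\]: process has died \[pid \d+, exit code (-?\d+))");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  return ProcessDeath{match[1].str(), std::stoi(match[2].str())};
}
```

Because `std::regex_search` scans the whole line, a timestamp or other prefix before `[ERROR]` does not prevent a match.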

---

## How it Works

1. **Find `launch.log`**:
   - First, checks the `ROS_LOG_DIR` environment variable.
   - If not set, falls back to `~/.ros/log`.
   - Identifies the latest log directory based on modification time.
2. **Monitor `launch.log`**:
   - Reads the file from the last known position to detect new log entries.
   - Looks for lines containing `"process has died"`.
   - Extracts the node name and exit code.
3. **Filtering**:
   - **Ignored node names**: Nodes matching patterns in `ignore_node_names` are skipped.
   - **Ignored exit codes**: Logs with ignored exit codes are not flagged as errors.
4. **Regular Updates**:
   - A timer periodically reads new entries from `launch.log`.
   - Dead nodes are currently reported in the logs; this will be changed to publish diagnostics.

---

## Parameters

| Parameter Name      | Type       | Default           | Description                                                |
| ------------------- | ---------- | ----------------- | ---------------------------------------------------------- |
| `ignore_node_names` | `string[]` | `[]` (empty list) | Node name patterns to ignore. E.g., `['rviz2']`.           |
| `ignore_exit_codes` | `int[]`    | `[]` (empty list) | Exit codes to ignore (e.g., `0` or `130` for normal exit). |
| `check_interval`    | `double`   | `1.0`             | Timer interval (seconds) for scanning the log file.        |
| `enable_debug`      | `bool`     | `false`           | Enables debug logging for detailed output.                 |

Example **`autoware_node_death_monitor.param.yaml`**:

```yaml
autoware_node_death_monitor:
  ros__parameters:
    ignore_node_names:
      - rviz2
      - teleop_twist_joy
    ignore_exit_codes:
      - 0
      - 130
    check_interval: 1.0
    enable_debug: false
```

---

## Limitations

- **To be written**: TBD.
- **Robust Monitoring**: Works alongside systemd, supervisord, or other process supervisors for enhanced fault detection.

---
system/autoware_node_death_monitor/config/autoware_node_death_monitor.param.yaml
18 changes: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
/**:
  ros__parameters:
    # Node names to exclude from monitoring (Note: be careful with the "[node_name-#]" format)
    # Example: Do not issue a warning if rviz2 crashes.
    ignore_node_names:
      - rviz2

    # Exit codes to exclude from monitoring (e.g., Ctrl+C)
    # Example: 0, 130 are considered normal exits and not treated as errors.
    ignore_exit_codes:
      - 0
      - 130

    # Check interval (seconds)
    check_interval: 1.0

    # Enable/disable debug output
    enable_debug: false
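
One way the two ignore lists in this config might be applied is sketched below. The function name `should_ignore` is hypothetical, and the matching rule (a pattern matches when it appears anywhere in the `[node_name-#]` style process name) is an assumption; the diff shown here does not include the source file that defines the real behavior.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Returns true if a death report should be skipped, given the two ignore lists.
bool should_ignore(
  const std::string & process_name, std::int64_t exit_code,
  const std::vector<std::string> & ignore_node_names,
  const std::vector<std::int64_t> & ignore_exit_codes)
{
  // Exit codes are compared for exact equality.
  const bool code_ignored =
    std::find(ignore_exit_codes.begin(), ignore_exit_codes.end(), exit_code) !=
    ignore_exit_codes.end();

  // Assumption: substring match, so the pattern "rviz2" also covers "rviz2-7".
  const bool name_ignored = std::any_of(
    ignore_node_names.begin(), ignore_node_names.end(),
    [&process_name](const std::string & pattern) {
      return !pattern.empty() && process_name.find(pattern) != std::string::npos;
    });

  return code_ignored || name_ignored;
}
```

With substring matching, a short pattern can hide more processes than intended (for example `rviz` would also match `rviz2-7`), which is one reading of the "be careful" note in the config comment.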
...re_node_death_monitor/include/autoware_node_death_monitor/autoware_node_death_monitor.hpp
72 changes: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
// Copyright 2025 Tier IV, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
#define AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_

#include "rclcpp/rclcpp.hpp"

#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>

namespace autoware::node_death_monitor
{

class NodeDeathMonitor : public rclcpp::Node
{
public:
  /**
   * @brief Constructor for NodeDeathMonitor
   * @param options Node options for configuration
   */
  explicit NodeDeathMonitor(const rclcpp::NodeOptions & options);

private:
  /**
   * @brief Read and process new content appended to launch.log
   */
  void read_launch_log_diff();

  /**
   * @brief Parse a single line from the log for process death information
   * @param line The log line to parse
   */
  void parse_log_line(const std::string & line);

  /**
   * @brief Timer callback to report and manage dead node list
   */
  void on_timer();

  // Map to track dead nodes: [node_name-#] -> true
  std::unordered_map<std::string, bool> dead_nodes_;

  rclcpp::TimerBase::SharedPtr timer_;

  // Launch log file path and read position
  std::filesystem::path launch_log_path_;
  size_t last_file_pos_{static_cast<size_t>(-1)};

  // Parameters
  std::vector<std::string> ignore_node_names_;  // Node names to exclude from monitoring
  std::vector<int64_t> ignore_exit_codes_;      // Exit codes to ignore (e.g., normal termination)
  double check_interval_{1.0};                  // Check interval in seconds
  bool enable_debug_{false};                    // Enable debug output
};

}  // namespace autoware::node_death_monitor

#endif  // AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
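
The `read_launch_log_diff()` declaration and the `last_file_pos_` member suggest that the log is read incrementally from a remembered byte offset. The source file is not part of the diff shown here, so the following is only a standalone sketch of that idea; `read_new_lines` is a hypothetical free function, and the handling of partial lines and shrunken files is an assumption rather than the PR's behavior.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Reads the complete lines appended to `path` since byte offset `pos`
// and advances `pos` past them. A trailing line without a newline is left
// unread so that it can be picked up once it has been fully written.
std::vector<std::string> read_new_lines(const std::filesystem::path & path, std::size_t & pos)
{
  std::vector<std::string> lines;
  std::error_code ec;
  const auto size = std::filesystem::file_size(path, ec);
  if (ec) {
    return lines;  // File missing or unreadable: report nothing this cycle.
  }
  if (size < pos) {
    pos = 0;  // File shrank (e.g. was replaced): restart from the beginning.
  }
  std::ifstream file(path, std::ios::binary);
  if (!file) {
    return lines;
  }
  file.seekg(static_cast<std::streamoff>(pos));
  std::string line;
  while (std::getline(file, line)) {
    if (file.eof()) {
      break;  // Partial last line: do not consume it yet.
    }
    pos += line.size() + 1;  // +1 for the newline removed by getline.
    lines.push_back(line);
  }
  return lines;
}
```

A timer callback could call this once per `check_interval` and pass each returned line to a parser. Tracking the offset by hand, rather than keeping the stream open, keeps the sketch robust if the file is recreated between calls.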
system/autoware_node_death_monitor/launch/autoware_node_death_monitor.launch.xml
12 changes: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
<launch>
  <!-- Parameter -->
  <arg name="config_file" default="$(find-pkg-share autoware_node_death_monitor)/config/autoware_node_death_monitor.param.yaml"/>

  <!-- Set log level -->
  <arg name="log_level" default="info"/>

  <node pkg="autoware_node_death_monitor" exec="autoware_node_death_monitor_node" name="node_death_monitor" output="screen" args="--ros-args --log-level $(var log_level)">
    <!-- Parameter -->
    <param from="$(var config_file)"/>
  </node>
</launch>
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<package format="3">
  <name>autoware_node_death_monitor</name>
  <version>0.0.1</version>
  <description>The node_death_monitor package</description>

  <maintainer email="kyoichi.sugahara@tier4.jp">Kyoichi Sugahara</maintainer>
  <license>Apache License 2.0</license>

  <buildtool_depend>ament_cmake_auto</buildtool_depend>
  <buildtool_depend>autoware_cmake</buildtool_depend>

  <depend>rcl_interfaces</depend>
  <depend>rclcpp</depend>
  <depend>rclcpp_components</depend>

  <test_depend>ament_cmake_gtest</test_depend>
  <test_depend>ament_lint_auto</test_depend>

  <export>
    <build_type>ament_cmake</build_type>
  </export>
</package>
Review comment: How about `node_alive_monitor` or `process_alive_monitor`?

Reply: Changed the name to `process_alive_monitor` in 572427d.