feat: add autoware_node_death_monitor package for monitoring node crashes #1786

18 changes: 18 additions & 0 deletions system/autoware_node_death_monitor/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
cmake_minimum_required(VERSION 3.14)
project(autoware_node_death_monitor)

find_package(autoware_cmake REQUIRED)

how about node_alive_monitor or process_alive_monitor?

Author

changed name to process_alive_monitor in 572427d

autoware_package()

ament_auto_add_library(${PROJECT_NAME} SHARED
  src/autoware_node_death_monitor.cpp
)

rclcpp_components_register_node(${PROJECT_NAME}
  PLUGIN "autoware::node_death_monitor::NodeDeathMonitor"
  EXECUTABLE ${PROJECT_NAME}_node)

ament_auto_package(INSTALL_TO_SHARE
  config
  launch
)
73 changes: 73 additions & 0 deletions system/autoware_node_death_monitor/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# autoware_node_death_monitor

This package provides a monitoring node that detects ROS 2 node crashes by analyzing `launch.log` files, rather than subscribing to `/rosout` logs.

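Is it actually guaranteed that these messages are written to launch.log? If yes, I'd like the README to say so, for example "ROS 2 always writes this to launch.log by design", or alternatively "make sure the XX option is enabled".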


---

## Overview

- **Node name**: `autoware_node_death_monitor`
- **Monitored file**: `launch.log`
- **Detected event**: Looks for lines containing the substring `"process has died"` and extracts the node name and exit code.

When a crash or unexpected shutdown occurs, `ros2 launch` typically outputs a line in `launch.log` such as:

```text
[ERROR] [node_name-1]: process has died [pid 12345, exit code 139, cmd '...']
```

The `autoware_node_death_monitor` node continuously reads the latest `launch.log` file, detects these messages, and logs a warning or marks the node as "dead."
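
As a rough illustration of the extraction step, the following sketch pulls the process name and exit code out of a line in the format shown above. It uses `std::regex` and does not reflect the package's actual parser; the `ProcessDeath` struct and `parse_death_line` function are hypothetical names introduced only for this example.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for illustration; not part of the package.
struct ProcessDeath
{
  std::string node_name;  // launch process name, e.g. "node_name-1"
  int exit_code;          // value following "exit code"
};

// Returns the name and exit code if `line` is a "process has died" message
// in the format shown above, or std::nullopt for any other line.
std::optional<ProcessDeath> parse_death_line(const std::string & line)
{
  // Group 1: text inside the brackets directly before the colon.
  // Group 2: the integer following "exit code".
  static const std::regex pattern(
    R"(\[([^\]]+)\]:\s*process has died \[pid \d+, exit code (-?\d+))");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  return ProcessDeath{match[1].str(), std::stoi(match[2].str())};
}
```

Lines that do not contain the message, such as ordinary start-up output, yield `std::nullopt` and can simply be skipped by the caller.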

---

## How it Works

1. **Find `launch.log`**:
   - First, checks the `ROS_LOG_DIR` environment variable.
   - If not set, falls back to `~/.ros/log`.
   - Identifies the latest log directory based on modification time.
2. **Monitor `launch.log`**:
   - Reads the file from the last known position to detect new log entries.
   - Looks for lines containing `"process has died"`.
   - Extracts the node name and exit code.
3. **Filtering**:
   - **Ignored node names**: Nodes matching patterns in `ignore_node_names` are skipped.
   - **Ignored exit codes**: Logs with ignored exit codes are not flagged as errors.
4. **Regular Updates**:
   - A timer periodically reads new entries from `launch.log`.
   - Dead nodes are currently reported through log output; publishing them as diagnostics is planned.

---

## Parameters

| Parameter Name      | Type       | Default           | Description                                                                            |
| ------------------- | ---------- | ----------------- | -------------------------------------------------------------------------------------- |
| `ignore_node_names` | `string[]` | `[]` (empty list) | Node name patterns to ignore, e.g., `['rviz2']`.                                       |
| `ignore_exit_codes` | `int[]`    | `[]` (empty list) | Exit codes to ignore, e.g., `0` (normal exit) or `130` (terminated by Ctrl+C).         |
| `check_interval`    | `double`   | `1.0`             | Timer interval in seconds for scanning the log file.                                   |
| `enable_debug`      | `bool`     | `false`           | Enables debug logging for detailed output.                                             |
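
The two ignore lists could be applied along these lines. This is a sketch under assumptions: the README does not say how a "pattern" is matched, so plain substring matching is assumed here (so that `rviz2` also matches a launch name such as `rviz2-1`), and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: a pattern matches when it occurs anywhere in the node name.
bool is_ignored_node(
  const std::string & node_name, const std::vector<std::string> & ignore_node_names)
{
  return std::any_of(
    ignore_node_names.begin(), ignore_node_names.end(), [&](const std::string & pattern) {
      return !pattern.empty() && node_name.find(pattern) != std::string::npos;
    });
}

// Exit codes are compared for exact equality.
bool is_ignored_exit_code(
  std::int64_t exit_code, const std::vector<std::int64_t> & ignore_exit_codes)
{
  return std::find(ignore_exit_codes.begin(), ignore_exit_codes.end(), exit_code) !=
         ignore_exit_codes.end();
}

// A death is reported only if neither filter applies.
bool should_report(
  const std::string & node_name, std::int64_t exit_code,
  const std::vector<std::string> & ignore_node_names,
  const std::vector<std::int64_t> & ignore_exit_codes)
{
  return !is_ignored_node(node_name, ignore_node_names) &&
         !is_ignored_exit_code(exit_code, ignore_exit_codes);
}
```

With `ignore_node_names = ['rviz2']` and `ignore_exit_codes = [0, 130]`, a crash of `rviz2-1` or a Ctrl+C exit of any node would be suppressed, while a different node exiting with code 139 would still be reported.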

Example **`autoware_node_death_monitor.param.yaml`**:

```yaml
/**:
  ros__parameters:
    ignore_node_names:
      - rviz2
      - teleop_twist_joy
    ignore_exit_codes:
      - 0
      - 130
    check_interval: 1.0
    enable_debug: false
```

---

## Limitations

- **To be written later**: TBD.
- **Robust Monitoring**: Works alongside systemd, supervisord, or other process supervisors for enhanced fault detection.

Check warning on line 71 in system/autoware_node_death_monitor/README.md (GitHub Actions / spell-check-differential): Unknown word (supervisord)

---
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
/**:
  ros__parameters:
    # Node names to exclude from monitoring (Note: be careful with the "[node_name-#]" format)
    # Example: Do not issue a warning if rviz2 crashes.
    ignore_node_names:
      - rviz2

    # Exit codes to exclude from monitoring (e.g., Ctrl+C)
    # Example: 0, 130 are considered normal exits and not treated as errors.
    ignore_exit_codes:
      - 0
      - 130

    # Check interval (seconds)
    check_interval: 1.0

    # Enable/disable debug output
    enable_debug: false
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
// Copyright 2025 Tier IV, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#ifndef AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
#define AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_

#include "rclcpp/rclcpp.hpp"

#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <string>
#include <unordered_map>
#include <vector>

namespace autoware::node_death_monitor
{

class NodeDeathMonitor : public rclcpp::Node
{
public:
  /**
   * @brief Constructor for NodeDeathMonitor
   * @param options Node options for configuration
   */
  explicit NodeDeathMonitor(const rclcpp::NodeOptions & options);

private:
  /**
   * @brief Read and process new content appended to launch.log
   */
  void read_launch_log_diff();

  /**
   * @brief Parse a single line from the log for process death information
   * @param line The log line to parse
   */
  void parse_log_line(const std::string & line);

  /**
   * @brief Timer callback to report and manage dead node list
   */
  void on_timer();

  // Map to track dead nodes: [node_name-#] -> true
  std::unordered_map<std::string, bool> dead_nodes_;

  rclcpp::TimerBase::SharedPtr timer_;

  // Launch log file path and read position
  std::filesystem::path launch_log_path_;
  size_t last_file_pos_{static_cast<size_t>(-1)};

  // Parameters
  std::vector<std::string> ignore_node_names_;  // Node names to exclude from monitoring
  std::vector<int64_t> ignore_exit_codes_;      // Exit codes to ignore (e.g., normal termination)
  double check_interval_{1.0};                  // Check interval in seconds
  bool enable_debug_{false};                    // Enable debug output
};

} // namespace autoware::node_death_monitor

#endif // AUTOWARE_NODE_DEATH_MONITOR__AUTOWARE_NODE_DEATH_MONITOR_HPP_
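
The `read_launch_log_diff()` declaration and the `last_file_pos_` member point to incremental reading of the log file. The implementation file is not shown here, so the following is only a sketch of one way to do it with `std::ifstream`. The free function `read_new_lines` is hypothetical, and treating the `static_cast<size_t>(-1)` sentinel (or a shrunken file) as "start from the beginning" is an assumption.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <vector>

// Returns the complete lines appended to `path` since `last_pos` and advances
// `last_pos` past them. A trailing line without '\n' is left for the next call.
std::vector<std::string> read_new_lines(
  const std::filesystem::path & path, std::size_t & last_pos)
{
  std::vector<std::string> lines;
  std::error_code ec;
  const auto size = std::filesystem::file_size(path, ec);
  if (ec) {
    return lines;  // file missing or unreadable: nothing to report
  }
  // Assumption: the sentinel value or a truncated file restarts from offset 0.
  if (last_pos == static_cast<std::size_t>(-1) || last_pos > size) {
    last_pos = 0;
  }
  std::ifstream file(path, std::ios::binary);
  if (!file) {
    return lines;
  }
  file.seekg(static_cast<std::streamoff>(last_pos));
  std::string line;
  while (std::getline(file, line)) {
    if (file.eof()) {
      break;  // partial line still being written; do not consume it yet
    }
    lines.push_back(line);
    last_pos = static_cast<std::size_t>(file.tellg());
  }
  return lines;
}
```

Only advancing the position after a complete line avoids parsing a half-written "process has died" message when the timer fires while `ros2 launch` is still writing.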
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
<launch>
  <!-- Parameter -->
  <arg name="config_file" default="$(find-pkg-share autoware_node_death_monitor)/config/autoware_node_death_monitor.param.yaml"/>

  <!-- Set log level -->
  <arg name="log_level" default="info"/>

  <node pkg="autoware_node_death_monitor" exec="autoware_node_death_monitor_node" name="node_death_monitor" output="screen" args="--ros-args --log-level $(var log_level)">
    <!-- Parameter -->
    <param from="$(var config_file)"/>
  </node>
</launch>
23 changes: 23 additions & 0 deletions system/autoware_node_death_monitor/package.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
<?xml version="1.0"?>
<package format="3">
  <name>autoware_node_death_monitor</name>
  <version>0.0.1</version>
  <description>The node_death_monitor package</description>

  <maintainer email="kyoichi.sugahara@tier4.jp">Kyoichi Sugahara</maintainer>
  <license>Apache License 2.0</license>

  <buildtool_depend>ament_cmake_auto</buildtool_depend>
  <buildtool_depend>autoware_cmake</buildtool_depend>

  <depend>rcl_interfaces</depend>
  <depend>rclcpp</depend>
  <depend>rclcpp_components</depend>

  <test_depend>ament_cmake_gtest</test_depend>
  <test_depend>ament_lint_auto</test_depend>

  <export>
    <build_type>ament_cmake</build_type>
  </export>
</package>