
feat(evaluation): Design doc for 3D detection in ml-evaluation pipeline #4

Open · wants to merge 1 commit into main
Conversation

KSeangTan (Collaborator) commented Feb 28, 2025

Summary

This PR drafts the discussion and the necessary changes to the evaluation pipeline so that it uses autoware_perception_evaluation to evaluate the performance of perception models.

Problem Statement

The current ML pipeline uses t4_devkit when creating pickle files for training/evaluation; however, it still loads ground truths from nuscenes_devkit in the evaluation pipeline. This can introduce unintended behavior because two different devkits are used across the training and evaluation pipelines.
For example, the evaluation pipeline currently transforms both predictions and ground truths to the global coordinate frame before evaluating them. This should be avoided, since predictions and ground truths (pickle files) are already in the same coordinate system when retrieved with t4_devkit.get_sample_data.
Besides, NuScenesMetric computes different metrics than autoware_perception_eval; we should evaluate our ML experiments with the same evaluation to reduce the gap as much as possible.

Goals

  1. Remove all nuscenes_devkit dependencies in T4Metric
  2. Use autoware_perception_eval and t4_devkit for evaluation
  3. Reduce evaluation time and make sure it passes regression testing

Designs

Current Pipeline

[Figure: autoware_ml_current_evaluation.drawio (current evaluation pipeline diagram)]

Proposed Pipeline

[Figure: autoware_propose_evaluation_pipeline.drawio (proposed evaluation pipeline diagram)]

A few considerations for the new pipeline:

  1. Computation and postprocessing should be done in autoware_perception_evaluation; T4Metric should only be a wrapper/interface that runs inference and calls autoware_perception_evaluation. In that case, we should rename it to T4Evaluator.
  2. Confidence thresholds should be calibrated per class on the validation set only, and those values should then be used when running the test set (see the calibration sketch after this list).
    • In other words, users should provide confidence thresholds when independently running evaluation on a test set.
  3. Support configuration of the evaluation pipeline through configs instead of the hard-coded filtering in the current pipeline.
  4. We need to make a few changes in autoware_perception_evaluation before starting to work on T4Metric:
    • Support NuScenes metrics as mentioned here
    • Make CriticalObjectFilterConfig and PerceptionPassFailConfig optional, since they might not be used in ML experiments at the beginning
    • Support loading FrameGroundTruth and sensor data without providing dataset_paths
    • Support serialization/pickling of FrameGroundTruth and DynamicObject
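
As an illustration of point 2, here is a minimal sketch of per-class confidence-threshold calibration on the validation set: it sweeps the observed scores and keeps, per class, the threshold with the best F1. The container and function names (ClassPrediction, calibrate_thresholds) and the F1 criterion are illustrative assumptions, not part of autoware_perception_evaluation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClassPrediction:
    """Hypothetical container: one prediction of a given class on the val set."""
    score: float            # confidence score of the predicted box
    is_true_positive: bool  # whether it matched a ground-truth box


def calibrate_thresholds(
    predictions: Dict[str, List[ClassPrediction]],
    num_gt: Dict[str, int],
) -> Dict[str, float]:
    """Pick, per class, the confidence threshold that maximizes F1 on the val set."""
    thresholds: Dict[str, float] = {}
    for cls, preds in predictions.items():
        best_f1, best_thr = 0.0, 0.5
        # Sweep candidate thresholds taken from the observed scores.
        for thr in sorted({p.score for p in preds}):
            kept = [p for p in preds if p.score >= thr]
            tp = sum(p.is_true_positive for p in kept)
            fp = len(kept) - tp
            fn = num_gt[cls] - tp
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            if f1 > best_f1:
                best_f1, best_thr = f1, thr
        thresholds[cls] = best_thr
    return thresholds
```

The resulting per-class dictionary would then be passed unchanged, via the config, to any evaluation run on the test set.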

Plan of PRs:

  • autoware_perception_evaluation:
    1. Implement NuScenes metrics in autoware_perception_evaluation, including NDS and calibration of confidence thresholds (2 days)
    2. Make filters optional (0.5 day)
    3. Support loading FrameGroundTruth and sensor data without providing dataset_paths (0.5 day)
  • AWML:
    1. Introduce T4Frame and refactor inference to save predictions/GTs at every step, and also save intermediate results (results.pickle) for all scenes (1 day)
    2. Configure autoware_perception_evaluation through experiment configs, and process T4Frame with autoware_perception_evaluation.add_frame_result and autoware_perception_evaluation.get_scene_result (see the sketch below) (2 days)
    3. Visualize metrics and the worst K samples (1.5 days)
    4. Unit tests for simple cases (0.5 day)
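
To make the AWML items above concrete, here is a rough sketch of how saved frames could be replayed through autoware_perception_evaluation. Only the method names add_frame_result and get_scene_result come from this design; the evaluator object, the keyword arguments, and the results.pickle layout are assumptions for illustration.

```python
import pickle
from pathlib import Path


def evaluate_scene(results_path: Path, evaluator) -> None:
    """Replay saved frames (results.pickle) through the perception evaluator.

    `evaluator` stands in for the autoware_perception_evaluation manager; the
    exact constructor and keyword names are assumptions in this sketch.
    """
    # Each entry is assumed to hold the predictions and ground truths for one
    # frame (what the design calls T4Frame / PerceptionFrameResult).
    with results_path.open("rb") as f:
        frames = pickle.load(f)

    for frame in frames:
        # Per-frame accumulation, as listed in AWML item 2 of the plan.
        evaluator.add_frame_result(
            unix_time=frame["timestamp"],
            ground_truth_objects=frame["ground_truths"],
            estimated_objects=frame["predictions"],
        )

    # Aggregate the scene-level metrics (e.g. mAP/NDS) over all frames.
    scene_result = evaluator.get_scene_result()
    print(scene_result)
```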

ETA: 8 - 9 days

To prevent ongoing PRs from significantly impacting running experiments, we will make the changes to autoware_perception_evaluation and autoware_ml in an independent feature branch, and only merge it into main once it has been validated by regression testing and runtime measurements.

scepter914 (Collaborator) commented Mar 5, 2025

@KSeangTan

Thanks for the PR; I really appreciate you preparing such detailed documentation.

Computation and postprocessing should be done in autoware_perception_evaluation; T4Metric should only be a wrapper/interface that runs inference and calls autoware_perception_evaluation. In that case, we should rename it to T4Evaluator.

Since NuScenesMetric already runs inference and calls nuscenes-devkit, I think it is fine to keep the name T4Metric even if we use autoware_perception_evaluation, as long as T4Metric creates result.json and metrics.json in the same script.
(I also think that would be understandable for users of AWML and mmdetection3d. It might be a good idea to allow recalculating metrics from result.json by configuring it in the same way as setting a pre-trained model, using: result_json = None # or {path to json file})
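
A minimal sketch of that config idea, using an mmdetection3d-style evaluator config; the result_json field itself is the hypothetical part:

```python
# Hypothetical experiment-config fragment: when result_json is set, T4Metric
# would skip inference and recompute metrics.json from the saved result.json,
# analogous to loading a pre-trained checkpoint instead of training.
val_evaluator = dict(
    type="T4Metric",
    result_json=None,  # or "work_dirs/exp_001/result.json" to only recalculate metrics
)
```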

Confidence thresholds should be calibrated per class on the validation set only, and those values should then be used when running the test set.
In other words, users should provide confidence thresholds when independently running evaluation on a test set.

Yes, and we need to reconstruct the train/val/test splits.

Support configuration of the evaluation pipeline through configs instead of the hard-coded filtering in the current pipeline

Agreed.
Some thresholds are hard-coded now, so in some cases we should rewrite them in the core libraries, e.g. https://github.com/tier4/AWML/blob/main/autoware_ml/detection3d/evaluation/t4metric/t4metric.py#L136.
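
For instance, the filtering and thresholds could move from code into the experiment config, roughly like the following sketch; all key names and values here (evaluation_filter, max_distance, confidence_thresholds, and the numbers) are hypothetical placeholders:

```python
# Hypothetical config block replacing hard-coded filtering in t4metric.py:
# every value that is currently fixed in code becomes a tunable config entry.
evaluation_config = dict(
    class_names=["car", "truck", "bus", "bicycle", "pedestrian"],
    evaluation_filter=dict(
        max_distance=90.0,  # drop boxes beyond this range [m]
        min_points=1,       # drop GT boxes with fewer lidar points
    ),
    confidence_thresholds=dict(  # calibrated on the validation set
        car=0.35, truck=0.30, bus=0.30, bicycle=0.25, pedestrian=0.25,
    ),
)
```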

We need to make a few changes in autoware_perception_evaluation before starting to work on T4Metric:

Understood.
Let's work on each task one by one.

Pipeline design in the figure

I want to comment on "the new module in T4Metric" shown in the figure.
Can we replace "T4Frame" with "PerceptionFrameResult"?
The PerceptionAnalyzer3D class seems to support loading pickle files, according to its documentation.
If we can use the same class as autoware_perception_evaluation, it will require less maintenance.

As a nit, it would be better for the document to explain what a "T4 pickle" is.
It would be easier to understand if an example file name such as "t4dataset_base_infos_val.pkl" were given.

Plan of PRs

Your plan makes sense.
Overall, the whole pipeline looks great to me, so could you start writing the document described in this PR to /docs/design/architecture_evaluation.md, like the dataset document in this PR?
