[arXiv] [Project Page]
A joint work: video-based 3D representation for 2D MOT [arXiv]
TL;DR: We propose a novel 2D supervised monocular 3D object detection paradigm, leveraging the idea of global (scene-level) to local (instance-level) 3D reconstruction.
In the era of large models, the demand for data keeps growing. In monocular 3D object detection especially, expensive manual annotation can limit further progress. Existing works have investigated weakly supervised algorithms that rely on the LiDAR modality to generate 3D pseudo labels, so they cannot be applied to ordinary videos. In this paper, we propose a novel paradigm, termed BA2-Det, which leverages the idea of global-to-local 3D reconstruction for 2D supervised monocular 3D object detection. Specifically, we recover 3D structures from monocular videos by scene-level global reconstruction with global bundle adjustment (BA) and obtain object clusters with the DoubleClustering algorithm. Learning from objects that are completely reconstructed in global BA, the GBA-Learner predicts pseudo labels for occluded objects. Finally, we train an LBA-Learner with object-centric local BA to generalize the generated 3D pseudo labels to moving objects. Experiments on the large-scale Waymo Open Dataset show that the performance of BA2-Det is on par with the fully-supervised BA-Det trained with 10% of the videos and even outperforms some pioneering fully-supervised methods. We also show the great potential of BA2-Det for detecting open-set 3D objects in complex scenes.
Pipeline of BA2-Det. We take the video sequence as input. The Global BA stage generates 3D pseudo labels from scene-level global reconstruction and includes DoubleClustering and the GBA-Learner. The labels are then passed to the Local BA stage, which iteratively learns a monocular 3D object detector.
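Both stages build on bundle adjustment, which refines camera poses and 3D points by minimizing the reprojection error of observed 2D keypoints. The snippet below is a minimal sketch of that objective only, not the implementation used in BA2-Det. It assumes a simple pinhole camera with a single focal length and no lens distortion, and the names `project` and `reprojection_error` as well as the tuple layout of cameras and observations are hypothetical.

```python
def project(point, rotation, translation, focal, center):
    """Project a 3D world point to pixel coordinates (assumed pinhole model)."""
    # World -> camera coordinates: X_c = R @ X_w + t
    xc = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsics (focal length, principal point)
    u = focal * xc[0] / xc[2] + center[0]
    v = focal * xc[1] / xc[2] + center[1]
    return (u, v)


def reprojection_error(points, cameras, observations):
    """Sum of squared pixel residuals.

    points:       list of [x, y, z] world points
    cameras:      list of (rotation 3x3, translation, focal, (cx, cy)) tuples
    observations: list of (camera index, point index, observed u, observed v)
    """
    total = 0.0
    for cam_idx, pt_idx, u_obs, v_obs in observations:
        rotation, translation, focal, center = cameras[cam_idx]
        u, v = project(points[pt_idx], rotation, translation, focal, center)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    camera = (identity, [0.0, 0.0, 0.0], 100.0, (50.0, 50.0))
    points = [[1.0, 2.0, 10.0]]
    # One observation that is off by one pixel horizontally
    print(reprojection_error(points, [camera], [(0, 0, 61.0, 70.0)]))
```

A bundle adjustment solver would minimize this quantity over the camera poses and the 3D points with a nonlinear least-squares method. In this sketch, "global" would mean running it over scene-level points across the whole video and "local" over the points of a single object, following the description above.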
Please refer to our project page for video demos.
BA2-Det:
@article{he2023ba2det,
  title={2D Supervised Monocular 3D Object Detection by Global-to-Local 3D Reconstruction},
  author={Jiawei He and Yuqi Wang and Yuntao Chen and Zhaoxiang Zhang},
  journal={arXiv preprint arXiv:2306.05418},
  year={2023}
}
P3DTrack:
@article{he2023p3dtrack,
  title={Tracking Objects with 3D Representation from Videos},
  author={Jiawei He and Lue Fan and Yuqi Wang and Yuntao Chen and Zehao Huang and Naiyan Wang and Zhaoxiang Zhang},
  journal={arXiv preprint arXiv:2306.05416},
  year={2023}
}
The authors thank these great works: FSD, LSMOL, SAM, COLMAP, hloc, SuperGlue, LoFTR.