3D Object Detection for Autonomous Driving: A Survey
Rui Qian^a, Xin Lai^b and Xirong Li^a
^a Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872, China
^b School of Mathematics, Renmin University of China, Beijing 100872, China
ARTICLE INFO
Keywords:
3D object detection
Point clouds
Autonomous driving
ABSTRACT
Autonomous driving is regarded as one of the most promising remedies to shield human beings
from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack,
especially for the sake of path planning, motion prediction, and collision avoidance. Taking a
quick glance at the progress we have made, we attribute the challenges to visual appearance recovery
in the absence of depth information from images, representation learning from partially occluded
unstructured point clouds, and semantic alignment over heterogeneous features from cross modalities.
Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently,
a large body of literature has been produced to address this 3D vision task. Nevertheless, few
investigations have looked into collecting and structuring this growing knowledge. We therefore aim
to fill this gap with a comprehensive survey, encompassing all the main concerns including sensors,
datasets, performance metrics, and the recent state-of-the-art detection methods, together with their
pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case
study on fifteen selected representative methods is presented, involving runtime analysis, error
analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis
of the surveyed works and identify promising directions for future work.
©2022 Elsevier Ltd. All rights reserved.
1. INTRODUCTION
Dream sheds light on reality. It is a dream that autonomous vehicles hit the roads legally, functioning as well as human drivers or even better, responding in a timely manner to various unconstrained driving scenarios, and being fully free of the control of human drivers, a.k.a. Level 5 with "driver off" in Fig. 1. Were the dream realized, thousands of new employment opportunities would be created for the physically impaired (Mobility), millions of lives would be saved from motor vehicle-related crashes (Safety), and billions of dollars would be saved from handling traffic accidents and treating the wounded (Economics). It is a reality that there is still no universal consensus on where we are now and how we shall go next. As illustrated in Fig. 1, we are largely above Level 2 but below or infinitely close to Level 3, taking into account the following three social concerns:
(1) Safety and Security. Rules and regulations are still lacking and shall be developed by governments to guarantee safety for an entire trip. (2) Law and Liability. How to define the major responsibility and who will take that responsibility shall be identified clearly and ethically. (3) Acceptance. Long-term efforts shall be made to establish the confidence and trust of the whole society before autonomous driving can finally be accepted. This survey paper takes a structured glance at 3D object detection, one of the core techniques for autonomous driving.
Figure 1: Levels of autonomous driving proposed by SAE (Society of Automotive Engineers) International [1]: Level 0 (No Automation), Level 1 (Driver Assistance), Level 2 (Partial Automation), Level 3 (Conditional Automation), Level 4 (High Automation), and Level 5 (Full Automation). Levels 0-2 fall under Advanced Driving Assistance Systems (ADAS) and Levels 3-5 under Autonomous Driving (AD), progressing from "no system" through "feet off", "hands off", "eyes off", and "mind off" to "driver off". Where are we now?

Perception in 3D space is a prerequisite in autonomous driving. A full understanding of what is happening right
now in front of the vehicle will facilitate downstream components to react accordingly, which is exactly what 3D object detection aims for. 3D object detection perceives and describes what surrounds us by assigning a label, what its shape looks like by drawing a bounding box, and how far away it is from the ego vehicle by giving a coordinate. Besides, 3D detection even provides a heading angle that indicates orientation. It is this upstream information from the perception stack that enables the downstream planning module to make decisions.
Figure 2: An overview of the 3D object detection task from images and point clouds. Typical challenges: (a) Point Miss. When LiDAR signals fail to return from the surface of objects. (b) External Occlusion. When LiDAR signals are blocked by occluders in the vicinity. (c) Self Occlusion. When the near side of an object blocks the far side, which makes point clouds 2.5D in practice. Note that bounding box prediction in (d) is much easier than that in (e) due to the sparsity of point clouds at long ranges.
1.1. Tasks and Challenges
Fig. 2 presents an overview of the 3D object detection task from images and point clouds. The overall goal of 3D object detection is to recognize the objects of interest by drawing an oriented 3D bounding box and assigning a label. Considering the two commonly used modalities, i.e., images and point clouds, the key challenges of this vision task are strongly tied to the way we use them, the way we represent them, and the way we combine them. With only images on hand, the core challenge arises from the absence of depth information. Although much progress has been made to recover depth from images [2,3], it is still an ill-posed inverse problem. The same object in different 3D poses can result in different visual appearances in the image plane, which is not conducive to representation learning [4]. Besides, given that a camera is a passive sensor (see Sec. 2.2.1), images are naturally vulnerable to illumination (e.g., nighttime) or rainy weather conditions. With only point clouds on hand, the key difficulty stems from representation learning. Point clouds are sparse by nature: e.g., in works [5,6], non-empty voxels normally account for only approximately 1% and 3% under typical range setups on the Waymo Open [7] and KITTI [8] datasets, respectively. Point clouds are also irregular and unordered by nature. Directly applying convolution operators to point clouds will incur "desertion of shape information and variance to point ordering" [9]. Besides, point clouds are 2.5D in practice, as they suffer from point miss (a), external occlusion (b), and self-occlusion (c) [10], as indicated in Fig. 2. Without the umbrella of convolutional neural networks, one way is to represent point clouds as voxels. The dilemma is that scaling up the voxel size loses resolution and consequently degrades localization accuracy, while scaling it down cubically increases the complexity and memory footprint as the input resolution grows. Another way is to represent point clouds as point sets. Nevertheless, around 80% of the runtime is then occupied by point retrieval, say, the ball query operation, owing to poor memory locality [11].
Figure 3: A summary showing how this survey differs from existing ones on 3D object detection. Vertically, the targeted scope concisely determines where the boundary of each investigation lies; the scopes compared include 3D object detection, 3D object detection in the context of autonomous driving, 3D shape classification, 3D object detection and tracking, 3D point cloud segmentation, and multi-modal object detection for autonomous driving. Horizontally, the hierarchical branches of this paper retain a good continuity with existing efforts [13,14] while adding new branches (indicated in bold font in the original figure) to accommodate recent dynamics, which importantly contributes to the maturity of the taxonomy on 3D object detection.
[The figure contrasts the taxonomies of four prior surveys (Rahman et al. [32], TIP'2019; Guo et al. [33], TPAMI'2020; Arnold et al. [34], TITS'2019; Di et al. [35], TITS'2021) with ours. The prior taxonomies group methods into image based, point cloud based (view/projection based, voxel-grid/volumetric based, point based), and fusion based (early, deep, late fusion) branches; into region proposal based (multi-view, segmentation, frustum based), single shot (BEV, discretization, point based), and other methods; or, for multi-modal detection, by what, how, and when to fuse. This paper adopts image based (result-lifting based, feature-lifting based), point cloud based (voxel based, point based, point-voxel based), and fusion based (sequential vs. parallel fusion, with early, deep, and late variants) branches.]
With both images and point clouds on hand, the tricky obstacle often derives from semantic alignment. Images and point clouds are two heterogeneous media, presented in the camera view and the real 3D view respectively; finding point-wise correspondences between LiDAR points and image pixels results in "feature blurring" [12].
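To make the sparsity figures quoted above more tangible, the minimal sketch below (our own illustration, not code from any surveyed work) voxelizes a point cloud over a KITTI-style range with 0.05 m voxels and reports the fraction of non-empty voxels. The synthetic uniform cloud is only a placeholder; a real LiDAR scan would have to be substituted to reproduce the roughly 1%-3% occupancy reported in [5,6].

```python
import numpy as np

def non_empty_voxel_ratio(points, pc_range, voxel_size):
    """Voxelize points and return the fraction of voxels containing at least one point.

    points:     (N, 3) array of x, y, z coordinates in meters.
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max) detection range.
    voxel_size: (dx, dy, dz) edge lengths of one voxel in meters.
    """
    pc_range = np.asarray(pc_range, dtype=np.float32)
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    grid_size = np.round((pc_range[3:] - pc_range[:3]) / voxel_size).astype(np.int64)

    # Keep only points inside the range, then map each point to a voxel index.
    mask = np.all((points >= pc_range[:3]) & (points < pc_range[3:]), axis=1)
    idx = np.floor((points[mask] - pc_range[:3]) / voxel_size).astype(np.int64)

    # Count distinct occupied voxels via a flattened 1D index.
    flat = idx[:, 0] * grid_size[1] * grid_size[2] + idx[:, 1] * grid_size[2] + idx[:, 2]
    return len(np.unique(flat)) / np.prod(grid_size)

# Illustrative setup: ~120k points, 0.05 m voxels over a 70.4 m x 80 m x 4 m range (our assumption).
rng = np.random.default_rng(0)
points = rng.uniform([0, -40, -3], [70.4, 40, 1], size=(120_000, 3)).astype(np.float32)
print(non_empty_voxel_ratio(points, (0, -40, -3, 70.4, 40, 1), (0.05, 0.05, 0.1)))
```

Even with this coarse sketch, the number of voxels (tens of millions) dwarfs the number of returned points, which is exactly the resolution-versus-memory dilemma discussed above.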
1.2. Targeted Scope, Aims, and Organization
We review literature that is closely related to 3D object detection in the context of autonomous driving. Depending on which modality is used, existing efforts are divided into the following three subdivisions: (1) image based methods [15,16,17,18,19,20,21,22], which are relatively inaccurate but several orders of magnitude cheaper, and more interpretable under the guidance of domain expertise and knowledge priors; (2) point cloud based methods [23,24,25,10,6,26,27,28,29,5], which have relatively higher accuracy and lower latency but a more prohibitive deployment cost compared with their image based counterparts; (3) multimodal fusion based methods [12,30,31,32,33,34], which currently lag behind their point cloud based counterparts but importantly provide a redundancy to fall back on in case of a malfunction or outage.
Fig. 3 presents a summary showing how this survey differs from existing ones on 3D object detection. 3D object detection itself is an algorithmic problem, whereas involving autonomous driving makes it an application issue. As of the writing of this main text in June 2021, we notice that rather few investigations [13,35,14,36] have looked into this application issue. Survey [13] focuses on 3D object detection, also taking indoor detection into account. Survey [36] involves autonomous driving but concentrates on multi-modal object detection. Survey [35] covers a series of related subtopics of 3D point clouds (e.g., 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation). Note that survey [35] establishes its taxonomy based on network architecture, which fails to summarize the homogeneous properties among methods and therefore results in overlap among the subdivisions; e.g., multi-view based and BEV based are the same in essence in terms of learning strategies. As far as we know, only survey [14] is closely relevant to this paper, but it fails to track the latest datasets (e.g., nuScenes [37] and Waymo Open [7]), algorithms (e.g., the best algorithm it reports on the KITTI [8] 3D detection benchmark is AVOD [31] [IROS18: 71.88] vs. BtcDet [10] [AAAI22: 82.86] in this paper), and challenges, which is not surprising as much progress has been made after 2018.
The aims of this paper are threefold. First, noticing that no recent literature exists to collect the growing knowledge concerning 3D object detection, we aim to fill this gap by starting with several basic concepts, providing a glimpse
of the evolution of 3D object detection, together with comprehensive comparisons on publicly available datasets and with pros and cons judiciously presented.
Witnessing the absence of a universal consensus on the taxonomy of 3D object detection, our second goal is to contribute to the maturity of this taxonomy. To this end, we are cautiously favorable to the taxonomy based on input modality, as adopted by existing literature [13,14]. The idea of grouping literature based on network architecture derives from 2D object detection; it fails to summarize the homogeneous properties among methods and therefore results in overlap among the subdivisions, e.g., multi-view based and BEV based are the same representation in essence. Another drawback is that several plug-and-play module components can be integrated into either region proposal based (two-stage) or single shot based (one-stage) detectors. For instance, VoTr [5] proposes a voxel transformer which can be easily incorporated into voxel based one-stage or two-stage detectors. Noticing that diverse fusion variants consistently emerge in 3D object detection, the existing taxonomy in works [13,14] needs to be extended. For instance, works [38,38] are sequential fusion methods (see Sec. 3.3.1), which are not well suited to the existing taxonomy. We therefore define two new paradigms, i.e., sequential fusion and parallel fusion, to adapt to the underlying changes, and further discuss explicitly which category each method belongs to, which works [13,14] do not. Also, we analyze in depth to provide a more fine-grained taxonomy above and beyond the existing efforts [13,14] on image based methods, namely result-lifting based and feature-lifting based, depending on whether an intermediate representation exists. Finally, we open a point-voxel branch to classify newly proposed variants, e.g., PV-RCNN [39]. By contrast, survey [35] directly groups PV-RCNN into "Other Methods", leaving the problem unsolved. Our third goal is to present a case study on fifteen selected models among the surveyed works, with close regard to runtime analysis, error analysis, and robustness analysis. Based on our findings, we argue that what mainly restricts detection performance is 3D location error. Taken together, this survey is expected to foster more follow-up literature in the 3D vision community.
The rest of this paper is organized as follows. Section 2 introduces background associated with foundations, sensors, datasets, and performance metrics. Section 3 reviews 3D object detection methods with their corresponding pros and cons in the context of autonomous driving. Comprehensive comparisons with the state of the art are summarized in Section 4. We conclude this paper and identify future research directions in Section 5. We also maintain a regularly updated project page at: https://github.com/rui-qian/SoTA-3D-Object-Detection.
2. BACKGROUND
2.1. Foundations
Let $\mathcal{X}$ denote the input data, say, LiDAR signals or monocular images, and let $f$ denote a detector parameterized by $\Theta$. Considering an $(F+1)$-dimensional result subset with $n$ predictions, denoted by $\{\mathbf{y}_1, \ldots, \mathbf{y}_n\} \subseteq \mathbb{R}^{F+1}$, we have

$$\Theta_{\mathrm{MLE}} = \arg\max_{\Theta} P(\mathbf{y}_1, \ldots, \mathbf{y}_n \mid \mathcal{X}; \Theta), \qquad (1)$$

where $\mathbf{y}_i = (\mathbf{b}_i, s_i)$ denotes a certain prediction of the detector $f(\mathcal{X}; \Theta)$, with bounding box $\mathbf{b}_i \in \mathbb{R}^{F}$ and probabilistic score $s_i \in [0, 1]$. In the context of autonomous driving, $\mathbf{b}_i$ is usually parameterized as portrayed in Fig. 4, which indicates the volume of the object of interest and its position relative to a reference coordinate system that can be one of the sensors
equipped on the ego-vehicle. We notice that the attributes encoded by (d) in Fig. 4 are orthogonal and therefore result in lower information redundancy compared with (a), (b), and (c). In
this paper, we adopt the form of (d) as most previous works
[25,24,6,40,41] do.
Figure 4: Comparison of 3D bounding box parameterizations: the 8-corner encoding proposed in [30], the 4-corner-with-heights encoding proposed in [31], the axis-aligned box encoding proposed in [42], and the 7-parameter oriented 3D bounding box adopted in [25,24,6,40,41].
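To make the 7-parameter encoding of Fig. 4(d) concrete, the sketch below (our own illustration; the corner ordering and the assumption of a yaw-only rotation about the vertical axis are ours, and conventions vary across datasets) converts a box given as center $(x, y, z)$, size $(l, w, h)$, and heading angle $\theta$ into its eight corners in the reference frame.

```python
import numpy as np

def box7_to_corners(x, y, z, l, w, h, theta):
    """Convert a 7-parameter box (center, size, heading) to its 8 corners.

    (x, y, z):  box center in the reference frame (e.g., the LiDAR frame).
    (l, w, h):  length, width, height of the box.
    theta:      heading angle (yaw) around the vertical axis, in radians.
    Returns an (8, 3) array of corner coordinates.
    """
    # Corners of an axis-aligned box centered at the origin.
    xs = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    ys = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    zs = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * h / 2
    corners = np.stack([xs, ys, zs], axis=1)            # (8, 3)

    # Rotate around the vertical (z) axis by the heading angle, then translate to the center.
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return corners @ rot.T + np.array([x, y, z])

# Example: a car-sized box 10 m ahead of the sensor, rotated by 30 degrees.
print(box7_to_corners(10.0, 0.0, -1.0, 3.9, 1.6, 1.56, np.deg2rad(30.0)))
```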
2.2. Sensors
We human beings leverage our visual and auditory systems to perceive the real world when driving, so what about autonomous vehicles? If they were to drive like a human, then constantly identifying what they see on the road is the way to go. To this end, sensors matter. It is sensors that empower vehicles with a series of abilities: obstacle perception, automatic emergency braking, and collision avoidance, etc. In general, the most commonly used sensors can be divided into two categories: passive sensors and active sensors [43]. The ongoing debate among industry experts is whether to equip vehicles with camera systems alone (no LiDAR), or to deploy LiDAR together with on-board camera systems. Given that the camera is considered one of the typical representatives of passive sensors, and LiDAR is regarded as a representative of active ones, we first introduce the basic concepts of passive and active sensors, and then take the camera and LiDAR as examples to discuss how they serve the autonomous driving system, with pros and cons being manifested in Table 1.
2.2.1. Passive Sensors
Passive sensors are anticipated to receive natural emissions emanating from both the Earth's surface and its atmosphere. These natural emissions can be natural light or infrared rays. Typically, a camera directly grabs a bunch of color points from the optics in the lens and arranges them into an image array that is often referred to as a digital signal for scene understanding.
Table 1
Advantages and disadvantages of different sensors.

| Sensors | Advantages | Disadvantages | Publications |
| Passive: Monocular Camera | cheap and available for multiple situations; informative color and texture attributes | no depth or range detecting feature; susceptible to weather and light conditions | [44], [15], [16], [45], [17], [41] |
| Passive: Stereo Camera | depth information provided; informative color and texture attributes; high frame rate | computationally expensive; sensitive to weather and light conditions; limited Field-of-View | [18], [46], [47] |
| Active: LiDAR | accurate depth or range detecting feature; less affected by external illumination; 360° Field-of-View | high sparseness and irregularity by nature; no color and texture attributes; expensive and critical deployment | [48], [49], [39], [40], [50], [51], [52], [53], [25], [54], [6], [55], [24], [38], [56], [57] |
| Active: Solid State LiDAR | more reliable compared with surround-view sensors; decreasing cost | errors increase when different points of view are merged in real time; still under development and limited Field-of-View | n.a. |
Table 2
A summary of publicly available datasets for 3D object detection in the context of autonomous driving. *: Numbers in brackets indicate classes evaluated in their official benchmarks.

| Dataset | Year | Train | Val | Test | Boxes | Frames | Scenes | Classes* | Night | Rain | Stereo | Temporal | LiDAR | Benchmark | Cites |
| KITTI [8,58] | 2012 | 7,418×1 | - | 7,518×1 | 200K | 15K | 50 | 8 (3) | No | No | Yes | Yes | Yes | Yes | 5011 |
| Argoverse [37] | 2019 | 39,384×7 | 15,062×7 | 12,507×7 | 993K | 44K | 113 | 15 | Yes | Yes | Yes | Yes | Yes | Yes | 88 |
| Lyft L5 [59] | 2019 | 22,690×6 | - | 27,460×6 | 1.3M | 46K | 366 | 9 | No | No | No | Yes | Yes | No | - |
| H3D [60] | 2019 | 8,873×3 | 5,170×3 | 13,678×3 | 1.1M | 27K | 160 | 8 | No | No | No | Yes | Yes | No | 31 |
| Apollo [61,62] | 2019 | - | - | - | - | 140K | 103 | 27 | Yes | Yes | Yes | Yes | Yes | Yes | 78 |
| nuScenes [37] | 2019 | 28,130×6 | 6,019×6 | 6,008×6 | 1.4M | 40K | 1,000 | 23 (10) | Yes | Yes | No | Yes | Yes | Yes | 225 |
| Waymo [7] | 2020 | 122,200×5 | 30,407×5 | 40,077×5 | 112M | 200K | 1,150 | 4 (3) | Yes | Yes | No | Yes | Yes | Yes | 31 |
Primarily, a monocular camera lends itself well to informative color and texture attributes, better visual recognition of text on road signs, and a high frame rate, all at a negligible cost. However, it lacks depth information, which is crucial for accurate location estimation in the real 3D world. To overcome this issue, stereo cameras use matching algorithms to align correspondences in the left and right images for depth recovery [2]. While cameras have shown potential as a reliable vision system, they are hardly sufficient as a standalone one, given that a camera is prone to degraded accuracy when luminosity is low at night-time or rainy weather conditions occur. Consequently, equipping autonomous vehicles with an auxiliary sensor, say an active counterpart, to fall back on is necessary in case the camera system malfunctions or disconnects in hazardous weather conditions.
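For reference, once correspondences are found on a rectified stereo pair, depth follows from the standard pinhole triangulation relation (a textbook identity, not specific to any particular matching algorithm in [2]), where $f$ is the focal length, $B$ the baseline between the two cameras, and $d$ the disparity between the matched pixels in the left and right images:

$$ Z = \frac{f\,B}{d}, \qquad d = u_{\mathrm{left}} - u_{\mathrm{right}}, \qquad \frac{\partial Z}{\partial d} = -\frac{Z^{2}}{f\,B}. $$

The last term shows that, for a fixed disparity error, the depth error grows quadratically with depth, which is one reason stereo-based perception degrades at long ranges.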
2.2.2. Active Sensors
Active sensors are expected to measure reflected signals that are transmitted by the sensor and bounced off the Earth's surface or its atmosphere. Typically, Light Detection And Ranging (LiDAR) is a point-and-shoot device with three basic components, lenses, lasers, and detectors, which spits out light pulses that bounce off the surroundings and return in the form of 3D points, referred to as "point clouds". High sparseness and irregularity by nature and the absence of texture attributes are the primary characteristics of a point cloud, which distinguish it well from an image array. Since the speed of light is known, the distance to obstacles can be determined without effort. A LiDAR system emits thousands of pulses per second that spin around in a circle, providing a 360-degree view of the surroundings for the vehicle. For example, the Velodyne HDL-64E produces 120 thousand points per frame at a 10 Hz frame rate. Obviously, LiDAR is less affected by external illumination conditions (e.g., at night-time), given that it emits light pulses by itself. Although LiDAR systems have been hailed for high accuracy and reliability compared with camera systems, this does not always hold true. Specifically, the wavelength stability of LiDAR is susceptible to variations in temperature, while adverse weather (e.g., snow or fog) is prone to result in a poor signal-to-noise ratio in the LiDAR's detector. Another issue with LiDAR is the high cost of deployment. A conservative estimate according to Velodyne, so far, is about $75,000¹ [14]. In the foreseeable future of LiDAR, how to decrease cost and how to increase resolution and range are where the whole community is marching ahead. As for the former, the advent of solid state LiDAR is expected to address the cost problem, with the help of several stationary lasers that emit light pulses along a fixed field of view. As for the latter, the newly announced Velodyne VLS-128, featuring 128 laser channels and a 300 m radius range, is already on sale and is expected to significantly facilitate better recognition and tracking in terms of public safety.
¹ http://www.woodsidecap.com
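As a back-of-the-envelope illustration of the ranging principle described above (our own sketch; the return rate of 1.2 million points per second is an assumption chosen to be consistent with the 120 thousand points per frame at 10 Hz quoted for the HDL-64 example), the snippet below converts a measured round-trip time into range and recovers the points-per-frame figure.

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def range_from_round_trip(t_round_trip_s: float) -> float:
    """Range to the reflecting surface; the pulse travels out and back, hence the factor 1/2."""
    return SPEED_OF_LIGHT * t_round_trip_s / 2.0

# A return received 0.5 microseconds after emission corresponds to roughly 75 m.
print(f"range: {range_from_round_trip(0.5e-6):.1f} m")

# Points per frame = returns per second / frames per second.
returns_per_second = 1_200_000  # assumed return rate (see lead-in)
frame_rate_hz = 10              # 10 Hz spin rate, as quoted above
print(f"points per frame: {returns_per_second // frame_rate_hz:,}")
```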