3D Object Detection for Autonomous Driving: A Survey

Telechargé par Sparsh Garg

Téléchargement

Rui Qiana,Xin Laiband Xirong Lia,∗

aKey Lab of Data Engineering and Knowledge Engineering, Renmin University of China, Beijing 100872, China.

bSchool of Mathematics, Renmin University of China, Beijing 100872, China

ARTICLE INFO

Keywords:

3D object detection

Point clouds

Autonomous driving

ABSTRACT

Autonomous driving is regarded as one of the most promising remedies to shield human beings

from severe crashes. To this end, 3D object detection serves as the core basis of perception stack

especially for the sake of path planning, motion prediction, and collision avoidance etc. Taking a

quick glance at the progress we have made, we attribute challenges to visual appearance recovery

in the absence of depth information from images, representation learning from partially occluded

unstructured point clouds, and semantic alignments over heterogeneous features from cross modalities.

Despite existing eﬀorts, 3D object detection for autonomous driving is still in its infancy. Recently,

a large body of literature have been investigated to address this 3D vision task. Nevertheless, few

investigations have looked into collecting and structuring this growing knowledge. We therefore aim

to ﬁll this gap in a comprehensive survey, encompassing all the main concerns including sensors,

datasets, performance metrics and the recent state-of-the-art detection methods, together with their

pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case

study on ﬁfteen selected representative methods is presented, involved with runtime analysis, error

analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis

of the surveyed works and identify promising directions for future work.

1. INTRODUCTION

Dream sheds light on reality. It is a dream that au-

tonomous vehicles hit the roads legally, functioning wisely

as good as human drivers or even better, responding timely

to various unconstrained driving scenarios, and being fully

free of the control of human drivers, a.k.a. Level 5 wit

“driver oﬀ” in Fig. 1. Let the dream be realized, thousands

of new employment opportunities shall be created for those

physically impaired (Mobility), millions of lives shall be

rescued from motor vehicle-related crashes (Safety), and

billions of dollars shall be saved from disentangling traﬃc

accidents and treating the wounded (Economics). It is a

reality that there is still no universal consensus on where we

are now and how we shall go next. As illustrated in Fig. 1, We

are largely above level 2 but under or inﬁnitely close to level

3 by taking into account the following three social concerns:

(1) Safety and Security. Rules and regulations are still blank,

which shall be developed by governments to guarantee the

safety for an entire trip. (2) Law and Liability. How to deﬁne

the major responsibility and who will take that responsibility

shall be identiﬁed both ethically and clearly. (3) Acceptance.

Long-term eﬀorts shall be made to establish the conﬁdence

and trust for the whole society, before autonomous driv-

ing can be ﬁnally accepted. This survey paper will take a

structured glance at 3D object detection, one of the core

techniques for autonomous driving.

Perception in 3D space is a prerequisite in autonomous

driving. A fully understanding of what is happening right

∗Corresponding author

[email protected] (R. Qian); [email protected] (X. Lai);

[email protected] (X. Li)

https://doi.org/10.1016/j.patcog.2022.108796

Automation

LEVEL 0 LEVEL 1 LEVEL 2 LEVEL 3 LEVEL 4 LEVEL 5

Driver

Assistance High

Automation

Conditional

Automation Full

Automation

Partial

Automation

Advanced Driving Assistance System(ADAS) Autonomous Driving(AD)

No System

Feet Off

Hands Off

Eyes Off

Mind Off

Driver Off

Figure 1: Levels of autonomous driving proposed by SAE

(Society of Automotive Engineers) International [1]. Where

are we now?

now in front of the vehicle will facilitate downstream com-

ponents to react accordingly, which is exactly what 3D

object detection aims for. 3D object detection perceives and

describes what surrounds us via assigning a label, how its

shape looks like via drawing a bounding box, and how far

away it is from an ego vehicle via giving a coordinate.

Besides, 3D detection even provides a heading angle that

indicates orientation. It is these upstream information from

perception stack that enables downstream planning model to

make decisions.

arXiv:2106.10823v3 [cs.CV] 25 May 2022

R. Qian, X. Lai and X. Li Pattern Recognition 130 (2022) 108796

Figure 2: An overview of 3D object detection task from images and point clouds. Typical challenges: (a) Point Miss. When

LiDAR signals fail to return back from the surface of objects. (b) External Occlusion. When LiDAR signals are blocked by occluders

in the vicinity. (c) Self Occlusion. When one near side of the object blocks the other, which makes point clouds 2.5D in practice.

Note that bounding box prediction in (d) is much easier than that in (e) due to the sparsity of point clouds at long ranges.

1.1. Tasks and Challenges

Fig. 2presents an overview of 3D object detection task

from images and point clouds. The whole goal of 3D object

detection is to recognize the objects of interest by drawing

an oriented 3D bounding box and assigning a label. Consider

two commonly used 3D object detection modalities, i.e. im-

ages and point clouds, the key challenges of this vision task

are strongly tied to the way we use, the way we represent,

and the way we combine. With only images on hand, the

core challenge arises from the absence of depth information.

Although much progress has been made to recover depth

from images [2,3], it is still an ill-posed inverse problem.

The same object in diﬀerent 3D poses can result in dif-

ferent visual appearance in the image plane, which is not

conducive to the learning of representation [4]. Besides,

given that camera is passive sensor (see Sec. 2.2.1), images

are naturally vulnerable to illumination (e.g., nighttime) or

rainy weather conditions. With only point clouds on hand,

the key diﬃculty stems from the representation learning.

Point clouds are sparse by nature, e.g. in works [5,6], non-

empty voxels normally account for approximately 1%, 3%

in a typical range setup on Waymo Open [7] dataset and

KITTI [8] dataset respectively. Point clouds are irregular and

unordered by nature. Directly applying convolution operator

to point clouds will incur “desertion of shape information

and variance to point ordering” [9]. Besides, point clouds

are 2.5D in practice as they are point miss in (a), external-

occlusion in (b), self-occlusion in (c) [10], as indicated

in Fig. 2. Without the umbrella of convolutional neural

networks, one way is to present point clouds as voxels. The

dilemma is that scaling up voxel size will loss resolution and

consequently degrade localization accuracy while scaling

down its size will cubically increase the complexity and

memory footprints as the input resolution grows. Another

way is to present point clouds as point sets. Nevertheless,

around 80% of runtime is occupied by point retrieval, say,

R. Qian, X. Lai and X. Li Pattern Recognition 130 (2022) 108796

Rahman et al.[32]—TIP’2019 Arnold et al.[34]—TITS’2019 This paper

Guo et al.[33]—TPAMI’2020 Di et al.[35]—TITS’2021

•—Image based Methods

|—Point Cloud based Methods

| |—View based

| |—Voxel-Grid based

| |—Unstructured Point cloud based

|—Fusion based Methods

|—Early Fusion

|—Deep Fusion

|—Late Fusion

•—Image based Methods

|—Point Cloud based Methods

| |—Projection based

| |—Volumetric Convolutional

| |—Point-Nets

|—Fusion based Methods

|—Early Fusion

|—Deep Fusion

|—Late Fusion

•—Region Proposal based Methods

| |—Multi-view based

| |—Segmentation based

| |—Frustum based

| |—Other Methods

|—Single Shots Methods

| |—BEV based

| |—Discretization based

| |—Point based

|—Other Methods

•—What to Fuse

|—How to Fuse

|—When to Fuse

•—Image based Methods

| |—Result-lifting based

| |—Feature-lifting based

|—Point Cloud based Methods

| |—Voxel based

| |—Point based

| |—Point-voxel based

|—Fusion based Methods

|—Sequential Fusion

|—Parallel Fusion

|—Early Fusion

|—Deep Fusion

|—Late Fusion

+ 3D object detection + 3D object detection in the context of autonomous driving

+ 3D shape classification

+ 3D object detection and tracking

+ 3D point cloud segmentation

+ Multi-modal object detection

for autonomous driving

+ Targeted scope

•— Taxonomy

Figure 3: A summary showing how this survey diﬀers from existing ones on 3D object detection. Vertically, targeted scope

concisely determines where the boundary is located among their investigations. Horizontally, hierarchical branches of this paper

reveal a good continuity of existing eﬀorts [13,14] while adapt new branches (indicated in bold font) for dynamics, which

importantly contributes to the maturity of the taxonomy on 3D object detection.

ball query operation, in light of the poor memory locality

[11]. With both images and point clouds on hand, the tricky

obstacle often derives from semantic alignments. Images

and point clouds are two heterogeneous media, presented in

camera view and real 3D view, ﬁnding point-wise correspon-

dences between LiDAR points and image pixels results in

“feature blurring” [12].

1.2. Targeted Scope, Aims, and Organization

We review literature that are closely related to 3D object

detection in the context of autonomous driving. Depending

on what modality we use, existing eﬀorts are divided into the

following three subdivisions: (1) image based [15,16,17,

18,19,20,21,22], which is relatively inaccurate but several

orders of magnitude cheaper, and more interpretable under

the guidance of domain expertise and knowledge priors. (2)

point cloud based [23,24,25,10,6,26,27,28,29,5], which

has a relatively higher accuracy and lower latency but more

prohibitive deployment cost compared with its image based

counterparts. (3) multimodal fusion based [12,30,31,32,

33,34], which currently lags behind its point cloud based

counterparts but importantly provides a redundancy to fall-

back onto in case of a malfunction or outage.

Fig. 3presents a summary showing how this survey

diﬀers from existing ones on 3D object detection. 3D object

detection itself is an algorithmic problem, whereas involved

with autonomous driving makes it an application issue.

As of this main text in June 2021, we notice that rather

few investigations [13,35,14,36] have looked into this

application issue. Survey [13] focuses on 3D object detec-

tion, also taking into account indoor detection. Survey [36]

involves with autonomous driving but it concentrates on

multi-modal object detection. Survey [35] covers a series

of related subtopics of 3D point clouds (e.g., 3D shape

classiﬁcation, 3D object detection and tracking, and 3D point

cloud segmentation etc.). Note that survey [35] establishes

it taxonomy based on network architecture, which fails to

summarize the homogeneous properties among methods and

therefore results in overlapping in the subdivisions, e.g.

multi-view based and BEV based are the same in essence in

terms of learning strategies. As far as we know, only survey

[14] is closely relevant to this paper, but it fails to track the

latest datasets (e.g., nuScenes [37], and Waymo Open [7]),

algorithms (e.g., the best algorithm it reports on KITII [8]

3D detection benchmark is AVOD [31][IROS′18: 71.88] vs.

BtcDet [10][AAAI′22: 82.86] in this paper), and challenges,

which is not surprising as much progress has been made after

2018.

The aims of this paper are threefold. First, notice that

no recent literature exists to collect the growing knowledge

concerning 3D object detection, we aim to ﬁll this gap by

starting with several basic concepts, providing a glimpse

R. Qian, X. Lai and X. Li Pattern Recognition 130 (2022) 108796

of evolution of 3D object detection, together with compre-

hensive comparisons on publicly available datasets being

manifested, with pros and cons being judiciously presented.

Witnessing the absence of a universal consensus on taxon-

omy with respect to 3D object detection, our second goal

is to contribute to the maturity of the taxonomy. To this

end, we are cautiously favorable to the taxonomy based

on input modality, approved by existing literature [13,14].

The idea of grouping literature based on their network ar-

chitecture derives from 2D object detection, which fails to

summarize the homogeneous properties among methods and

therefore results in overlapping in the subdivisions, e.g.,

multi-view based and BEV based are the same representa-

tion in essence. Another drawback is that several plug-and-

play module components can be integrated into either region

proposal based (two-stage) or single shot based (one-stage).

For instance, VoTr [5] proposes voxel transformer which can

be easily incorporated into voxel based one stage or two stage

detectors. Notice that diverse fusion variants consistently

emerge among 3D object detection, existing taxonomy in

works [13,14] needs to be extended. For instance, works [38,

38] are sequential fusion methods (see Sec. 3.3.1), which are

not well suited to existing taxonomy. We therefore deﬁne two

new paradigms, i.e. sequential fusion and parallel fusion,

to adapt to underlying changes and further discuss which

category each method belongs to explicitly, while works

[13,14] not. Also, we analyze in deep to provide a more

ﬁne-grained taxonomy above and beyond the existing eﬀorts

[13,14] on image based methods, say, result-lifting based

and feature-lifting based depending on intermediate rep-

resentation existence. Finally, we open point-voxel branch

to classify newly proposed variants, e.g. PV-RCNN [39].

By constrast, survey [35] directly groups PV-RCNN into

“Other Methods”, leaving the problem unsolved. Our third

goal aims to present a case study on ﬁfteen selected models

among surveyed works, with regard to runtime analysis,

error analysis, and robustness analysis closely. We argue that

what mainly restricts the performance of detection is 3D

location error based on our ﬁndings. Taken together, this

survey is expected to foster more follow-up literature in 3D

vision community.

The rest of this paper is organized as follows. Section 2

introduces background associated with foundations, sensors,

datasets, and performance metrics. Section 3reviews 3D

object detection methods with their corresponding pros and

cons in the context of autonomous driving. Comprehen-

sive comparisons of the state-of-the-arts are summarized

in Section 4. We conclude this paper and identify future

research directions Section 5. We also set up a regularly

updated project page on: https://github.com/rui-qian/SoTA-

3D-Object-Detection.

2. BACKGROUND

2.1. Foundations

Let denote input data, say, LiDAR signals or monoc-

ular images, denote a detector parameterized by Θ. Con-

sider an (𝐹+1)-dimensional result subset with 𝑛predictions,

denoted by 𝐲1, ..., 𝐲𝑛⊆ℝ𝐹+1, we have

Θ𝑀𝐿𝐸 =arg max

𝐲1, ..., 𝐲𝑛,Θ,(1)

where 𝐲𝑖=𝑖, 𝑠𝑖denotes a certain prediction of detector

(; Θ)with bounding box 𝑖∈ℝ𝐹and its probabilistic

score 𝑠𝑖∈[0,1]. In the context of autonomous driving, 𝑖is

usually parameterized as portrayed in Fig. 4, which indicates

the volume of the object of interest and its position relative to

a reference coordinate system that can be one of the sensors

equipped on a ego-vehicle. We notice that attributes encoded

by (d) in Fig. 4are orthogonal and therefore result in a more

lower information redundancy compared with (a), (b), (c). In

this paper, we adopt the form of (d) as most previous works

[25,24,6,40,41] do.

Figure 4: Comparisons of the 3D bounding box parameteriza-

tion, between 8 corners proposed in [30], 4 corners with heights

proposed in [31], the axis aligned box encoding proposed in

[42], and the 7 parameters for an oriented 3D bounding box

adopted in [25,24,6,40,41].

2.2. Sensors

We human beings leverage visual and auditory systems

to perceive the real world when driving, so how about

autonomous vehicles? If they were to drive like a human,

then to identify what they see on the road constantly is the

way to go. To this end, sensors matter. It is sensors that

empower vehicles a series of abilities: obstacles perception,

automatic emergency braking, and collision avoidance etc.

In general, the most commonly used sensors can be divided

into two categories: passive sensors and active sensors [43].

The on going debate among industry experts is whether or

not to just equip vehicles with camera systems (no LiDAR),

or deploy LiDAR together with on-board camera systems.

Given that camera is considered to be one of the typical

representatives of passive sensors, and LiDAR is regarded as

a representative of active ones, we ﬁrst introduce the basic

concepts of passive sensors and active sensors, then take

camera and LiDAR as examples to discuss how they serve

the autonomous driving system, together with pros and cons

being manifested in Table 1.

2.2.1. Passive Sensors

Passive sensors are anticipated to receive natural emis-

sions emanating from both the Earth’s surface and its at-

mosphere. These natural emissions could be natural light or

infrared rays. Typically, a camera directly grabs a bunch of

color points from the optics in the lens and arranges them

R. Qian, X. Lai and X. Li Pattern Recognition 130 (2022) 108796

Table 1

Advantages and disadvantages of diﬀerent sensors.

Sensors Advantages Disadvantages Publications

Passive

Monocular Camera •cheap and available for multiple situations

•informative color and texture attributes

•no depth or range detecting feature

•susceptible to weather and light conditions

[44], [15], [16],

[45], [17], [41]

Stereo Camera

•depth information provided

•informative color and texture attributes

•high frame rate

•computationally expensive

•sensitive to weather and light conditions

•limited Field-of-View

[18], [46], [47]

Active

LiDAR

•accurate depth or range detecting feature

•less aﬀected by external illumination

•360◦Field-of-View

•high sparseness and irregularity by nature

•no color and texture attributes

•expensive and critical deployment

[48], [49], [39],

[40], [50], [51],

[52], [53], [25],

[54], [6], [55], [24],

[38], [56], [57]

Solid State

LiDAR

•more reliable compared with surround view

sensors

•cost decrease

•error increase when diﬀerent points of view

are merged in real time

•still under development and limited Field-of-

View

n.a.

Table 2

A summary of publicly available datasets for 3D object detection in the context of autonomous driving. *: Numbers in brackets

indicate classes evaluated in their oﬃcial benchmarks.

Dataset Year Size Diversity Modality Benchmark Cites

♯Train ♯Val ♯Test ♯Boxes ♯Frames ♯Scenes ♯Classes* Night Rain Stereo Temporal LiDAR

KITTI [8,58] 2012 7,418×1 - 7,518 ×1 200K 15K 50 8 (3) No No Yes Yes Yes Yes 5011

Argoverse [37] 2019 39,384×7 15,062×7 12,507 ×7 993K 44K 113 15 Yes Yes Yes Yes Yes Yes 88

Lyft L5 [59] 2019 22,690×6 - 27,460 ×6 1.3M 46K 366 9 No No No Yes Yes No -

H3D [60] 2019 8,873×3 5,170×3 13,678 ×3 1.1M 27K 160 8 No No No Yes Yes No 31

Appllo [61,62] 2019 - - - - 140K 103 27 Yes Yes Yes Yes Yes Yes 78

nuScenes [37] 2019 28,130×6 6,019×6 6,008 ×6 1.4M 40K 1,000 23 (10) Yes Yes No Yes Yes Yes 225

Waymo [7] 2020 122,200×5 30,407×5 40,077 ×5 112M 200K 1,150 4 (3) Yes Yes No Yes Yes Yes 31

into an image array that is often referred to as a digital signal

for scene understanding. Primarily, a monocular camera

lends itself well to informative color and texture attributes,

better visual recognition of text from road signs, and high

frame rate at a negligible cost etc. Whereas, it is lack of depth

information, which is crucial for accurate location estimation

in the real 3D world. To overcome this issue, stereo cameras

use matching algorithms to align correspondences in both

left and right images for depth recovery [2]. While cameras

have shown potentials as a reliable vision system, it is hardly

suﬃcient as a standalone one given that a camera is prone

to degrade its accuracy in cases where luminosity is low at

night-time or rainy weather conditions occur. Consequently

equipping autonomous vehicles with an auxiliary sensor,

say active counterparts, to fall-back onto is necessary, in

case that camera system should malfunction or disconnect

in hazardous weather conditions.

2.2.2. Active Sensors

Active sensors are expected to measure reﬂected signals

that are transmitted by the sensor, which are bounced by the

Earth’s surface or its atmosphere. Typically, Light Detection

And Ranging (LiDAR) is a point-and-shoot device with

three basic components of lens, lasers and detectors, which

spits out light pulses that will bounce oﬀ the surroundings

in the form of 3D points, referred to as “point clouds”. High

sparseness and irregularity by nature and the absence of

texture attributes are the primary characteristics of a point

cloud, which is well distinguished from image array. Since

we have already known how fast light travels, the distance

of obstacles could be determined without eﬀort. LiDAR

system emits thousands of pulses that spin around in a circle

per second, with a 360 degree view of surroundings for

the vehicles being provided. For example, Velodyne HDL-

64L produces 120 thousand points per frame with a 10 Hz

frame rate. Obviously, LiDAR is less aﬀected by external

illumination conditions (e.g., at night-time), given that it

emits light pulses by itself. Although LiDAR system has

been hailed for high accuracy and reliability compared with

camera system, it does not always hold true. Speciﬁcally,

wavelength stability of LiDAR is susceptible to variations

in temperature, while adverse weather (e.g., snow or fog) is

prone to result in poor signal-to-noise ratio in the LiDAR’s

detector. Another issue with LiDAR is the high cost of

deployment. A conservative estimate according to Velodyne,

so far, is about $75,0001[14]. In the foreseeable future of

LiDAR, how to decrease cost and how to increase resolution

and range are where the whole community is to march

ahead. As for the former, the advent of solid state LiDAR is

expected to address this problem of cost decrease, with the

help of several stationary lasers that emit light pulses along

a ﬁxed ﬁeld of view. As for the latter, the newly announced

Velodyne VLS-128 featuring 128 laser pulses and 300m

radius range has been on sale, which is going to signiﬁcantly

facilitate better recognition and tracking in terms of public

safety.

1http://www.woodsidecap.com

1 / 19 100%

Documents connexes

00 Template IJECE FEB 2019

Virtual Navigator Gynecology White Paper

Glossary

EfficientDet: Scalable Object Detection Research Paper

3-2019%20Application%20Form%20Part%201

Merci pour votre participation!

Faire une suggestion

Avez-vous trouvé des erreurs dans l'interface ou les textes ? Ou savez-vous comment améliorer l'interface utilisateur de StudyLib ? N'hésitez pas à envoyer vos suggestions. C'est très important pour nous!

GDPR Confidentialité Conditions d''utilisation

3D Object Detection for Autonomous Driving: A Survey

Documents connexes

Faire une suggestion

Produits

Assistance

Produits

Assistance

3D Object Detection for Autonomous Driving: A Survey

Documents connexes

Faire une suggestion

Produits

Assistance

Ajouter ce document à la (aux) collections

Ajouter ce document à enregistré

Suggérez-nous comment améliorer StudyLib