M2R IGI
Course « Méthodes avancées en Image et Vidéo »

Machine Learning 1

Christian Wolf
INSA-Lyon
LIRIS
Méthodes avancées en image et vidéo: course schedule (lecture, lecturer, campus):

• Machine Learning 1 (introduction, classification), C. Wolf, Doua
• Machine Learning 2 (features, deep learning, graphical models), C. Wolf, Doua
• Content-based indexing: evaluation of image-based techniques and systems, M. Ardabilian, ECL
• Multi-image acquisition and applications, M. Ardabilian, ECL
• Super-resolution approaches and applications, M. Ardabilian, ECL
• Fusion methods in imaging, M. Ardabilian, ECL
• Object detection, analysis and recognition, M. Ardabilian, ECL
• Variational methods, J. Mille, Doua
• Facial biometrics, M. Ardabilian, ECL
• Sound within multimedia data: coding, compression and analysis, E. Dellandréa, ECL
• Object tracking, S. Duffner, Doua
• An example of audio analysis: emotion recognition from speech and music signals, E. Dellandréa, ECL
• Image and video indexing, S. Bres, Doua
• Digital document analysis 1, V. Eglin, Doua
• Digital document analysis 2, F. LeBourgeois, Doua
Contents

Introduction
  – General principles
  – Fitting and generalization, model complexity
  – Empirical risk minimization

Supervised classification
  – kNN (k nearest neighbours)
  – Linear models for classification
  – Neural networks
  – Decision trees and random forests
  – SVM (Support Vector Machines)

Feature extraction
  – Manual
  – PCA (Principal Component Analysis)
  – Convolutional neural networks ("deep learning")

[The slide reproduces the opening of §1.1 of C. Bishop, Pattern Recognition and Machine Learning (2006), which introduces the polynomial curve-fitting running example: a real-valued input variable x is observed and used to predict a real-valued target variable t; the synthetic data are generated from the function sin(2πx) with random noise added to the target values, giving a training set of N observations of x together with the corresponding values of t.]
Some sources

• Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer Verlag, 2006. Microsoft Research, Cambridge, UK.
• Kevin P. Murphy, "Machine Learning", MIT Press, 2013. Google; formerly University of British Columbia.
Pattern recognition

Supervised classification: object recognition from images (example image labels on the slide: Luc, box, Lego):
• Digits, letters, logos
• Biometrics: faces, fingerprints, iris, …
• Object classes (cars, trees, boats, …)
• Specific objects (a particular car, a particular mug, …)
Prediction

In general, we would like to predict a value t from an observation x.

• If t is continuous: regression.
• If t is discrete: classification.

The parameters of the predictor are learned from data.
Learning and generalization

Learning to classify data amounts to learning a decision function: the boundary between the classes.

The complexity of a decision function depends on how the class labels are grouped in the feature space.
Fitting and generalization

• The data are generated by a function.
• Goal: assuming the function is unknown, predict t from x.
Figure 1.2 Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve.

[C. Bishop, Pattern recognition and Machine learning, 2006]
"Fitting" a polynomial of order M

For the moment we shall proceed rather informally and consider a simple approach based on curve fitting. In particular, we shall fit the data using a polynomial function of the form

    y(x, w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = Σ_{j=0}^{M} w_j x^j        (1.1)

where M is the order of the polynomial, and x^j denotes x raised to the power of j. The polynomial coefficients w_0, …, w_M are collectively denoted by the vector w.
Fitting and generalization

Note that, although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w. Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called linear models; they will be discussed extensively in Chapters 3 and 4.

Least-squares (error) criterion

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w, and the training set data points. One simple choice of error function, which is widely used, is the sum of the squares of the errors between the predictions y(x_n, w) for each data point x_n and the corresponding target values t_n, so that we minimize

    E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2        (1.2)

where the factor of 1/2 is included for later convenience. For the moment we simply note that E(w) is a nonnegative quantity that would be zero if, and only if, the function y(x, w) were to pass exactly through each training data point. The geometrical interpretation is illustrated in Figure 1.3: the error function (1.2) corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w).

The derivative of E(w) is linear in w, so setting it to zero yields a direct closed-form solution: we can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible.

[C. Bishop, Pattern recognition and Machine learning, 2006]
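The closed-form least-squares fit described above can be sketched in a few lines of numpy. This is an illustrative sketch, not code from the slides; the function names and the toy data (Bishop-style sin(2πx) plus noise) are my own choices.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of an order-M polynomial, minimizing E(w) of (1.2).

    Builds the design matrix A with A[n, j] = x_n**j and solves the
    least-squares problem min_w ||A w - t||^2 in closed form.
    """
    A = np.vander(x, M + 1, increasing=True)    # column j holds x**j
    w, *_ = np.linalg.lstsq(A, t, rcond=None)   # minimizes ||A w - t||^2
    return w

def predict(w, x):
    """Evaluate y(x, w) = sum_j w_j x^j."""
    return np.vander(x, len(w), increasing=True) @ w

# Toy data in the spirit of Bishop's example: t = sin(2*pi*x) + noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(10)

w = fit_polynomial(x, t, M=3)
```

On noiseless polynomial data the fit recovers the generating coefficients exactly, which is a convenient sanity check.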
Model selection

Which order M for the polynomial?

• M = 0 and M = 1: underfitting (the polynomial is too rigid to follow the data).
• M = 3: a good fit.
• M = 9: overfitting.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Figure 1.4 Plots of polynomials having various orders M , shown as red curves, fitted to the data set shown in
Figure 1.2.
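The behaviour in the figure can be reproduced numerically by sweeping the order M and comparing the root-mean-square error on a training set and on a held-out test set. A minimal sketch, assuming Bishop-style synthetic sin(2πx) data; the helper names are illustrative:

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N) for the polynomial defined by w, cf. (1.3)."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

rng = np.random.default_rng(1)
def make_data(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in range(10):
    A = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t_train, rcond=None)
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```

The training error decreases monotonically with M and reaches (numerically) zero at M = 9, where the 10 coefficients interpolate the 10 training points, while the test error rises again for large M.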
Model selection

Split the data into two parts:
• Training set
• Test set

Root-mean-square error (RMS):

    E_RMS = sqrt( 2 E(w*) / N )        (1.3)

Figure 1.5 Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of M.

Division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t. The test set error is a measure of how well we are doing in predicting the values of t for new observations of x. Small values of M give relatively large values of the test set error, which can be attributed to the fact that the corresponding polynomials are rather inflexible and incapable of capturing the oscillations of the underlying function. For M = 9, the training set error goes to zero, as we might expect, because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Cross-validation

The split of the data into two parts is changed iteratively; the performance measure is the average over all iterations.
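The procedure just described can be sketched as follows, here applied to the polynomial fit from earlier in the section. The function name and fold construction are illustrative choices, not from the slides:

```python
import numpy as np

def cross_val_rms(x, t, M, S=4, seed=0):
    """S-fold cross-validation of an order-M polynomial fit.

    The data are partitioned into S groups; each group is held out once
    while the model is trained on the remaining S-1 groups, and the
    held-out RMS errors are averaged.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, S)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)
        A = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(A, t[train], rcond=None)
        y = np.vander(x[held_out], len(w), increasing=True) @ w
        scores.append(np.sqrt(np.mean((y - t[held_out]) ** 2)))
    return float(np.mean(scores))
```

Setting S = N gives the leave-one-out variant mentioned below.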
Figure 1.18 The technique of S-fold cross-validation, illustrated here for the case of S = 4, involves taking the available data and partitioning it into S groups (in the simplest case these are of equal size). S − 1 of the groups are used to train a set of models that are then evaluated on the remaining group. This procedure is then repeated for all S possible choices of the held-out group, and the performance scores from the S runs are then averaged. When data is particularly scarce, it may be appropriate to use S = N, where N is the total number of data points, which gives the leave-one-out technique.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Big Data!

The overfitting problem diminishes as the size of the training set increases.

Figure 1.6 Plots of the solutions obtained by minimizing the sum-of-squares error function using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem.

With few data points, the M = 9 polynomial matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Regularization

Add an extra term to the error function: a restriction on the model parameters. One technique that is often used to control the over-fitting phenomenon is regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

    E~(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2 + (λ/2) ||w||^2        (1.4)

where ||w||^2 ≡ w^T w = w_0^2 + w_1^2 + … + w_M^2, and λ is the regularization parameter. (w_0 is often excluded from the regularizer.)
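The regularized error (1.4) still has a closed-form minimizer. A minimal sketch (for simplicity it regularizes w_0 too, unlike the convention just mentioned; names and data are illustrative):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize the regularized error (1.4) in closed form.

    The minimizer of (1/2)||A w - t||^2 + (lam/2)||w||^2 is
    w = (A^T A + lam I)^{-1} A^T t.
    """
    A = np.vander(x, M + 1, increasing=True)
    I = np.eye(M + 1)
    return np.linalg.solve(A.T @ A + lam * I, A.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(10)

w_small = fit_ridge(x, t, M=9, lam=1e-8)  # almost unregularized: huge coefficients
w_large = fit_ridge(x, t, M=9, lam=1.0)   # strong shrinkage: small coefficients
```

Increasing λ shrinks the coefficient vector, which is why such techniques are called shrinkage methods.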
The coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w_0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient. Again, the error function in (1.4) can be minimized exactly in closed form. Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay.

Figure 1.7 Plots of M = 9 polynomials fitted to the data set shown in Figure 1.2 using the regularized error function (1.4), for two values of the regularization parameter λ corresponding to ln λ = −18 and ln λ = 0. The case of no regularizer, i.e., λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.4.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Supervised classification

Pipeline: a labelled training set is fed to a learning algorithm, which produces a model; for a new input, the classifier applies the model to predict a class.
Supervised classification

• kNN (k nearest neighbours)
• Linear generative (Bayesian) classifiers
• Linear discriminative classifiers
• Neural networks
• Random forests
• SVM (Support Vector Machines)

Figure 1.27 Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.

However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint distribution p(x, Ck) when in fact we only really need the posterior probabilities p(Ck|x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in exploring the relative merits of generative and discriminative approaches to machine learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006).

An even simpler approach is (c), in which we use the training data to find a discriminant function f(x) that maps each x directly onto a class label, thereby combining the inference and decision stages into a single learning problem. In the example of Figure 1.27, this would correspond to finding the value of x shown by the vertical green line, because this is the decision boundary giving the minimum probability of misclassification. With option (c), however, we no longer have access to the posterior probabilities p(Ck|x). There are many powerful reasons for wanting to compute the posterior probabilities, even if we subsequently use them to make decisions. These include minimizing risk: consider a problem in which the elements of the loss matrix are subjected to revision from time to time (such as might occur in a financial application).

Figure (Bishop) Illustration of the ν-SVM for regression applied to the sinusoidal synthetic data set using Gaussian kernels. The predicted regression curve is shown by the red line, and the ϵ-insensitive tube corresponds to the shaded region. The data points are shown in green, and those with support vectors are indicated by blue circles.

Figure 4 Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue) and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a particular input. Randomized decision trees and forests have proven fast and effective multi-class classifiers for many tasks.

[C. Bishop, Pattern recognition and Machine learning, 2006]
"k nearest neighbours"

Probably the simplest classifier. The training set (vectors and their labels) is stored. For a new vector, the closest stored data point is looked up, and its label serves as the prediction.

Optimal in the limit of an infinite number of training samples.
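The lookup described above can be sketched directly. A minimal implementation, with illustrative toy data (not the digit example from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """k-nearest-neighbour prediction: majority label among the k stored
    training vectors closest (in Euclidean distance) to x_new."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Tiny 2D example with two well-separated classes (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
```

Note that there is no training step: the whole "model" is the stored data, and all the work happens at prediction time.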
Supervised classification: generative (Bayesian) classifiers
Probabilistic modelling

Suppose a two-class classification problem, with classes C1 and C2, on 1D data modelled by a random variable x (the observation).

At our disposal: the likelihood of the data, p(x|Ck). We want the posterior probability, p(Ck|x).

We might expect probabilities to play a role in making decisions: when we observe an x for a new case, our goal is to decide which of the two classes to assign to it, and we are therefore interested in the probabilities of the two classes given the observation, p(Ck|x). To obtain this posterior probability, the model must be inverted using Bayes' rule.

[C. Bishop, Pattern recognition and Machine learning, 2006]
Bayesian modelling

    p(Ck|x) = p(x|Ck) p(Ck) / p(x)

• p(x|Ck) is the likelihood.
• p(Ck) is the prior probability.
• p(x) can be ignored for the maximization.

The prior probability models our knowledge about the outcome, independently of the observations x. In simple cases it is tabulated.

We can interpret p(Ck) as the prior probability for the class Ck, and p(Ck|x) as the corresponding posterior probability. Thus p(C1) represents, say, the probability that a person has cancer before we take the X-ray measurement, and p(C1|x) is the corresponding probability, revised using Bayes' theorem in the light of the information contained in the X-ray. If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability.

[C. Bishop, Pattern recognition and Machine learning, 2006]
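Bayes' rule as used above can be written out directly. A minimal sketch with made-up Gaussian class-conditionals and priors (the numbers are purely illustrative):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, priors, likelihoods):
    """Bayes' rule: p(Ck|x) = p(x|Ck) p(Ck) / p(x), where the evidence
    p(x) is obtained by summing the numerator over all classes."""
    joint = [lik(x) * pr for lik, pr in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Illustrative two-class model.
priors = [0.5, 0.5]
likelihoods = [lambda x: gaussian_pdf(x, 0.0, 1.0),
               lambda x: gaussian_pdf(x, 2.0, 1.0)]

p = posterior(0.2, priors, likelihoods)
label = max(range(2), key=lambda k: p[k])   # pick the higher posterior
```

Since the evidence p(x) is the same for all classes, the argmax over classes is unchanged if it is dropped, which is why it "can be ignored for the maximization".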
Generating new data

This family of models is called "generative models" because it makes it possible to generate new data respecting the model:

1. Sample a class according to the prior distribution p(Ck).
2. For the given class, sample the observation according to the likelihood p(x|Ck).

[C. Bishop, Pattern recognition and Machine learning, 2006]
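The two-step sampling procedure above (sometimes called ancestral sampling) can be sketched as follows; the priors and Gaussian class-conditionals are made-up illustrative values:

```python
import random

def sample(priors, cond_samplers, n, seed=0):
    """Ancestral sampling from a generative model:
    1. draw a class k according to the prior p(Ck),
    2. draw an observation x according to the likelihood p(x|Ck)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        k = rng.choices(range(len(priors)), weights=priors)[0]
        data.append((k, cond_samplers[k](rng)))
    return data

# Illustrative model: two Gaussian class-conditionals.
priors = [0.3, 0.7]
cond = [lambda rng: rng.gauss(0.0, 1.0),
        lambda rng: rng.gauss(2.0, 1.0)]
data = sample(priors, cond, 1000)
```

The empirical class frequencies of the generated data approach the prior as the sample size grows.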
Probability of error

The probability of committing an error of a given type can be computed directly.

Figure 1.24 Schematic illustration of the joint probabilities p(x, Ck) for each of two classes plotted against x, together with the decision boundary x = x̂. Values of x ≥ x̂ are classified as class C2 and hence belong to decision region R2, whereas points x < x̂ are classified as C1 and belong to R1. Errors arise from the blue, green, and red regions, so that for x < x̂ the errors are due to points from class C2 being misclassified as C1 (represented by the sum of the red and green regions), and conversely for points in the region x ≥ x̂ the errors are due to points from class C1 being misclassified as C2 (represented by the blue region). As we vary the location x̂ of the decision boundary, the combined area of the blue and green regions stays fixed while the red region changes: by moving the decision threshold, the red zone is compressible.

[C. Bishop, Pattern recognition and Machine learning, 2006]
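For Gaussian class-conditionals this error probability has a closed form via the normal CDF, so the effect of moving the threshold x̂ can be checked numerically. A sketch with illustrative parameters (equal priors, unit-variance Gaussians at 0 and 2, so the optimal threshold is midway at x̂ = 1):

```python
import math

def gauss_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_error(threshold, mu1=0.0, mu2=2.0, sigma=1.0, prior1=0.5):
    """Misclassification probability for two Gaussian classes with a
    decision threshold x^ (points above it go to class 2):
    p(error) = p(C1) P(x >= x^ | C1) + p(C2) P(x < x^ | C2)."""
    e1 = 1.0 - gauss_cdf((threshold - mu1) / sigma)   # class 1 sent to R2
    e2 = gauss_cdf((threshold - mu2) / sigma)         # class 2 sent to R1
    return prior1 * e1 + (1.0 - prior1) * e2
```

With equal priors the error is minimized where the posteriors cross, i.e. at the midpoint between the means.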
Supervised classification: linear discriminative classifiers
Linear model for classification

A decision function is modelled by a parametric function. We restrict attention to linear discriminants, namely those for which the decision surfaces are hyperplanes. To simplify the discussion, we first consider the case of two classes and then investigate the extension to K > 2 classes.

Two classes. The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector, so that

    y(x) = w^T x + w_0        (4.4)

where w is called a weight vector and w_0 is a bias (not to be confused with bias in the statistical sense). The negative of the bias is sometimes called a threshold. An input vector x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise. The corresponding decision boundary is defined by the relation y(x) = 0, which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space.

Consider two points x_A and x_B both of which lie on the decision surface. Because y(x_A) = y(x_B) = 0, we have w^T (x_A − x_B) = 0, and hence the vector w is orthogonal to every vector lying within the decision surface; w therefore determines the orientation of the decision surface. Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by

    w^T x / ||w|| = −w_0 / ||w||.        (4.5)

More generally, the value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface:

    r = y(x) / ||w||.        (4.7)

This result is illustrated in Figure 4.1.
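The two geometric facts just derived (w is orthogonal to the decision surface, and r = y(x)/||w|| is a signed distance) can be verified numerically. A sketch with illustrative values of w and w_0:

```python
import numpy as np

w = np.array([3.0, 4.0])   # weight vector (illustrative values)
w0 = -5.0                  # bias

def y(x):
    """Linear discriminant y(x) = w^T x + w0, cf. (4.4)."""
    return w @ x + w0

def signed_distance(x):
    """Signed perpendicular distance from the decision surface, cf. (4.7)."""
    return y(x) / np.linalg.norm(w)

# Two points on the decision surface y(x) = 0:
xA = np.array([3.0, -1.0])   # y(xA) = 9 - 4 - 5 = 0
xB = np.array([-1.0, 2.0])   # y(xB) = -3 + 8 - 5 = 0
```

Here ||w|| = 5, so the signed distance of the origin is y(0)/||w|| = w0/||w|| = −1, matching (4.5) up to sign.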
As with the linear regression models in Chapter 3, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy 'input' x_0 = 1 and then define w~ = (w_0, w) and x~ = (x_0, x), so that

    y(x) = w~^T x~.        (4.8)

The constant "1" appended to the inputs thus adds the bias to the model: w_0 is absorbed into the other parameters. In this case, the decision surfaces are D-dimensional hyperplanes passing through the origin of the (D + 1)-dimensional expanded input space.

Multiple classes. Now consider the extension of linear discriminants to K > 2 classes. We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions, but this leads to some serious difficulties (Duda and Hart, 1973). Consider the use of K − 1 classifiers, each of which separates one class from the rest, or alternatively one classifier for every possible pair of classes, known as a one-versus-one classifier, in which each point is classified according to a majority vote amongst the discriminant functions. Both constructions run into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.
Linear models (2 classes)

One justification for using least squares to train such a model on a target vector t is that it approximates the conditional expectation E[t|x] of the target values given the input vector; for a binary coding scheme, this conditional expectation is given by the vector of posterior class probabilities. Unfortunately, these probabilities are typically approximated rather poorly; indeed, the approximations can have values outside the range (0, 1), due to the limited flexibility of a linear model.

The difficulties with combined two-class discriminants can be avoided by considering a single K-class discriminant comprising several parametric functions, one per class. Each class Ck is described by its own linear model, so that

    y_k(x) = w_k^T x + w_k0        (4.13)

where k = 1, …, K. In vector notation, integrating the bias, we can group these together so that

    y(x) = W~^T x~.        (4.14)
Modèles linéaires (K classes)!
(wk − wj )T x + (wk0 − wj 0 ) = 0.
(4.10)
This has the same form as the decision boundary for the two-class case discussed in
Interprétation
Section 4.1.1, and so analogous geometrical properties
apply. : !
The decision regions of such a discriminant! are always singly connected and
convex. To see this, consider two points xA and x! B both of which lie inside decision
! that lies on the line connecting
region Rk , as illustrated in Figure 4.3. Any point x
« The winner takes it all »!
xA and xB can be expressed in the form
!
x
! = λxA + (1 − λ)xB
(4.11)
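The « winner takes it all » rule above can be sketched in a few lines of Python (a minimal illustration; the weights below are made up, not values from the course):

```python
# Minimal sketch of a K-class linear discriminant:
# y_k(x) = w_k . x + w_k0, and x is assigned to the class with the
# largest output ("the winner takes it all"). Weights are illustrative.

def discriminant(x, W, w0):
    """Return the K outputs y_k(x) for the input vector x."""
    return [sum(wi * xi for wi, xi in zip(wk, x)) + b
            for wk, b in zip(W, w0)]

def classify(x, W, w0):
    """Assign x to the class k maximising y_k(x)."""
    y = discriminant(x, W, w0)
    return max(range(len(y)), key=lambda k: y[k])

W  = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # K = 3 classes, D = 2
w0 = [0.0, 0.0, 0.5]

print(classify([2.0, 0.1], W, w0))   # -> 0
```

Note that every decision boundary implied by this rule is linear, which is what makes the decision regions convex.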
A simple problem (1D, 3 classes)!
Linear classifier: 1D inputs, 3 classes!
Input:!
Parameters:!
Output of one class:!
Output of all classes:!
A simple problem (1D, 3 classes)!
Training data!
"True" boundaries between classes!
Estimated boundaries!
The nonlinear case!
Preprocessing: nonlinear transformation of the data!
(to be chosen beforehand according to the application):!
Figure 4.12 Illustration of the role of nonlinear basis functions in linear classification models. The left plot
shows the original input space (x1 , x2 ) together with data points from two classes labelled red and blue. Two
‘Gaussian’ basis functions φ1 (x) and φ2 (x) are defined in this space with centres shown by the green crosses
and with contours shown by the green circles. The right-hand plot shows the corresponding feature space
(φ1 , φ2 ) together with the linear decision boundary given by a logistic regression model of the form
discussed in Section 4.3.2. This corresponds to a nonlinear decision boundary in the original input space,
shown by the black curve in the left-hand plot.
Gaussian basis functions!
[C. Bishop, Pattern recognition and Machine learning, 2006]!
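A transformation like the one in Figure 4.12 can be sketched as follows (the centres and width are illustrative choices, not the figure's actual values):

```python
import math

# Sketch of the nonlinear preprocessing of Figure 4.12: each input
# x = (x1, x2) is mapped to Gaussian features
# phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)).
# The centres mu_j and the width s are illustrative assumptions.

def gaussian_basis(x, centres, s=0.5):
    return [math.exp(-sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
                     / (2.0 * s * s))
            for mu in centres]

centres = [(-0.5, 0.0), (0.5, 0.0)]        # the two "green crosses"
phi = gaussian_basis((-0.5, 0.0), centres)
print(phi)   # first feature equals 1.0 exactly at the first centre
```

A linear decision boundary in the feature space (φ1, φ2) then corresponds to a nonlinear boundary in the original space (x1, x2).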
4.3.2 Logistic regression

We begin our treatment of generalized linear models by considering the problem of two-class classification. In our discussion of generative approaches in Section 4.2, we saw that under rather general assumptions, the posterior probability of class C1 can be written as a logistic sigmoid acting on a linear function of the feature vector φ, so that

  p(C1|φ) = y(φ) = σ(w^T φ)    (4.87)

with p(C2|φ) = 1 − p(C1|φ). Here σ(·) is the logistic sigmoid function defined by

  σ(η) = 1 / (1 + exp(−η)).    (2.199)

In the terminology of statistics, this model is known as logistic regression, although it should be emphasized that this is a model for classification rather than regression.

A linear model (on transformed inputs) + a nonlinearity at the output!
Direct modelling of the posterior probability!
σ is the logistic ("sigmoid") function, ensuring that the outputs lie between 0 and 1!

For an M-dimensional feature space φ, this model has M adjustable parameters. By contrast, if we had fitted Gaussian class-conditional densities using maximum likelihood, we would have used 2M parameters for the means and M(M + 1)/2 parameters for the (shared) covariance matrix. Together with the class prior p(C1), this gives a total of M(M + 5)/2 + 1 parameters, which grows quadratically with M, in contrast to the linear dependence on M of the number of parameters in logistic regression. For large values of M, there is a clear advantage in working with the logistic regression model directly.

We now use maximum likelihood to determine the parameters of the logistic regression model. To do this, we shall make use of the derivative of the logistic sigmoid function, which can conveniently be expressed in terms of the sigmoid function itself.
Logistic regression (2 classes)!
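As a minimal sketch of the two-class model (4.87) (the weight vector below is illustrative, not a trained model):

```python
import math

# Sketch of two-class logistic regression (Eq. 4.87):
# p(C1|phi) = sigma(w^T phi), with the logistic sigmoid of Eq. 2.199.
# The bias is absorbed by clamping the first feature to 1.

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def p_c1(phi, w):
    return sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))

w   = [2.0, -1.0, 0.5]    # illustrative weights (first entry = bias)
phi = [1.0, 3.0, 0.0]     # phi[0] clamped to 1 for the bias
p1  = p_c1(phi, w)
print(p1, 1.0 - p1)       # p(C1|phi) and p(C2|phi) sum to one
```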
Logistic regression (K classes)!
4.3.4 Multiclass logistic regression

In our discussion of generative models for multiclass classification, we have seen that for a large class of distributions, the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that

  p(Ck|φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j)    (4.104)

where the 'activations' a_k are given by

  a_k = w_k^T φ.    (4.105)

A similar extension to the linear case!
The "softmax" ensures that the outputs are probabilities!
The activations are linear!
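A minimal sketch of (4.104)-(4.105) (subtracting the largest activation before exponentiating is a standard numerical stabilization not discussed in the text; it leaves the result unchanged):

```python
import math

# Sketch of the softmax posterior (Eqs. 4.104-4.105):
# a_k = w_k^T phi, p(C_k|phi) = exp(a_k) / sum_j exp(a_j).

def softmax(a):
    m = max(a)                              # numerical stabilization
    exps = [math.exp(ak - m) for ak in a]
    z = sum(exps)
    return [e / z for e in exps]

def posterior(phi, W):
    a = [sum(wi * pi for wi, pi in zip(wk, phi)) for wk in W]
    return softmax(a)

W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # illustrative weights
probs = posterior([1.0, 2.0], W)
print(probs)    # K probabilities summing to 1
```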
There we used maximum likelihood to determine separately the class-conditional densities and the class priors and then found the corresponding posterior probabilities using Bayes' theorem, thereby implicitly determining the parameters {w_k}. Here we consider the use of maximum likelihood to determine the parameters {w_k} of this model directly. To do this, we will require the derivatives of y_k with respect to all of the activations a_j. These are given by

  ∂y_k/∂a_j = y_k (I_kj − y_j)    (4.106)

where I_kj are the elements of the identity matrix.

Next we write down the likelihood function. This is most easily done using the 1-of-K coding scheme in which the target vector t_n for a feature vector φ_n belonging to class Ck is a binary vector with all elements zero except for element k, which equals one.
Logistic regression: training!

A set of training data is assumed to be known:!
The inputs, transformed by the basis functions!
The outputs are known, coded in "1-of-K"!
("one-hot encoded"):!
"True" class of sample n!
Objective: learn the parameters according to an optimality criterion!
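The 1-of-K coding itself can be sketched in two lines (a tiny illustration with made-up labels):

```python
# Sketch of the 1-of-K ("one-hot") target coding used for training:
# the true class of sample n becomes a binary vector t_n with a single 1.

def one_hot(k, K):
    return [1 if j == k else 0 for j in range(K)]

labels = [0, 2, 1]                    # illustrative "true" classes
T = [one_hot(c, 3) for c in labels]   # N x K target matrix
print(T)   # -> [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```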
The likelihood of the data is the probability of obtaining the N observations, given the parameters and the inputs:!

  p(T|w_1, . . . , w_K) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(Ck|φ_n)^{t_nk} = ∏_{n=1}^{N} ∏_{k=1}^{K} y_nk^{t_nk}    (4.107)

where y_nk = y_k(φ_n), and T is an N × K matrix of target variables with elements t_nk.

To estimate the parameters, we will minimize an error function (the negative logarithm of the likelihood):!

Taking the negative logarithm then gives

  E(w_1, . . . , w_K) = − ln p(T|w_1, . . . , w_K) = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_nk ln y_nk    (4.108)

which is known as the cross-entropy error function for the multiclass classification problem.

This cost function is known under the name "cross-entropy loss"!
It can be minimized by computing its gradient:!

We now take the gradient of the error function with respect to one of the parameter vectors w_j. Making use of the result (4.106) for the derivatives of the softmax function, we obtain (Exercise 4.18)

  ∇_{w_j} E(w_1, . . . , w_K) = Σ_{n=1}^{N} (y_nj − t_nj) φ_n    (4.109)

Parameter learning!
[C. Bishop, Pattern recognition and Machine learning, 2006]!
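The cross-entropy error (4.108) and its gradient (4.109) can be sketched directly (the tiny predictions, targets and features below are made up for illustration):

```python
import math

# Sketch of the multiclass cross-entropy error (Eq. 4.108) and its
# gradient w.r.t. one parameter vector w_j (Eq. 4.109), for given
# predicted posteriors y_nk and one-hot targets t_nk.

def cross_entropy(Y, T):
    return -sum(t * math.log(y)
                for yn, tn in zip(Y, T)
                for y, t in zip(yn, tn) if t)

def grad_wj(Y, T, Phi, j):
    """Gradient of E w.r.t. w_j: sum_n (y_nj - t_nj) * phi_n."""
    g = [0.0] * len(Phi[0])
    for yn, tn, phin in zip(Y, T, Phi):
        err = yn[j] - tn[j]
        for i, p in enumerate(phin):
            g[i] += err * p
    return g

Y   = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # predicted posteriors
T   = [[1, 0, 0], [0, 1, 0]]               # one-hot targets
Phi = [[1.0, 2.0], [1.0, -1.0]]            # features (bias + 1 input)
print(cross_entropy(Y, T), grad_wj(Y, T, Phi, 0))
```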
Application: detection of simple objects!
A window of size 20×20 is slid over the image.!
The pixels of a window are given as inputs!
[A. Mihoub, C. Wolf, G. Bailly, 2014]!
Supervised classification!

kNN (k nearest neighbours)!
Generative (Bayesian) linear classifiers!

Figure 1.27 Example of the class-conditional densities for two classes having a single input variable x (left plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue on the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.
Linear discriminative classifiers!
be of low accuracy, which is known as outlier detection or novelty detection (Bishop,
1994; Tarassenko, 1995).
However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint
distribution p(x, Ck ) when in fact we only really need the posterior probabilities
p(Ck |x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in
exploring the relative merits of generative and discriminative approaches to machine
learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006).
An even simpler approach is (c) in which we use the training data to find a
discriminant function f (x) that maps each x directly onto a class label, thereby
combining the inference and decision stages into a single learning problem. In the
example of Figure 1.27, this would correspond to finding the value of x shown by
the vertical green line, because this is the decision boundary giving the minimum
probability of misclassification.
With option (c), however, we no longer have access to the posterior probabilities
p(Ck |x). There are many powerful reasons for wanting to compute the posterior
probabilities, even if we subsequently use them to make decisions. These include:
Neural networks!

Minimizing risk. Consider a problem in which the elements of the loss matrix are subjected to revision from time to time (such as might occur in a financial application).

Random forests!

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue) and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a particular input.

Illustration of the ν-SVM for regression applied to the sinusoidal synthetic data set using Gaussian kernels. The predicted regression curve is shown by the red line, and the ϵ-insensitive tube corresponds to the shaded region. Data points are shown in green, and those with support vectors are indicated by blue circles.

SVM (Support Vector Machines)!

3.3. Randomized decision forests

Randomized decision trees and forests [35, 30, 2, 8] have proven fast and effective multi-class classifiers for many
Neural networks!

5.1 Feed-forward Network Functions

Logistic regression: the basis functions are limited, and must be chosen manually for each application.!
What if we learned them?!

The linear models for regression and classification discussed in Chapters 3 and 4, respectively, are based on linear combinations of fixed nonlinear basis functions φ_j(x) and take the form

  y(x, w) = f( Σ_{j=1}^{M} w_j φ_j(x) )    (5.1)

where f(·) is a nonlinear activation function in the case of classification and is the identity in the case of regression. Our goal is to extend this model by making the basis functions φ_j(x) depend on parameters and then to allow these parameters to be adjusted, along with the coefficients {w_j}, during training. There are, of course, many ways to construct parametric nonlinear basis functions. Neural networks use basis functions that follow the same form as (5.1), so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients in the linear combination are adaptive parameters.

If the basis functions have the same form as the output functions:!

This leads to the basic neural network model, which can be described as a series of functional transformations. First we construct M linear combinations of the input variables x_1, . . . , x_D in the form

  a_j = Σ_{i=1}^{D} w_ji^(1) x_i + w_j0^(1)    (5.2)

f and h are nonlinear functions ("activation functions"). Generally sigmoid, tanh or softmax.!

[C. Bishop, Pattern recognition and Machine learning, 2006]!

We can absorb the biases into the set of weight parameters by defining an additional input variable x_0 whose value is clamped at x_0 = 1, so that (5.2) takes the form

  a_j = Σ_{i=0}^{D} w_ji^(1) x_i.    (5.8)

Neural networks!

This gives a two-layer neural network (one hidden layer, one output layer):!
"Multi-layer perceptron" (MLP)!

We can similarly absorb the second-layer biases into the second-layer weights, so that the overall network function becomes

  y_k(x, w) = σ( Σ_{j=0}^{M} w_kj^(2) h( Σ_{i=0}^{D} w_ji^(1) x_i ) ).    (5.9)

(The biases have been absorbed)!

Input layer!  Hidden layer!  Output layer!

As can be seen from Figure 5.1, the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the multilayer perceptron, or MLP. A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transforma-
Some remarks!

• The number of units in the hidden layer is adaptable.!
• Several hidden layers are possible. We will treat the feed-forward type: no cycles in the connections.!
• The output functions are differentiable with respect to the network parameters.!
• If the activation functions are linear, then an equivalent network without a hidden layer can be found.!
• Neural networks can approximate any function to arbitrary precision if the number of units is sufficient.!
An example!

Figure 5.4 Example of the solution of a simple two-class classification problem involving synthetic data using a neural network having two inputs, two hidden units with 'tanh' activation functions, and a single output having a logistic sigmoid activation function. The dashed blue lines show the z = 0.5 contours for each of the hidden units, and the red line shows the y = 0.5 decision surface for the network. For comparison, the green line denotes the optimal decision boundary computed from the distributions used to generate the data.

[C. Bishop, Pattern recognition and Machine learning, 2006]!

symmetries, and thus any given weight vector will be one of a set of 2^M equivalent weight vectors. Similarly, imagine that we interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values of the weights (and bias) associated with a different hidden unit. Again, this clearly leaves the network input–output mapping function unchanged, but it corresponds to a different choice of weight vector. For M hidden units, any given weight
be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of K mutually exclusive classes. The binary target variables t_k ∈ {0, 1} have a 1-of-K coding scheme indicating the class, and the network outputs are interpreted as y_k(x, w) = p(t_k = 1|x), leading to the following error function

  E(w) = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_kn ln y_k(x_n, w).    (5.24)

The error function is that of logistic regression:!
Training: gradient descent!

evaluations, each of which would require O(W) steps. Thus, the computational effort needed to find the minimum using such an approach would be O(W^3). Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of ∇E brings W items of information, we might hope to find the minimum of the function in O(W) gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only O(W) steps and so the minimum can now be found in O(W^2) steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.

5.2.4 Gradient descent optimization

The simplest approach to using gradient information is to choose the weight update in (5.27) to comprise a small step in the direction of the negative gradient, so that

  w^(τ+1) = w^(τ) − η ∇E(w^(τ))    (5.41)

where the parameter η > 0 is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight vector and the process repeated.

Iterative minimization by gradient descent (a step in the direction of greatest change):!
Possibility of getting stuck in a local minimum:!

Figure 5.5 Geometrical view of the error function E(w) as a surface sitting over weight space. Point w_A is a local minimum and w_B is the global minimum. At any point w_C, the local gradient of the error surface is given by the vector ∇E.

Note that the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate ∇E. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent. Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008).

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent (Gill et al., 1981; Fletcher, 1987; Nocedal and Wright, 1999).

[C. Bishop, Pattern recognition and Machine learning, 2006]!
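The update (5.41) can be sketched in a short loop (the quadratic error below is a stand-in for E(w), purely for illustration):

```python
# Sketch of the batch gradient-descent update of Eq. (5.41):
# w <- w - eta * grad E(w), iterated until (near) convergence.

def gradient_descent(grad, w, eta=0.1, steps=100):
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Stand-in error: E(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at (3, -1).
grad = lambda w: [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
w = gradient_descent(grad, [0.0, 0.0])
print(w)   # close to [3.0, -1.0]
```

On a convex stand-in like this the iteration converges geometrically; on a real network error surface it can instead stall in a local minimum, as Figure 5.5 illustrates.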
Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989).

Two strategies: batch vs. on-line!

Batch: all the data are used for each step!

Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point

  E(w) = Σ_{n=1}^{N} E_n(w).    (5.42)

On-line (stochastic gradient descent): the gradient is computed for a single data point (one per iteration):!

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time, so that

  w^(τ+1) = w^(τ) − η ∇E_n(w^(τ)).    (5.43)

[C. Bishop, Pattern recognition and Machine learning, 2006]!

Computing the gradient!

Minimizing the error requires computing its gradient!
Each partial derivative could (in theory) be computed by a finite difference:!

One approach to evaluating the gradient of the error function is to use finite differences. This can be done by perturbing each weight in turn, and approximating the derivatives by the expression

  ∂E_n/∂w_ji = [E_n(w_ji + ϵ) − E_n(w_ji)] / ϵ + O(ϵ)    (5.68)

where ϵ ≪ 1. In a software simulation, the accuracy of the approximation to the derivatives can be improved by making ϵ smaller, until numerical round-off problems arise. The accuracy of the finite differences method can be improved significantly by using symmetrical central differences of the form

  ∂E_n/∂w_ji = [E_n(w_ji + ϵ) − E_n(w_ji − ϵ)] / (2ϵ) + O(ϵ^2).    (5.69)

In this case, the O(ϵ) corrections cancel, as can be verified by Taylor expansion of the right-hand side of (5.69), and so the residual corrections are O(ϵ^2). The number of computational steps is, however, roughly doubled compared with (5.68).

Problem: the efficiency is very poor. For each data point, for W parameters (weights), we have 2W "stimulations" to perform, each requiring W computations -> O(W^2)!

The main problem with numerical differentiation is that the highly desirable O(W) scaling has been lost. Each forward propagation requires O(W) steps, and

[C. Bishop, Pattern recognition and Machine learning, 2006]!
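The central-difference formula (5.69) is easy to implement, and is commonly used in practice as a sanity check for backpropagation code (the error function below is a stand-in for E_n(w)):

```python
# Sketch of the central-difference approximation of Eq. (5.69):
# each weight is perturbed by +/- eps and the derivative is estimated
# from the two error evaluations, with O(eps^2) residual error.

def central_diff(E, w, i, eps=1e-5):
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    return (E(wp) - E(wm)) / (2.0 * eps)

# Stand-in error function: dE/dw0 = 2*w0 + 3*w1.
E = lambda w: w[0] ** 2 + 3.0 * w[0] * w[1]
approx = central_diff(E, [1.0, 2.0], 0)
exact = 2.0 * 1.0 + 3.0 * 2.0
print(approx, exact)   # agree to within round-off
```

Note that each call estimates a single partial derivative with two full error evaluations, which is exactly the O(W^2) cost per data point criticized above.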
Backpropagation of the gradient (1)!

Here we shall consider the problem of evaluating ∇E_n(w) for one such term in the error function. This may be used directly for sequential optimization, or the results can be accumulated over the training set in the case of batch methods.

It is possible to compute the gradient directly.!
A very simple example: linear activation function (at the output), "sum of squares" error:!

Consider first a simple linear model in which the outputs y_k are linear combinations of the input variables x_i, so that

  y_k = Σ_i w_ki x_i    (5.45)

together with an error function that, for a particular input pattern n, takes the form

  E_n = (1/2) Σ_k (y_nk − t_nk)^2    (5.46)

where y_nk = y_k(x_n, w). The gradient of this error function with respect to a weight w_ji is given by

  ∂E_n/∂w_ji = (y_nj − t_nj) x_ni    (5.47)

For the weights of the output layer, the derivatives can thus be given directly:!

This can be interpreted as a 'local' computation involving the product of an 'error signal' y_nj − t_nj associated with the output end of the link w_ji and the variable x_ni associated with the input end of the link. In Section 4.3.2, we saw how a similar formula arises with the logistic sigmoid activation function together with the cross-entropy error function, and similarly for the softmax activation function together with its matching cross-entropy error function. We shall now see how this simple result extends to the more complex setting of multilayer feed-forward networks.

[C. Bishop, Pattern recognition and Machine learning, 2006]!

5.3 Error Backpropagation

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form

  a_j = Σ_i w_ji z_i    (5.48)

where z_i is the activation of a unit, or input, that sends a connection to unit j, and w_ji is the weight associated with that connection. In Section 5.1, we saw that biases can be included in this sum by introducing an extra unit, or input, with activation fixed at +1. We therefore do not need to deal with biases explicitly. The sum in (5.48) is transformed by a nonlinear activation function h(·) to give the activation z_j of unit j in the form

  z_j = h(a_j).    (5.49)

Note that one or more of the variables z_i in the sum in (5.48) could be an input, and similarly, the unit j in (5.49) could be an output.

For each pattern in the training set, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units in the network by successive application of (5.48) and (5.49). This process is often called forward propagation because it can be regarded as a forward flow of information through the network.

The derivatives with respect to the parameters of the hidden layers are still missing. Recall the computation performed by a hidden unit:!
To simplify, we will first compute the derivatives with respect to the summed inputs a_j!

Now consider the evaluation of the derivative of E_n with respect to a weight w_ji. The outputs of the various units will depend on the particular input pattern n. However, in order to keep the notation uncluttered, we shall omit the subscript n from the network variables. First we note that E_n depends on the weight w_ji only via the summed input a_j to unit j. We can therefore apply the chain rule for partial derivatives to give

  ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji).    (5.50)

We now introduce a useful notation

  δ_j ≡ ∂E_n/∂a_j.    (5.51)

Using (5.48), we can write

  ∂a_j/∂w_ji = z_i.    (5.52)

Substituting (5.51) and (5.52) into (5.50), we then obtain

  ∂E_n/∂w_ji = δ_j z_i.    (5.53)

Equation (5.53) tells us that the required derivative is obtained simply by multiplying the value of δ for the unit at the output end of the weight by the value of z at the input end of the weight (where z = 1 in the case of a bias). Note that this takes the same form as for the simple linear model considered at the start of this section. Thus, in order to evaluate the derivatives, we need only to calculate the value of δ_j for each hidden and output unit in the network, and then apply (5.53).

For the output layer, provided we are using the canonical link as the output-unit activation function:!

  δ_k = y_k − t_k    (5.54)

For the hidden units, we again make use of the chain rule for partial derivatives:

  δ_j ≡ ∂E_n/∂a_j = Σ_k (∂E_n/∂a_k)(∂a_k/∂a_j)    (5.55)

where the sum runs over all units k to which unit j sends connections. The arrangement of units and weights is illustrated in Figure 5.7. Note that the units labelled k could include other hidden units and/or output units. In writing down (5.55), we are making use of the fact that variations in a_j give rise to variations in the error function only through variations in the variables a_k.
tk the
δkby
∂a
rise
variations
error funche fact that the
variations
ajj give
hidden ∂E
andnin
output
units in
the to
network
application
of (5.48) and
δ
≡
(5.51)
j This process is often called forward propagation because it can be regarded
gh variations(5.49).
in the
variables
ak . If we now substitute the definition
∂aj
are often referred
to
as
errors
for
reasons
shallthesee
shortly. Using
as a forward flow of information we
through
network.
5. NEURAL
Chaque
dérivée
peut
être use
calculée
à partir
des(5.49),
dérivées
la couche
5.51)
intoNETWORKS
(5.55),
and
make
of (5.48)
and
wedeobtain
the
Now
consider
the evaluation
of the
derivative
rewrite
often referred to as
errors
for reasons
we shall see
shortly.
Usingof En with respect to a weigh
:!
propagation
formula
∂aj of the various units will depend on the particular input pattern n
write suivantew
ji . The outputs
ure 5.7 Illustration of the∂a
calculation of
hidden unit j by
=δjto
zfor
(5.52)
i .keep
! backpropagation
j in
However,
order
the
notation
uncluttered,
we
shall
omit the subscript n
of the
δ’s
from
those
units
k
to
which
∂w
zi
= ji
zi .
(5.52)
δk
!
unit j sends
connections.
arrow denotes
the note that Enδjdepends on the weight wji only
from
thejinetwork
variables.
First we
∂w
′The blue
wji
wkj
= flow
h (a
w
δ
(5.56)
δj summed
direction
of information
during
forwardobtain
propagation,
j )then
kj
k
5.51)
and
(5.52)
into
(5.50),
we
to
unit
j.
We
can
therefore
apply
the
chain
rule for partia
via
the
input
a
j
the red
indicate
the backward
51) andand
(5.52)
intoarrows
(5.50),
we then
obtain propagation
zj
derivatives to give
k
of error information.
δ1
∂En
∂En ∂aj
∂En∂En
=
(5.50
(5.53)
= δj z=
(5.53).
i . δj zi .
∂w
∂a
∂w
that the value of
for
unit
can
be obtained
by2006]!
ji [C. Bishop,
j Pattern
ji
∂wjiδ∂w
recognition
and Machine learning,
ji a particular hidden
Retro-propagation du gradient (2)!
provided we are using the canonical link as the output-unit activation function. To
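Equations (5.48)-(5.56) can be sketched in a few lines of NumPy for a two-layer network with tanh hidden units and linear outputs (a minimal illustration; the function and variable names are ours, not from the course or the book):

```python
import numpy as np

def forward_backward(x, t, W1, W2):
    """One backpropagation step for a two-layer network
    (tanh hidden units, linear outputs, sum-of-squares error)."""
    # Forward propagation: a_j = sum_i w_ji z_i, z_j = h(a_j)   (5.48)-(5.49)
    a1 = W1 @ x          # summed inputs of the hidden units
    z1 = np.tanh(a1)     # hidden activations, h = tanh
    y  = W2 @ z1         # linear output units
    # Output errors: delta_k = y_k - t_k                         (5.54)
    delta_out = y - t
    # Backpropagation: delta_j = h'(a_j) sum_k w_kj delta_k      (5.56)
    delta_hidden = (1.0 - z1**2) * (W2.T @ delta_out)
    # Gradients: dEn/dw_ji = delta_j z_i                         (5.53)
    grad_W2 = np.outer(delta_out, z1)
    grad_W1 = np.outer(delta_hidden, x)
    return y, grad_W1, grad_W2
```

A quick sanity check of such an implementation is to compare the analytic gradients against central finite differences of the error function.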
Recap: generalization and model selection

Complexity of the prediction model:
• number of hidden layers
• number of hidden units per layer
5.5. Regularization in Neural Networks

One hidden layer with M units; data generated from the sinusoidal data set.

Figure 5.9 Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm.

A simple regularizer takes the form

    Ẽ(w) = E(w) + (λ/2) wᵀw.        (5.112)

This regularizer is also known as weight decay and has been discussed at length.

[C. Bishop, Pattern recognition and Machine learning, 2006]
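The gradient of the regularized error (5.112) is the gradient of E(w) plus λw, so each gradient step shrinks the weights; a minimal sketch (the function name is ours):

```python
import numpy as np

def regularized_gradient_step(w, grad_E, lam, eta):
    """One gradient-descent step on E~(w) = E(w) + (lambda/2) w^T w  (5.112).
    The regularizer contributes lambda * w to the gradient, which
    'decays' the weights towards zero at every step."""
    return w - eta * (grad_E + lam * w)
```

With no data gradient, eta = 0.5 and lam = 0.1, each step multiplies w by (1 − eta·lam) = 0.95, hence the name weight decay.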
Early stopping

Training is iterative (one iteration over the data is called an "epoch"). Overfitting can be reduced by stopping training when the error on the validation set starts to increase.

Error on the training set (left); error on the validation set (right).

Figure 5.12 An illustration of the behaviour of training set error (left) and validation set error (right) during a typical training session, as a function of the iteration step, for the sinusoidal data set. The goal of achieving the best generalization performance suggests that training should be stopped at the point shown by the vertical dashed lines, corresponding to the minimum of the validation set error.

[C. Bishop, Pattern recognition and Machine learning, 2006]
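The stopping rule can be sketched as a generic training loop (our own sketch: `step` performs one training epoch, `val_error` evaluates the validation-set error; both are placeholders):

```python
def train_with_early_stopping(step, val_error, w0, max_epochs=500, patience=10):
    """Early stopping: train until the validation error has not improved
    for `patience` epochs, then return the best weights seen so far."""
    best_w, best_err, since_best = w0, val_error(w0), 0
    w = w0
    for epoch in range(max_epochs):
        w = step(w)              # one training epoch
        err = val_error(w)       # monitor the validation set
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break            # validation error started increasing
    return best_w, best_err
```

Returning the best weights (rather than the last ones) corresponds to stopping at the minimum of the validation-set error in Figure 5.12.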
Application: gesture recognition

Classification of gestures from the articulated pose of a human (joint positions estimated with a Kinect sensor).

Consecutive poses (joint positions) → features invariant to viewpoint, person, etc. → neural network → gesture class (left-to-right, right-to-left, up-to-down, down-to-up, pick up, hang up, ...).

While the corresponding button is pressed, the data is stored for each frame and each training vector is automatically labelled, so this process is a ground-truth collection in which we gather the proper objective data for the training. Once new data is stored, the neural network can be updated with supervised learning, since we have the labels corresponding to the respective feature vectors. Because all the recorded data is labelled as a gesture, and both gesture and no-gesture samples are needed to train the Filter neural network, the only classifier that is trained is the General neural network.

Figure 9. Recording of new vectors, labelled by pressing the corresponding key.
Supervised classification

• kNN (k nearest neighbours)

• Linear generative (Bayesian) classifiers
Figure 1.27 Example of the class-conditional densities for two classes having a single input variable x (left
plot) together with the corresponding posterior probabilities (right plot). Note that the left-hand mode of the
class-conditional density p(x|C1 ), shown in blue on the left plot, has no effect on the posterior probabilities. The
vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification
rate.
• Linear discriminative classifiers

A generative model can also be used to detect new data points for which the predictions may be of low accuracy, which is known as outlier detection or novelty detection (Bishop, 1994; Tarassenko, 1995).
However, if we only wish to make classification decisions, then it can be wasteful of computational resources, and excessively demanding of data, to find the joint
distribution p(x, Ck ) when in fact we only really need the posterior probabilities
p(Ck|x), which can be obtained directly through approach (b). Indeed, the class-conditional densities may contain a lot of structure that has little effect on the posterior probabilities, as illustrated in Figure 1.27. There has been much interest in
exploring the relative merits of generative and discriminative approaches to machine
learning, and in finding ways to combine them (Jebara, 2004; Lasserre et al., 2006).
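The generative route can be made concrete for the one-dimensional setting of Figure 1.27: model the class-conditional densities p(x|Ck) and priors p(Ck), then obtain the posteriors by Bayes' theorem. A minimal sketch with 1-D Gaussian densities (all parameter values below are illustrative, not those of the figure):

```python
import numpy as np

def posterior(x, means, stds, priors):
    """Generative classification: model p(x|C_k) as 1-D Gaussians and
    combine with the priors p(C_k) via Bayes' theorem to get p(C_k|x)."""
    means, stds, priors = map(np.asarray, (means, stds, priors))
    # class-conditional densities p(x|C_k)
    px_given_c = np.exp(-0.5 * ((x - means) / stds)**2) / (stds * np.sqrt(2 * np.pi))
    joint = px_given_c * priors      # p(x, C_k) = p(x|C_k) p(C_k)
    return joint / joint.sum()       # normalize: p(C_k|x)
```

A point near the mean of class 1 then receives a posterior close to 1 for that class, and classifying to the largest posterior reproduces the decision boundary of the figure.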
An even simpler approach is (c) in which we use the training data to find a
discriminant function f (x) that maps each x directly onto a class label, thereby
combining the inference and decision stages into a single learning problem. In the
example of Figure 1.27, this would correspond to finding the value of x shown by
the vertical green line, because this is the decision boundary giving the minimum
probability of misclassification.
With option (c), however, we no longer have access to the posterior probabilities
p(Ck |x). There are many powerful reasons for wanting to compute the posterior
probabilities, even if we subsequently use them to make decisions.

• Neural networks

• Random forests

• SVM (Support Vector Machines)
Basics: information theory

Entropy h(x) of a random variable x: a measure of information content. Low-probability events x correspond to high information content: if a very improbable event has just occurred, we have received more information than if a highly probable one had occurred, and if we knew the event was certain to happen we would receive no information. Our measure of information therefore depends on the probability distribution p(x) and is a monotonic function of the probability p(x).

It can be derived from two independent variables: if x and y are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately, h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent, so p(x, y) = p(x)p(y). From these two relationships, h(x) must be given by the logarithm of p(x), and so we have

    h(x) = − log2 p(x)        (1.92)

where the negative sign ensures that information is positive or zero. The choice of basis for the logarithm is arbitrary; for the moment we adopt the convention prevalent in information theory of using logarithms to the base of 2, in which case the units of h(x) are bits ('binary digits').

Now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of (1.92) with respect to the distribution p(x) and is given by

    H[x] = − Σx p(x) log2 p(x).        (1.93)

This important quantity is called the entropy of the random variable x. Note that lim p→0 p ln p = 0, and so we shall take p(x) ln p(x) = 0 whenever we encounter a value for x such that p(x) = 0.

Figure 1.30 Histograms of two probability distributions over 30 bins illustrating the higher value of the entropy H for the broader distribution (H = 1.77 and H = 3.09, respectively). The largest entropy would arise from a uniform distribution, which would give H = − log2(1/30) ≈ 4.91.

[C. Bishop, Pattern recognition and Machine learning, 2006]
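Equation (1.93) is a one-liner; a small sketch (with the convention p log p = 0 for p = 0):

```python
import numpy as np

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x)   (1.93),
    taking p(x) log2 p(x) = 0 whenever p(x) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                       # drop zero-probability values
    return float(-(nz * np.log2(nz)).sum())
```

A fair coin gives H = 1 bit, a certain outcome gives H = 0, and a uniform distribution over 30 bins gives H = log2 30 ≈ 4.91 bits, the maximum mentioned for Figure 1.30.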
Decision trees

Extremely fast classification!

• At each node of a tree, one component of the feature vector is examined
• We branch left or right according to a comparison with a threshold (different for each node)
• Each leaf contains a decision (a class)
• Parameters:
  • the choice of the examined component for each node
  • the threshold for each node
  • the decision for each leaf
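Inference is a simple traversal from the root to a leaf; a sketch with a dict-based tree representation of our own choosing:

```python
def classify(tree, x):
    """Traverse a decision tree: at each internal node, one component of
    the feature vector x is compared with the node's threshold; each
    leaf stores the decision (a class label)."""
    node = tree
    while 'leaf' not in node:
        if x[node['feature']] < node['threshold']:
            node = node['left']
        else:
            node = node['right']
    return node['leaf']
```

Classification costs at most one comparison per level of the tree, which is why decision trees are so fast at test time.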
Random forests

Several trees are trained on different subsets of the data. Each leaf contains a distribution over labels.

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue) and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a particular input.

The corresponding distributions are added (averaged) over all trees to give the final classification:

    P(c|I, x) = (1/T) Σ_{t=1}^{T} Pt(c|I, x).        (2)
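Equation (2) in code, assuming each tree's reached leaf yields a normalized label histogram (a sketch; the data layout is ours):

```python
import numpy as np

def forest_posterior(tree_leaf_dists):
    """Eq. (2): average the per-tree leaf distributions P_t(c|I,x)
    to obtain the forest posterior P(c|I,x).
    `tree_leaf_dists` has shape (T, n_classes)."""
    d = np.asarray(tree_leaf_dists, dtype=float)
    return d.mean(axis=0)
```

Averaging normalized distributions keeps the result normalized, and disagreements between trees show up as a flatter (higher-entropy) posterior.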
Training. Each tree is learned independently, on a different set of randomly synthesized images. A random subset of 2000 example pixels from each image is chosen to ensure a roughly even distribution across body parts. The tree is grown level by level, greedily: the gradient of the error is not available! Each tree is trained using the following algorithm [20]:

1. For a given node, randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and thresholds τ).
2. Partition the set of training examples Q = {(I, x)} into left and right subsets by each φ:

       Ql(φ) = { (I, x) | fθ(I, x) < τ }        (3)
       Qr(φ) = Q \ Ql(φ)                        (4)

3. Compute the candidate φ giving the largest gain in information (an entropy measure):

       φ* = argmaxφ G(φ)                                                  (5)
       G(φ) = H(Q) − Σ_{s ∈ {l,r}} ( |Qs(φ)| / |Q| ) H(Qs(φ))             (6)

   where the Shannon entropy H(Q) is computed on the normalized histogram of body part labels lI(x) for all (I, x) ∈ Q.
4. Recurse on Ql(φ*) and Qr(φ*).

Learning the parameters
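Step 3 of the algorithm (equations (3)-(6)) can be sketched on plain feature vectors rather than depth features fθ (a simplification of ours):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of the normalized label histogram of a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X, y, candidates):
    """Among candidate splits phi = (feature, threshold), choose the one
    maximizing the information gain
    G(phi) = H(Q) - sum_s |Q_s|/|Q| H(Q_s)    (6)."""
    H_Q, n = entropy(y), len(y)
    best_phi, best_gain = None, -np.inf
    for feat, tau in candidates:
        left = X[:, feat] < tau                  # Q_l(phi), Eq. (3)
        nl = left.sum()
        if nl == 0 or nl == n:
            continue                             # degenerate split
        gain = H_Q - (nl / n) * entropy(y[left]) \
                   - ((n - nl) / n) * entropy(y[~left])
        if gain > best_gain:
            best_phi, best_gain = (feat, tau), gain
    return best_phi, best_gain
```

Recursing this selection on the left and right subsets grows the tree level by level, as in the algorithm above.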
pman
Andrew Blake
dge Application
& Xbox Incubation: estimation
- “R”: Right hand is used for perform the gestures.
- “L”: Left hand is used for perform the gestures.
- “B”: Both hands are used for perform the gestures.
de la pose
humaine (MS Kinect)!
After that, we can collect data of a desired class everytime we want by pressing th
button on your computer keyboard. The buttons for collect each class data are:
- “1“: Stop
- “2”: Scroll: left-to-right
- “3”: Scroll: right-to-left
- “4”: Scroll: up-to-down
- “5”: Scroll: down-to-up
- “6”: Phone: pick up
- “7”: Phone: hang up
- “8”: Hi
While the button is pressed , the data is stored for each frame and each training ve
automatically labeled as we indicate the label by pressing the button. So, this proc
truth collection because we gather the proper objective data for the training.
Once we have new data stored, we can update the neural network with a supervise
have the labels corresponding to its respective feature vectors.
Taking into account that the new data is a ground truth set, we can not update the
network because all data recorded is labelled as a gesture. As we need gesture sam
no-gesture samples to train the Filter neural network, the only classifier that is tra
neural network.
Figure 9. Recording of new vector labelling by pressing the correspondent key
depth image
body parts
3D joint proposals
[J. Shotton et al., 2011]!
Figure 1. Overview. From an single input depth image, a per-pixel
Training data

Figure 2. Synthetic and real data. Pairs of depth image and ground truth body parts (synthetic images used for train & test).

• 31 human body parts
• 1 million synthetic images produced by a rendering pipeline (half of them based on motion capture of real humans)
• 2000 pixels sampled from each image = 2,000,000,000 training vectors
• 3 trees, each of depth 20
• Each leaf: one histogram over 31 values = 32,505,856 stored values per tree
• 2^20 − 1 = 1,048,575 thresholds per tree

Synthesis "simplifies the task of background subtraction, which we assume in this work. But most importantly for our approach, it is straightforward to synthesize realistic depth images of people and thus build a large training dataset cheaply."

[J. Shotton et al., 2011]
The features

Parts for left and right allow the classifier to disambiguate the left and right sides of the body. Of course, the precise definition of these parts could be changed to suit a particular application; for example, in an upper body tracking scenario, all the lower body parts could be merged. Parts should be sufficiently small to accurately localize body joints, but not too numerous as to waste capacity of the classifier.

3.2. Depth image features. We employ simple depth comparison features, inspired by those in [20]. At a given pixel x, the features compute

    fθ(I, x) = dI( x + u / dI(x) ) − dI( x + v / dI(x) )        (1)

where dI(x) is the depth at pixel x in image I, and parameters θ = (u, v) describe the offsets u and v (x ... pixel position; u, v ... offsets; dI ... depth). The normalization of the offsets by 1/dI(x) ensures the features are depth invariant: at a given point on the body, a fixed world space offset will result whether the pixel is close or far from the camera. The features are thus 3D translation invariant (modulo perspective effects).

Figure 3. Depth image features. The yellow crosses indicate the pixel x being classified. The red circles indicate the offset pixels as defined in Eq. 1. In (a), the two example features give a large depth difference response. In (b), the same two features at new image locations give a much smaller response.

[J. Shotton et al., 2011]
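Equation (1) in code, for a 2-D depth array (a sketch of ours; out-of-bounds probes are simply clipped here, whereas the paper gives background and out-of-image probes a large constant depth):

```python
import numpy as np

def depth_feature(D, x, u, v):
    """Eq. (1): f_theta(I,x) = d_I(x + u/d_I(x)) - d_I(x + v/d_I(x)).
    D is a depth image (2-D array), x a pixel (row, col), theta = (u, v)
    the two offsets. Dividing the offsets by the depth at x makes the
    feature depth invariant."""
    d = D[x[0], x[1]]
    def probe(off):
        # offset scaled by 1/d, rounded to a pixel, clipped to the image
        p = np.round(np.asarray(x) + np.asarray(off) / d).astype(int)
        p = np.clip(p, 0, np.array(D.shape) - 1)
        return D[p[0], p[1]]
    return probe(u) - probe(v)
```

Each feature is just two look-ups and a subtraction, which is what makes per-pixel evaluation over whole depth images cheap.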
SVM ("Support Vector Machines")

The starting assumption is that the data are linearly separable (this assumption will later be partially relaxed). For a set of linearly separable points, there exist infinitely many separating hyperplanes. [Wikipedia]
SVM: the margin

We seek the hyperplane with the maximal margin (the distance between the samples of different classes). Objective: reduce the error on unseen data.
SVM: the "kernel trick"

Some data are not linearly separable. Trick: project the data into a space where they are.
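The trick is that the projection never needs to be computed explicitly: only inner products in the projected space are needed, and a kernel function supplies them. A sketch with a kernelized perceptron (a simpler stand-in for the SVM dual, chosen here only to illustrate the kernel trick; like an SVM, its decision function is a kernel expansion over training points):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an implicit feature
    space in which the data may become linearly separable."""
    return np.exp(-gamma * np.sum((a - b)**2))

def kernel_perceptron(X, y, kernel=rbf, epochs=20):
    """Train a kernel perceptron on labels y in {-1, +1}.
    Returns a predict function f(x) = sign(sum_i alpha_i y_i k(x_i, x))."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(epochs):
        for i in range(n):
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1.0        # mistake: strengthen this point
    return lambda x: np.sign(np.sum(
        alpha * y * np.array([kernel(xi, x) for xi in X])))
```

With the RBF kernel, this separates the XOR configuration, which no linear classifier in the original space can.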
SVM: applications

SVMs have been used for nearly every application in image analysis / computer vision. Some applications developed at LIRIS:

• Recognition of individual actions [Wolf and Taylor, 2010]
• Recognition of collective actions [Baccouche/Mamalet/Wolf/Garcia/Baskurt, 2012]
• Text recognition [Wolf and Jolion, 2003]
Popularity of classifiers

Google N-gram viewer, queried on 24.7.2014. Google Fight (24.7.2014).