M2R IGI
Cours « Méthodes avancées en Image et Vidéo »
Machine Learning 1
Christian Wolf, INSA-Lyon, LIRIS

Course schedule (lecturer, site, lecture):
• C. Wolf (Doua): Machine Learning 1 (introduction, classification)
• C. Wolf (Doua): Machine Learning 2 (features, deep learning, modèles graphiques)
• M. Ardabilian (ECL): Indexation par le contenu : évaluation des techniques et systèmes à base d'images
• M. Ardabilian (ECL): Acquisition multi-image et applications
• M. Ardabilian (ECL): Approches de super-résolution et applications
• M. Ardabilian (ECL): Méthodes de fusion en imagerie
• M. Ardabilian (ECL): Détection, analyse et reconnaissance d'objets
• J. Mille (Doua): Méthodes variationnelles
• M. Ardabilian (ECL): La biométrie faciale
• E. Dellandréa (ECL): Le son au sein de données multimédia : codage, compression et analyse
• S. Duffner (Doua): Suivi d'objets
• E. Dellandréa (ECL): Un exemple d'analyse audio : reconnaissance de l'émotion à partir de signaux de parole et de musique
• S. Bres (Doua): Indexation d'images et de vidéos
• V. Eglin (Doua): Analyse de documents numériques 1
• F. LeBourgeois (Doua): Analyse de documents numériques 2

Outline
Introduction
– General principles
– Fitting and generalization, model complexity
– Empirical risk minimization
Supervised classification
– k-NN (k nearest neighbours)
– Linear models for classification
– Neural networks
– Decision trees and random forests
– SVM (Support Vector Machines)
Feature extraction
– Hand-crafted features
– PCA (principal component analysis)
– Convolutional neural networks ("deep learning")

Some sources
Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer Verlag, 2006. Microsoft Research, Cambridge, UK.
Kevin P. Murphy, "Machine Learning", MIT Press, 2013. Google; formerly University of British Columbia.

Pattern recognition
Supervised classification
[Example images with labels: Luc, Boite, Lego.]
Object recognition from images:
• digits, letters, logos
• biometrics: faces, fingerprints, iris, ...
• object classes (cars, trees, boats, ...)
• specific objects (a particular car, a mug, ...)

Prediction
In general, we would like to predict a value t from an observation x:
• if t is continuous: regression
• if t is discrete: classification
The parameters are learned from data.

Learning and generalization
Learning to classify data amounts to learning a decision function: the boundary between the classes.
The complexity of a decision function depends on how the class labels are grouped in feature space.
[Figure: handwritten digit samples (0 to 9) scattered in a 2-D feature space.]

Fitting and generalization
• The data are generated by a function.
• Goal: assuming the function is unknown, predict t from x.
[Figure 1.2: training set of N = 10 points (blue circles), each an observation of x together with its target t; the green curve sin(2πx) was used to generate the data. The goal is to predict t for a new value of x without knowledge of the generating curve. C. Bishop, Pattern Recognition and Machine Learning, 2006]

"Fitting" a polynomial of order M:
$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$   (1.1)
where $x^j$ denotes x raised to the power j, M is the order of the polynomial, and the coefficients $w_0, \dots, w_M$ are collected in the vector w. Although $y(x, \mathbf{w})$ is a nonlinear function of x, it is linear in the coefficients w.

Criterion (least-squares error)
The coefficients are determined by fitting the polynomial to the training data, i.e. by minimizing an error function that measures the misfit between $y(x, \mathbf{w})$, for any given value of w, and the training points:
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2$   (1.2)
E(w) is one half of the sum of the squared differences between the predictions $y(x_n, \mathbf{w})$ and the targets $t_n$; it is non-negative and equals zero only if the curve passes exactly through every training point (Figure 1.3 illustrates it as the squared vertical displacements of the data points from the fitted curve).
Because E(w) is quadratic in w, its derivative is linear in w, so the minimization admits a direct (closed-form) solution.
[C. Bishop, Pattern Recognition and Machine Learning, 2006]
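Since E(w) is quadratic in the coefficients, the minimiser can be written down directly from the normal equations. A minimal sketch, assuming synthetic sin(2πx) data as in the slides (the noise level and sample size are placeholders, not values from the course):

```python
import numpy as np

# Synthetic data in the spirit of Figure 1.2: t = sin(2*pi*x) + noise
rng = np.random.default_rng(0)
N, M = 10, 3                                # number of points, polynomial order
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)

# Design matrix Phi[n, j] = x_n^j for j = 0..M (column j = 0 carries w0)
Phi = np.vander(x, M + 1, increasing=True)

# Minimising E(w) = 1/2 * sum_n (Phi w - t)_n^2 gives the normal equations
# (Phi^T Phi) w = Phi^T t, solved here in closed form.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Prediction y(x, w) for a new input
x_new = 0.5
y_new = np.vander(np.array([x_new]), M + 1, increasing=True) @ w
print(w, y_new)
```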
Model selection
Which order M for the polynomial?
Underfitting: M = 0 and M = 1 (the polynomial is too rigid to capture the oscillations of sin(2πx)).
Overfitting: M = 9 (the polynomial passes through every training point but oscillates wildly between them); M = 3 gives a reasonable fit.
[Figure 1.4: polynomials of order M = 0, 1, 3 and 9 (red curves) fitted to the data set of Figure 1.2. C. Bishop, Pattern Recognition and Machine Learning, 2006]

Model selection
Split the data into two parts:
• a training set
• a test set
Root-mean-square (RMS) error:
$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star) / N}$   (1.3)
Dividing by N allows data sets of different sizes to be compared, and the square root puts $E_{\mathrm{RMS}}$ on the same scale (and in the same units) as the target variable t.
[Figure 1.5: training-set and test-set RMS error as a function of M. Small values of M give large errors on both sets; for M = 9 the training error goes to zero, as expected since the polynomial has 10 degrees of freedom (its 10 coefficients) and can be tuned exactly to the 10 training points, while the test error becomes very large. C. Bishop, Pattern Recognition and Machine Learning, 2006]

Cross-validation
The split of the data into two parts is changed iteratively. The performance measure is the average over all iterations.
[Figure: S-fold cross-validation with S = 4. The data are partitioned into S groups; S runs are performed, each holding out one group for evaluation and training on the remaining groups, and the performance scores of the runs are averaged. Taking S = N, where N is the number of data points, gives leave-one-out cross-validation, appropriate when data are particularly scarce. C. Bishop, Pattern Recognition and Machine Learning, 2006]

Big Data!
The overfitting problem decreases as the size of the training set increases.
[Figure 1.6: the M = 9 polynomial fitted by minimizing the sum-of-squares error to N = 15 (left) and N = 100 (right) data points; increasing the size of the data set reduces overfitting. C. Bishop, Pattern Recognition and Machine Learning, 2006]
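The train/test comparison of Figure 1.5 can be reproduced in a few lines. A sketch under the same assumptions as the previous snippet (synthetic sin(2πx) data; the split sizes are illustrative, not taken from the course):

```python
import numpy as np

def fit_poly(x, t, M):
    """Least-squares fit of an order-M polynomial via the normal equations."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def rms(x, t, w):
    """Root-mean-square error E_RMS = sqrt(2 E(w*) / N)."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

rng = np.random.default_rng(1)
def make_set(N):
    x = rng.uniform(0.0, 1.0, N)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)

x_train, t_train = make_set(10)     # small training set, as in the slides
x_test, t_test = make_set(100)      # independent test set

for M in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, M)
    print(M, rms(x_train, t_train, w), rms(x_test, t_test, w))
```

As M grows, the training RMS keeps decreasing while the test RMS eventually increases, which is the overfitting behaviour described above.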
Regularization
Add an extra term to the error function (1.2) that restricts the model parameters, i.e. discourages the coefficients from reaching large values. The simplest such penalty term is the sum of squares of all coefficients, leading to the modified error function
$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2$   (1.4)
where $\lVert \mathbf{w} \rVert^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$, and the regularization parameter λ governs the relative importance of the penalty compared with the sum-of-squares error. The coefficient $w_0$ is often excluded from the regularizer, because its inclusion makes the result depend on the choice of origin for the target variable (Hastie et al., 2001); it may also be included but with its own regularization coefficient. The error (1.4) can again be minimized exactly in closed form. Such techniques are known in the statistics literature as shrinkage methods because they reduce the values of the coefficients; the quadratic regularizer in particular is called ridge regression (Hoerl and Kennard, 1970), and in the context of neural networks this approach is known as weight decay.
[Figure 1.7: M = 9 polynomials fitted to the data of Figure 1.2 with the regularized error (1.4), for ln λ = −18 and ln λ = 0; the unregularized case λ = 0 (ln λ = −∞) is the bottom-right plot of Figure 1.4. C. Bishop, Pattern Recognition and Machine Learning, 2006]
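The regularized error (1.4) also has a closed-form minimiser. A sketch reusing the hypothetical setup above; note that, for brevity, the bias column is regularised here as well, whereas the slide points out that w0 is often excluded:

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimise 1/2 ||Phi w - t||^2 + lam/2 ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)     # ridge-regularised normal equations
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=10)

for ln_lam in (-np.inf, -18.0, 0.0):          # the values used in Figure 1.7
    lam = 0.0 if np.isinf(ln_lam) else np.exp(ln_lam)
    w = fit_ridge(x, t, 9, lam)
    print(ln_lam, np.linalg.norm(w))          # larger lambda -> smaller coefficients
```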
Supervised classification
Pipeline: a labelled training set is fed to a learning algorithm, which produces a model; at test time a new input is passed to the classifier (the model), which returns a class.

Supervised classification: the methods covered
• k-NN (k nearest neighbours)
• Linear generative (Bayesian) classifiers
• Linear discriminative classifiers
• Neural networks
• Decision trees and random forests
• SVM (Support Vector Machines)

"k nearest neighbours"
Probably the simplest classifier. The training set (feature vectors and their labels) is simply stored. For a new vector, the closest stored data point is searched for, and its label serves as the prediction.
[Figure: digit samples in a 2-D feature space (Dim 1, Dim 2); the new data point receives the label of its nearest neighbour.]
k-NN is optimal in the limit of an infinite number of training samples.
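A minimal nearest-neighbour classifier as described on the slide: store the labelled training vectors and predict the label of the closest one. The generalisation to a majority vote over k neighbours is added here as an assumption; the toy data are invented for illustration:

```python
import numpy as np
from collections import Counter

class KNN:
    """k nearest neighbours: no training beyond storing the data."""
    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self

    def predict(self, x):
        d = np.linalg.norm(self.X - np.asarray(x, float), axis=1)   # distances to all stored points
        nearest = np.argsort(d)[: self.k]                           # indices of the k closest
        return Counter(self.y[nearest]).most_common(1)[0][0]        # majority vote over their labels

# Toy 2-D example (hypothetical data, not from the course)
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
print(KNN(k=3).fit(X, y).predict([0.8, 0.9]))   # -> 1
```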
Supervised classification (overview slide repeated: k-NN; linear generative (Bayesian) classifiers; linear discriminative classifiers; neural networks; random forests; SVM). Next: generative (Bayesian) classifiers.

Probabilistic modelling
Consider a two-class classification problem (classes C1 and C2) with 1-D data modelled by a random variable x (the observation).
Probabilistic modelling:
• what we want: the posterior probability p(C_k | x);
• what we have at our disposal: the likelihood of the data, p(x | C_k).
[Figure 1.27 (left): class-conditional densities p(x | C1) and p(x | C2) for a single input variable x. C. Bishop, Pattern Recognition and Machine Learning, 2006]
To obtain the posterior probability, the model has to be inverted using Bayes' rule.
Bayesian modelling
$p(C_k \mid x) = \frac{p(x \mid C_k)\; p(C_k)}{p(x)}$
• $p(x \mid C_k)$: likelihood
• $p(C_k)$: prior probability
• $p(x)$: evidence; it can be ignored when we only maximize over k.
The prior models our knowledge about the outcome, independently of the observations. In simple cases it is simply tabulated.
[Figure 1.27: class-conditional densities (left) and the corresponding posterior probabilities p(C1 | x), p(C2 | x) (right); the vertical green line marks the decision boundary in x that gives the minimum misclassification rate. C. Bishop, Pattern Recognition and Machine Learning, 2006]

Generating new data
This family of models is called "generative models" because it allows new data respecting the model to be generated (a small sketch follows the next slide):
1. Sample a class according to the prior distribution p(C_k).
2. For the sampled class, sample the observation x according to the likelihood p(x | C_k).

Probability of error
The probability of committing an error of a given type can be computed directly from the model.
[Figure 1.24: joint probabilities p(x, C_k) for two classes plotted against x, together with the decision boundary x = x̂. Values x ≥ x̂ are classified as C2 (region R2), values x < x̂ as C1 (region R1). For x < x̂ the errors come from points of class C2 misclassified as C1 (red and green regions); for x ≥ x̂ they come from points of class C1 misclassified as C2 (blue region). By moving the decision threshold x̂, only the red region can be reduced. C. Bishop, Pattern Recognition and Machine Learning, 2006]
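A sketch of the two uses of a generative model described above: sampling new data (class from the prior, then x from the class-conditional likelihood) and inverting the model with Bayes' rule to obtain the posterior. The Gaussian class-conditionals and all numerical values are assumptions for illustration, not taken from the course:

```python
import numpy as np

# Assumed model: priors p(C_k) and 1-D Gaussian likelihoods p(x | C_k)
prior = np.array([0.7, 0.3])
means, stds = np.array([0.3, 0.7]), np.array([0.1, 0.15])

def gauss(x, mu, sigma):
    """1-D Gaussian density, used as the class-conditional likelihood."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

def sample():
    """Generate one observation: class ~ prior, then x ~ p(x | class)."""
    k = rng.choice(2, p=prior)
    return k, rng.normal(means[k], stds[k])

def posterior(x):
    """Bayes' rule: p(C_k | x) = p(x | C_k) p(C_k) / p(x)."""
    joint = np.array([gauss(x, means[k], stds[k]) * prior[k] for k in range(2)])
    return joint / joint.sum()       # dividing by p(x) normalises the posteriors

print(sample())
print(posterior(0.5))                # sums to 1; argmax gives the decision
```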
Supervised classification (overview slide repeated). Next: linear discriminative classifiers.

Linear models for classification
We restrict attention to linear discriminants, i.e. models whose decision surfaces are hyperplanes in the input space. To simplify the discussion, we consider first the case of two classes and then the extension to K > 2 classes.
Linear model for classification
A decision function is modelled by a parametric function:
$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$   (4.4)
where w holds the parameters (the weight vector) and $w_0$ is a bias, which can be integrated into the other parameters by appending the constant "1" to the inputs:
$y(\mathbf{x}) = \widetilde{\mathbf{w}}^T \widetilde{\mathbf{x}}$ with $\widetilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\widetilde{\mathbf{x}} = (1, \mathbf{x})$.

Linear models (2 classes): interpretation
• An input x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise; the decision boundary is defined by y(x) = 0 and corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space.
• For any two points $\mathbf{x}_A$ and $\mathbf{x}_B$ lying on the decision surface, $\mathbf{w}^T(\mathbf{x}_A - \mathbf{x}_B) = 0$, so w is orthogonal to the decision surface and determines its orientation.
• The normal distance from the origin to the decision surface is $-w_0 / \lVert \mathbf{w} \rVert$ (4.5), and the signed distance of a point x from the surface is $r = y(\mathbf{x}) / \lVert \mathbf{w} \rVert$ (4.7).
• The constant "1" thus adds a "bias" to the model (not to be confused with bias in the statistical sense); the negative of the bias is sometimes called a threshold.
For K > 2 classes, building a K-class discriminant by combining several two-class discriminants (K − 1 one-versus-the-rest classifiers, or one classifier per pair of classes with a majority vote) leads to ambiguous regions of input space (Duda and Hart, 1973); a single K-class discriminant is preferable.
Linear models (K classes)
Several parametric functions, one per class:
$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$, for k = 1, ..., K   (4.9)
A point x is assigned to class C_k if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all j ≠ k ("the winner takes it all"). In vector notation, integrating the biases,
$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^T \widetilde{\mathbf{x}}$   (4.14)
The decision boundary between classes C_k and C_j is given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$, i.e. $(\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} + (w_{k0} - w_{j0}) = 0$ (4.10), again a (D − 1)-dimensional hyperplane; the decision regions of such a discriminant are always singly connected and convex.

A simple problem (1-D, 3 classes)
Linear classifier with 1-D inputs and 3 classes:
• input: x (augmented with the constant 1)
• parameters: one weight vector per class
• output of one class: $y_k(x)$
• outputs of all classes: the vector $\mathbf{y}(x)$
[Figure: training data, "true" boundaries between the classes, and estimated boundaries.]
A concrete sketch is given after this slide group.

The nonlinear case
Preprocessing: a nonlinear transformation of the data (to be chosen beforehand according to the application), i.e. fixed basis functions φ(x).
[Figure 4.12: the original input space (x1, x2) with two classes and two 'Gaussian' basis functions φ1(x), φ2(x) (centres shown by green crosses, contours by green circles). In the feature space (φ1, φ2), a linear decision boundary obtained by a logistic regression model corresponds to a nonlinear decision boundary in the original input space. C. Bishop, Pattern Recognition and Machine Learning, 2006]
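Going back to the simple 1-D, 3-class problem above, a sketch of the "winner takes it all" decision with inputs augmented by the constant 1; the weight values are invented for illustration:

```python
import numpy as np

# Parameters W_tilde: one column (w_k0, w_k) per class -> shape (2, K)
W = np.array([[ 2.0, 0.0, -3.0],     # biases  w_k0   (illustrative values)
              [-6.0, 1.0,  5.0]])    # weights w_k

def predict(x):
    """y_k(x) = w_k^T x_tilde with x_tilde = (1, x); decide by argmax_k y_k."""
    x_tilde = np.array([1.0, x])
    y = W.T @ x_tilde                 # outputs of all K classes
    return int(np.argmax(y)), y

for x in (0.1, 0.5, 0.9):
    print(x, predict(x))              # three contiguous decision regions on [0, 1]
```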
Logistic regression (2 classes)
Direct modelling of the posterior probability: a linear model (on the transformed inputs) plus a nonlinearity at the output,
$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$   (4.87)
with $p(C_2 \mid \boldsymbol{\phi}) = 1 - p(C_1 \mid \boldsymbol{\phi})$, where σ is the logistic function ("sigmoid"), which ensures that the outputs lie between 0 and 1:
$\sigma(\eta) = \frac{1}{1 + \exp(-\eta)}$
In the terminology of statistics this model is known as logistic regression, although it should be emphasized that it is a model for classification rather than regression. For an M-dimensional feature space φ, it has M adjustable parameters. By contrast, fitting Gaussian class-conditional densities by maximum likelihood would use 2M parameters for the means and M(M + 1)/2 for the shared covariance matrix; together with the class prior p(C1) this gives M(M + 5)/2 + 1 parameters, which grows quadratically with M. For large M there is therefore a clear advantage in working with the logistic regression model directly.

Logistic regression (K classes)
Extension similar to the linear case: a "softmax" transformation ensures that the outputs are probabilities,
$p(C_k \mid \boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$   (4.104)
where the "activations" $a_k$ are linear in the features:
$a_k = \mathbf{w}_k^T \boldsymbol{\phi}$   (4.105)
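The sigmoid and softmax outputs can be computed directly from these definitions. A small sketch; the feature vector φ and the weight values are placeholders:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """p(C_k | phi) = exp(a_k) / sum_j exp(a_j), computed stably."""
    e = np.exp(a - a.max())           # subtracting the max avoids overflow
    return e / e.sum()

phi = np.array([1.0, 0.2, -0.5])      # transformed input (bias feature included)

# Two-class model: one weight vector, p(C1 | phi) = sigmoid(w^T phi)
w = np.array([0.1, 1.5, -0.7])
print(sigmoid(w @ phi))

# K-class model: one weight vector per class, a_k = w_k^T phi
W = np.array([[ 0.1,  1.5, -0.7],
              [ 0.0, -0.3,  0.4],
              [-0.2,  0.1,  0.9]])
print(softmax(W @ phi))               # three posteriors summing to 1
```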
To do this, we will require the derivatives of yk with respect consider the use of maximum likelihood to determine the parameters {wk } of this the activations aj . These are given by model directly. To do this, we will require the derivatives of yk with respect to all of the activations aj . These are given ∂yk by = yk (Ikj − yj ) (4.106) ∂a j ∂yk = yk (Ikj − yj ) (4.106) ∂a j where Ikj are the elements of the identity matrix. Next we write down the likelihood function. This is most easily done using where Ikj are the elements identity matrix.vector t for a feature vector φ the 1-of-K coding scheme of in the which the target n n Next we write down the likelihood function. This is most easily done using belonging to class Ck is a binary vector with all elements zero except for element k, the 1-of-K scheme in which the is target tn for a feature vector φn which equalscoding one. The likelihood function then vector given by Régression logistique : apprentissage! Un ensemble de données d’apprentissage est supposé connu :! Les entrées , transformées par les fonctions des bases! Les sorties sont connues, codées en « 1-à-K » ! (« hot-one-encoded ») :! ! ! ! Classe « réelle » de l’échantillon n! ! ! ! ! ! ! Objectif : apprendre les paramètres selon un critère d’optimalité! j ∂ykk kj = yk (Ikj − yj ) (4.106) ∂y∂a k j= yk (Ikj − yj ) (4.106) where Ikj are the elements of the identity ∂aj matrix. Next we Iwrite down the likelihood function. This is most easily done using where kj are the elements of the identity matrix. where Next Icoding elements ofthe thelikelihood identity matrix. for aisfeature vectordone φn using the 1-of-K scheme in which the target vector tn This kj are wethe write down function. most easily we write thevector likelihood function. This done belonging to class Ccoding binary with all elements zero isexcept element k,usingφ k is adown for for aeasily feature vector theNext 1-of-K scheme in which the target vector tnmost n for a feature vector φ thebelonging 1-of-K scheme in which the target vector t which equals one.coding The likelihood function is then given by La vraisemblance des données est la probabilité d’obtenir les n n to class Ck is a binary vector with all elements zero except for element k, belonging to class CkThe is ales binary vector with all elements zero for element k, which étant equals one. likelihood function isetthen by except observations donné paramètres lesNgiven entrées :! N K K " "function is then given "" which equals one. The likelihood by t tnk nk p(C |φ ) = y (4.107) p(T|w1 , . . . , wK ) = k N K N K n nk " " " " N " K N " K tnk =1 = k" =1 k" =1 )= p(Ck |φntnk )ntnk ynk (4.107) p(T|w1 , . . . , wnK=1 tnk = n=1 ynk (4.107) p(T|w1 , . . . , wK ) = n=1 p(Ck |φn ) k=1 ×k=1 Kk=1 matrix of target nvariables with elements where ynk = yk (φn ), and T is an N n=1 =1 k=1 negative then gives× K matrix of target variables with elements tnk . Taking = paramètres yk (φlogarithm wherethe ynk Pour estimer les on minimisera une fonction n ), and T is an, N = y (φ ), and T is an N × K matrix of target variables with elements where y nk k n . Taking the negative logarithmde thenlagives tnk(le N K d’erreur logarithme négatif vraisemblance) :! ## tnk . Taking the negative logarithm then gives ln ynk (4.108) E(w1 , . . . , wK ) = − ln p(T|w1 , . . . , wK ) = − Ntnk# K # N K k# =1 # =− t ln ynk (4.108) E(w1 , . . . , wK ) = − ln p(T|w1 , . . . , wK )n=1 tnknk ln ynk (4.108) E(w1 , . . . , wK ) = − ln p(T|w1 , . . . 
The likelihood of the data is the probability of obtaining the observed targets given the parameters and the inputs:
$p(\mathbf{T} \mid \mathbf{w}_1, \dots, \mathbf{w}_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid \boldsymbol{\phi}_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$   (4.107)
where $y_{nk} = y_k(\boldsymbol{\phi}_n)$ and T is the N × K matrix of target variables with elements $t_{nk}$. To estimate the parameters we minimize an error function, the negative logarithm of the likelihood:
$E(\mathbf{w}_1, \dots, \mathbf{w}_K) = -\ln p(\mathbf{T} \mid \mathbf{w}_1, \dots, \mathbf{w}_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$   (4.108)
This cost function is known as the "cross-entropy loss" for the multiclass classification problem. Using the result (4.106) for the derivatives of the softmax, its gradient with respect to one of the parameter vectors $\mathbf{w}_j$ can be computed, and the error then minimized:
$\nabla_{\mathbf{w}_j} E(\mathbf{w}_1, \dots, \mathbf{w}_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \boldsymbol{\phi}_n$   (4.109)
[C. Bishop, Pattern Recognition and Machine Learning, 2006]

Application: detection of simple objects
A 20×20 window is slid over the image; the pixels of each window are given as inputs to the classifier.
[A. Mihoub, C. Wolf, G. Bailly, 2014]
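A sketch of batch gradient descent on the cross-entropy error (4.108) using the gradient (4.109); the toy data, learning rate and iteration count are invented for illustration:

```python
import numpy as np

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train_logreg(Phi, T, lr=0.005, iters=2000):
    """Multiclass logistic regression trained by batch gradient descent.

    Phi: (N, M) design matrix, T: (N, K) one-hot ('1-of-K') targets.
    Gradient of the cross-entropy: grad_{w_j} E = sum_n (y_nj - t_nj) phi_n.
    """
    N, M = Phi.shape
    K = T.shape[1]
    W = np.zeros((M, K))
    for _ in range(iters):
        Y = softmax(Phi @ W)          # y_nk = p(C_k | phi_n)
        W -= lr * Phi.T @ (Y - T)     # one step against the gradient (4.109)
    return W

# Tiny synthetic problem: 1-D input plus bias feature, 3 classes
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 90)
labels = np.digitize(x, [0.33, 0.66])           # class = region of x
Phi = np.column_stack([np.ones_like(x), x])     # add the constant '1' feature
T = np.eye(3)[labels]                           # one-hot encoding of the labels
W = train_logreg(Phi, T)
print((softmax(Phi @ W).argmax(axis=1) == labels).mean())   # training accuracy
```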
Supervised classification (overview slide repeated). Next: neural networks.

Neural networks
Logistic regression: the basis functions are fixed and limited, and have to be chosen manually for each application. What if we learned them?
The linear models for regression and classification are based on linear combinations of fixed nonlinear basis functions φ_j(x) and take the form
$y(\mathbf{x}, \mathbf{w}) = f\!\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)$   (5.1)
where f(·) is a nonlinear activation function in the case of classification and the identity in the case of regression. The goal is to make the basis functions φ_j(x) themselves depend on parameters, adjusted during training along with the coefficients {w_j}. Neural networks use basis functions that follow the same form as the output function (5.1): each basis function is itself a nonlinear function h of a linear combination of the inputs, and the coefficients of this linear combination are adaptive parameters. f and h are nonlinear "activation functions"; typically sigmoid, tanh or softmax. This leads first to the M linear combinations of the input variables $x_1, \dots, x_D$:
$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$   (5.2)
[C. Bishop, Pattern Recognition and Machine Learning, 2006]
Neural networks
The first-layer biases can be absorbed into the weights by defining an additional input $x_0$ clamped at 1, so that $a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$ (5.8); the second-layer biases are absorbed in the same way. This gives a neural network with two layers (one hidden layer, one output layer), the "multi-layer perceptron" (MLP), whose overall function is
$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)$   (5.9)
(input layer, hidden layer, output layer; the biases have been integrated). A key difference with respect to the perceptron is that the hidden units use continuous sigmoidal nonlinearities rather than step functions, so the network function is differentiable with respect to the network parameters, a property that plays a central role in training.

Some remarks
• The number of units in the hidden layer is adjustable.
• Several hidden layers are possible. We treat feed-forward networks: no cycles in the connections.
• The output functions are differentiable with respect to the network parameters.
• If the activation functions are linear, an equivalent network without a hidden layer can always be found, because the composition of successive linear transformations is itself a linear transformation.
• Neural networks can approximate any function to arbitrary precision if the number of units is sufficient.

An example
[Figure 5.4: a simple two-class classification problem on synthetic data solved by a network with two inputs, two hidden 'tanh' units and a single logistic-sigmoid output. Dashed blue lines: z = 0.5 contours of the hidden units; red line: y = 0.5 decision surface of the network; green line: optimal decision boundary computed from the distributions used to generate the data. C. Bishop, Pattern Recognition and Machine Learning, 2006]
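A sketch of the two-layer network of equation (5.9): tanh hidden units, sigmoid output, biases absorbed as constant inputs. The layer sizes and random weights are placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2):
    """y = sigma( W2 [1, h(W1 [1, x])] ): one hidden layer, as in eq. (5.9)."""
    x_tilde = np.concatenate(([1.0], x))        # x0 = 1 absorbs the first-layer biases
    z = np.tanh(W1 @ x_tilde)                   # hidden activations z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))        # z0 = 1 absorbs the output biases
    return sigmoid(W2 @ z_tilde)                # network outputs y_k

rng = np.random.default_rng(0)
D, M, K = 2, 3, 1                               # inputs, hidden units, outputs
W1 = rng.normal(size=(M, D + 1))                # first-layer weights w_ji^(1)
W2 = rng.normal(size=(K, M + 1))                # second-layer weights w_kj^(2)
print(mlp_forward(np.array([0.5, -1.0]), W1, W2))
```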
we shall see, La fonction d’erreur w) =de leading logistique to the following error outputs interpreted as est yk (x,celle kla=régression by using error backpropagation, each such evaluation takes only O(W ) steps and so function NO(W K 2 ) steps. For this reason, the use of gradient the minimum can now be found in ! ! w).training neural networks. (5.24) information forms theE(w) basis = of−practical talgorithms kn ln yk (xn ,for Apprentissage : ! descente de gradient! n=1 k=1 5.2.4 Gradient descent optimization Minimisation itérative par descente de gradient (un pas dans la The simplest approach to using gradient information is to choose the weight direction du plus grand achangement) update in (5.27) to comprise small step in the:!direction of the negative gradient, so that w(τ +1) = w(τ ) − η∇E(w(τ ) ) (5.41) where the parameter η > 0 is known as the learning rate.(Learning After each Vitesse rate)!such update, the gradient is re-evaluated for the new weight vector and the process repeated. Note that 5. NEURAL NETWORKS Possibilité de blocage un minimum local the error function is defined dans with respect to a training set, :!and so each step requires Figure 5.5 Geometrical view of the error function E(w) as E(w) thata the entire training set be processed in order to evaluate ∇E. Techniques that surface sitting over weight space. Point w is local minimum and w is the global minimum. use Atathe whole set ofattheonce are called batch methods. At each step the weight any point w , thedata local gradient error surface is given by the vector ∇E. vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent. Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor w algorithm, for reasons discussed in Bishop and Nabney (2008). w w w For batch optimization, there are more efficient methods, such as conjugate graw ∇E [C. Bishop, Pattern recognition and Machine learning, 2006]! dients and quasi-Newton methods, which are much more robust and much faster A B C 1 A 2 B C e minimum can now be found in O(W ) steps. For this reason, the use of grad 1999). Unlike gradient descent, these algorithms have the property that the error dient-based algorithm multiple eachthe time using ahasdifferent randomly ormationfunction forms the basis of atpractical algorithms for training neural always decreases eachtimes, iteration unless weight vector arrived at a networks or global starting local point, andminimum. comparing the resulting performance on an independen In order to finddescent a sufficiently optimization good minimum, it may be necessary to run a 5.2.4 Gradient ion set. gradient-based algorithm multiple times, each time using a different randomly choThere is,senhowever, version ofperformance gradient that has proved starting point,an andon-line comparing the gradient resulting ondescent an independent valiThe simplest approach to using information is to choose the we dation set. Batch toutes les données sontstep utilisées pour chaque étape! practice for: There training neural networks on large data sets (Le Cun et al., 1 date in (5.27) to comprise a small in the direction of negative is, however, an on-line version of gradient descent that hasthe proved useful gradien ! based maximum likelihood for sets a set observ in practice for on training neural networks on large data (Le of Cunindependent et al., 1989). 
Two strategies: batch vs. on-line
To find a sufficiently good minimum it may be necessary to run a gradient-based algorithm several times, each time from a different randomly chosen starting point, comparing the resulting performance on an independent validation set. There is also an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one per data point:
$E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$   (5.42)
• Batch: all the data are used for each step.
• On-line (stochastic gradient descent): the gradient is computed for a single point (one data sample) at each iteration,
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\, \nabla E_n(\mathbf{w}^{(\tau)})$   (5.43)
[C. Bishop, Pattern Recognition and Machine Learning, 2006]

Computing the gradient
Minimizing the error requires its gradient. In principle, each partial derivative could be computed by a finite difference:
$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon)$   (5.68)
with ε ≪ 1. The accuracy can be improved by making ε smaller (until numerical round-off sets in), and improved significantly by using symmetrical central differences, $\{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)\}/(2\epsilon)$, which are accurate to $O(\epsilon^2)$ at roughly twice the cost. Problem: the efficiency is very poor. For each data point, with W parameters (weights), we need about 2W perturbed forward passes, each costing O(W) operations, i.e. O(W²) in total, so the desirable O(W) scaling is lost.
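A sketch contrasting the two update rules (batch and stochastic) and the finite-difference check mentioned above; the quadratic toy error $E_n(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^T\mathbf{x}_n - t_n)^2$ and all numerical values are assumptions for illustration:

```python
import numpy as np

def grad_E_n(w, x_n, t_n):
    """Analytic gradient of E_n(w) = 1/2 (w^T x_n - t_n)^2 (toy example)."""
    return (w @ x_n - t_n) * x_n

def numerical_grad(E, w, eps=1e-6):
    """Central differences (O(eps^2) accurate, but O(W) evaluations per weight)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (E(w + d) - E(w - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.05
# Batch step: gradient of the full error E = sum_n E_n
w_batch = w - eta * sum(grad_E_n(w, x_n, t_n) for x_n, t_n in zip(X, t))
# On-line (stochastic) step: one data point at a time
w_sgd = w - eta * grad_E_n(w, X[0], t[0])

# Check the analytic gradient of E_0 against finite differences
E0 = lambda v: 0.5 * (v @ X[0] - t[0]) ** 2
print(np.allclose(grad_E_n(w, X[0], t[0]), numerical_grad(E0, w)))
```

Backpropagation, described next, recovers the analytic gradient with O(W) operations per data point instead of the O(W²) cost of the numerical scheme.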
Backpropagation of the gradient (1)

We consider the problem of evaluating ∇E_n(w) for one term E_n of the error function; the result can be used directly for sequential optimization, or accumulated over the training set in the case of batch methods. It is possible to compute this gradient directly.

Very simple example: a linear activation function at the output and a sum-of-squares error. Consider a simple linear model in which the outputs y_k are linear combinations of the input variables x_i:

y_k = Σ_i w_ki x_i

together with an error function which, for a particular input pattern n, takes the form

E_n = ½ Σ_k (y_nk − t_nk)²,   where y_nk = y_k(x_n, w)

For the weights of the output layer, the partial derivatives can be written down directly:

∂E_n/∂w_ji = (y_nj − t_nj) x_ni

This can be interpreted as a « local » computation involving the product of an « error signal » y_nj − t_nj associated with the output end of the link w_ji and the variable x_ni associated with the input end of the link. A similar formula arises for the logistic sigmoid activation with the cross-entropy error function, and for the softmax activation with its matching cross-entropy error. We now see how this result extends to the more complex setting of multilayer feed-forward networks.

[C. Bishop, Pattern recognition and Machine learning, 2006]
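As a quick check, the output-layer gradient above is just the outer product of the error signal (y_n − t_n) and the input x_n. A tiny NumPy sketch with illustrative shapes, compared against a central finite difference:

import numpy as np

rng = np.random.default_rng(2)
x_n = rng.normal(size=4)            # one input pattern x_n (4 features)
t_n = rng.normal(size=3)            # its target vector t_n (3 outputs)
W = rng.normal(size=(3, 4))         # weights w_ki of the linear model

y_n = W @ x_n                       # y_k = sum_i w_ki x_i

# dE_n/dw_ji = (y_nj - t_nj) * x_ni  <=>  outer product of error and input
grad_W = np.outer(y_n - t_n, x_n)

# Compare with a central finite difference on one entry, e.g. w_{1,2}.
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
E = lambda W_: 0.5 * np.sum((W_ @ x_n - t_n) ** 2)
print(grad_W[1, 2], (E(Wp) - E(Wm)) / (2 * eps))   # the two values agree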
What is still missing are the derivatives with respect to the parameters of the hidden layers. Recall the computation performed by a hidden unit: in a general feed-forward network, each unit computes a weighted sum of its inputs,

a_j = Σ_i w_ji z_i

where z_i is the activation of a unit (or an input) that sends a connection to unit j, and w_ji is the weight associated with that connection (a bias can be included as an extra unit with activation fixed at +1). This sum is transformed by a nonlinear activation function h(·) to give the activation z_j of unit j:

z_j = h(a_j)

For each pattern in the training set, we suppose that the input vector has been presented to the network and that the activations of all hidden and output units have been computed by successive application of these two equations; this process is called forward propagation, because it can be regarded as a forward flow of information through the network. (To keep the notation uncluttered, the subscript n is omitted from the network variables.)

To simplify, we first compute the derivatives with respect to the summed inputs a_j. Since E_n depends on the weight w_ji only via a_j, the chain rule for partial derivatives gives

∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji)

Introducing the notation δ_j ≡ ∂E_n/∂a_j (the δ's are often referred to as « errors »), and noting that ∂a_j/∂w_ji = z_i, we obtain

∂E_n/∂w_ji = δ_j z_i

so the required derivative is obtained simply by multiplying the value of δ at the output end of the weight by the value of z at the input end (z = 1 in the case of a bias). It only remains to compute δ_j for every hidden and output unit of the network.

For the output layer:
δ_k = y_k − t_k

(provided we are using the canonical link as the output-unit activation function).

Backpropagation of the gradient (2)

For the hidden units, each derivative can be computed from the derivatives of the following layer. Applying the chain rule through the units k to which unit j sends connections yields the backpropagation formula

δ_j = h′(a_j) Σ_k w_kj δ_k

Information flows forwards through the network to compute the activations z_i and a_j, then the errors δ are propagated backwards, from the outputs towards the inputs, to obtain all the derivatives ∂E_n/∂w_ji = δ_j z_i (Bishop, Fig. 5.7). A small worked sketch combining these formulas with early stopping is given below.

[C. Bishop, Pattern recognition and Machine learning, 2006]

Back to generalization and model selection

Complexity of the prediction model:
• number of hidden layers
• number of hidden units per layer

Example (Bishop, Fig. 5.9): two-layer networks with one hidden layer of M units, trained on 10 data points drawn from the sinusoidal data set (generated from sin(2πx)), by minimizing a sum-of-squares error with a scaled conjugate-gradient algorithm; M = 1 underfits while M = 10 overfits.

Regularization: add a penalty on the weights to the error function,

E~(w) = E(w) + (λ/2) wᵀw

a regularizer also known as weight decay.

Early stopping

Training is iterative (one pass over the training set is called an « epoch »). Overfitting can be reduced by stopping training when the error on a validation set starts to increase.

Illustration (Bishop, Fig. 5.12): training-set error and validation-set error during a typical training session, as a function of the iteration step, for the sinusoidal data set; the training-set error keeps decreasing, while the validation-set error eventually starts to rise. The best generalization performance is obtained by stopping training at the point where the validation-set error reaches its minimum.

[C. Bishop, Pattern recognition and Machine learning, 2006]
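The two backpropagation equations above (δ_k = y_k − t_k at the outputs, δ_j = h′(a_j) Σ_k w_kj δ_k for the hidden units) are enough to train a small network. A minimal sketch, assuming a two-layer network with tanh hidden units, linear outputs, batch gradient descent, and a held-out validation set for early stopping; the toy data and all hyper-parameters are illustrative.

import numpy as np

rng = np.random.default_rng(3)

# Sinusoidal toy data in the spirit of the running example, split into a
# training set and a validation set used for early stopping.
def make_data(n):
    x = rng.uniform(0, 1, size=(n, 1))
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=(n, 1))
    return x, t

x_tr, t_tr = make_data(50)
x_va, t_va = make_data(50)

M = 3                                          # number of hidden units
W1 = rng.normal(size=(M, 2))                   # hidden weights (incl. bias)
W2 = rng.normal(size=(1, M + 1))               # output weights (incl. bias)

def forward(x, W1, W2):
    xb = np.hstack([x, np.ones((len(x), 1))])  # append bias input z0 = 1
    a = xb @ W1.T                              # a_j = sum_i w_ji z_i
    z = np.tanh(a)                             # z_j = h(a_j)
    zb = np.hstack([z, np.ones((len(x), 1))])
    y = zb @ W2.T                              # linear output units
    return xb, z, zb, y

def sse(y, t):
    return 0.5 * np.sum((y - t) ** 2)

eta, best = 0.01, (np.inf, None)
for epoch in range(2000):
    xb, z, zb, y = forward(x_tr, W1, W2)
    # Backpropagation of the errors delta:
    d_out = y - t_tr                           # delta_k = y_k - t_k
    d_hid = (1 - z ** 2) * (d_out @ W2[:, :M]) # delta_j = h'(a_j) sum_k w_kj delta_k
    gW2 = d_out.T @ zb                         # dE/dw_ji = delta_j * z_i
    gW1 = d_hid.T @ xb
    W2 -= eta * gW2
    W1 -= eta * gW1
    # Early stopping: keep the weights with the lowest validation error.
    e_va = sse(forward(x_va, W1, W2)[3], t_va)
    if e_va < best[0]:
        best = (e_va, (W1.copy(), W2.copy()))

print("best validation error:", best[0])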
Application: gesture recognition

Classification of gestures from the articulated pose of a person (joint positions estimated with a Kinect sensor). Pipeline: consecutive poses (joint positions) → 12 features invariant to the viewpoint, to the person, etc. → neural network → gesture class. Training data are gathered by recording pose sequences and labelling them on the fly (a key press indicates the gesture class), providing ground-truth examples for supervised learning; a rough sketch of this pipeline is given after the overview below.

Supervised classification (overview)

• k-NN (k nearest neighbours)
• Linear generative (Bayesian) classifiers
• Linear discriminative classifiers
• Neural networks
• Decision trees and random forests
• SVM (Support Vector Machines)

Illustration (Bishop, Fig. 1.27): class-conditional densities p(x|C_1) and p(x|C_2) for two classes with a single input variable x, together with the corresponding posterior probabilities p(C_k|x). The left-hand mode of p(x|C_1) has no effect on the posteriors, and the decision boundary minimizing the misclassification rate is where the posteriors cross. Generative approaches model the joint distribution p(x, C_k), which can be wasteful when only the posteriors are needed; discriminative approaches model the posteriors p(C_k|x) directly; an even simpler option is to learn a discriminant function f(x) that maps each x directly onto a class label, but then the posterior probabilities are no longer available, even though they are useful, for instance to minimize risk when the elements of the loss matrix are revised from time to time (as might occur in a financial application).

[C. Bishop, Pattern recognition and Machine learning, 2006]
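Returning to the gesture-recognition pipeline above, a rough sketch of the idea (window of joint positions → roughly invariant features → neural network). The joint indices, the normalization and the classifier settings are purely hypothetical assumptions for illustration, not the features actually used in the application.

import numpy as np
from sklearn.neural_network import MLPClassifier

def pose_features(poses):
    """poses: (T, J, 3) array of T consecutive poses with J 3D joints.
    Returns one feature vector, roughly invariant to position and body size:
    centre on an (assumed) torso joint, divide by an (assumed) reference limb."""
    torso = poses[:, 0:1, :]                          # assume joint 0 = torso
    centred = poses - torso                           # translation invariance
    scale = np.linalg.norm(centred[:, 1, :], axis=-1, keepdims=True)  # assume joint 1 = neck
    scale = np.maximum(scale, 1e-6)[..., None]
    normalized = centred / scale                      # rough size invariance
    return normalized.reshape(-1)                     # concatenate the window

# Toy data: 200 random "gesture windows" of 10 poses x 15 joints, 4 classes.
rng = np.random.default_rng(4)
X = np.stack([pose_features(rng.normal(size=(10, 15, 3))) for _ in range(200)])
y = rng.integers(0, 4, size=200)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
print(clf.predict(X[:5]))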
Basics: information theory

The entropy of a random variable x measures its information content. The information h(x) received when observing a value x is

h(x) = − log₂ p(x)

so that low-probability (surprising) events carry high information content, and the information is positive or zero. This form can be derived from two requirements on independent variables: if x and y are unrelated, the information gained from observing both should be the sum of the information gained from each, h(x, y) = h(x) + h(y), while p(x, y) = p(x)p(y); these two relationships force h to be the logarithm of p(x). With logarithms to base 2, the units of h(x) are bits (« binary digits »).

If a sender wishes to transmit the value of a random variable to a receiver, the average amount of information transmitted is the expectation of h(x) with respect to p(x); this important quantity is called the entropy of the random variable x:

H[x] = − Σ_x p(x) log₂ p(x)

(with the convention p log p = 0 whenever p(x) = 0).

Illustration (Bishop, Fig. 1.30): histograms of two probability distributions over 30 bins; the sharply peaked distribution has entropy H = 1.77, the broader one H = 3.09.
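A short helper computing the entropy of a discrete distribution, illustrating that broad distributions have higher entropy than peaked ones (toy values, convention 0 · log 0 = 0):

import numpy as np

def entropy(p):
    """Shannon entropy H[x] = -sum_x p(x) log2 p(x) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                       # normalize, in case counts are given
    nz = p[p > 0]                         # convention: 0 * log 0 = 0
    return -np.sum(nz * np.log2(nz))

# A peaked distribution has low entropy, a broad one has high entropy,
# and the uniform distribution over K values reaches the maximum log2(K).
print(entropy([0.9, 0.05, 0.03, 0.02]))      # small
print(entropy([0.25, 0.25, 0.25, 0.25]))     # = 2.0 bits = log2(4)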
The broader the distribution, the higher the entropy; the largest entropy arises from a uniform distribution.

[C. Bishop, Pattern recognition and Machine learning, 2006]

Decision trees

Extremely fast classification!
• At each node of a tree, one component of the feature vector is examined
• The traversal goes left or right according to a comparison with a threshold (a different one for each node)
• Each leaf contains a decision (a class)
• Parameters:
  – the choice of the feature component for each node
  – the threshold for each node
  – the decision stored in each leaf

Random forests

A forest is an ensemble of T trees; each tree consists of split nodes and leaf nodes, and different trees may send the same input down different paths [J. Shotton et al., 2011, Fig. 4]. Randomized decision trees and forests have proven to be fast and effective multi-class classifiers for many tasks.

• Several trees are trained on different random subsets of the data, each tree being learned independently
• Each leaf stores a distribution over the class labels
• At test time, the distributions reached in the individual trees are averaged (see the sketch below):

P(c | I, x) = (1/T) Σ_{t=1..T} P_t(c | I, x)
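A minimal sketch of prediction with such a (pre-trained) forest: each tree routes the input to a leaf storing a class histogram, and the forest averages these distributions. The nested-dictionary tree representation is an illustrative assumption, not the data structure of the cited paper.

import numpy as np

def tree_predict(node, x):
    """Follow split nodes until a leaf is reached; return its distribution."""
    while 'dist' not in node:
        branch = 'left' if x[node['feature']] < node['threshold'] else 'right'
        node = node[branch]
    return np.asarray(node['dist'])

def forest_predict(trees, x):
    """P(c|x) = (1/T) sum_t P_t(c|x): average the per-tree leaf distributions."""
    return np.mean([tree_predict(t, x) for t in trees], axis=0)

# Two toy trees over a 2-class problem with 2 features.
t1 = {'feature': 0, 'threshold': 0.5,
      'left':  {'dist': [0.9, 0.1]},
      'right': {'dist': [0.2, 0.8]}}
t2 = {'feature': 1, 'threshold': 0.0,
      'left':  {'dist': [0.7, 0.3]},
      'right': {'dist': [0.4, 0.6]}}

print(forest_predict([t1, t2], np.array([0.3, 1.2])))   # -> [0.65, 0.35]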
Learning the parameters

The gradient of the error is not available for a decision tree, so the parameters are learned greedily, level by level. Each tree is trained independently, on a different set of randomly synthesized images; a random subset of 2000 example pixels is chosen from each image to ensure a roughly even distribution across body parts [J. Shotton et al., 2011]. For a given node, the training algorithm is:

1. For the current level, randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and threshold τ)
2. Partition the set of training examples Q = {(I, x)} into two parts for each candidate φ:
   Q_l(φ) = { (I, x) | f_θ(I, x) < τ }
   Q_r(φ) = Q \ Q_l(φ)
3. Choose the candidate parameters maximizing the gain in information (measured with the entropy):
   φ* = argmax_φ G(φ),   with
   G(φ) = H(Q) − Σ_{s ∈ {l,r}} (|Q_s(φ)| / |Q|) H(Q_s(φ))
   where the Shannon entropy H(Q) is computed on the normalized histogram of body-part labels l_I(x) over all examples in Q
4. Recurse on Q_l(φ*) and Q_r(φ*)

To classify a pixel x in image I, one starts at the root of each tree t and repeatedly evaluates the split feature, branching left or right according to the comparison with the threshold τ, until a leaf is reached; the leaf stores a learned distribution P_t(c | I, x) over body-part labels. Offset pixels that fall on the background or outside the image bounds are given a large positive constant depth value.

[J. Shotton et al., 2011]
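A sketch of step 3, the greedy choice of a threshold by maximizing the information gain, for a single scalar feature and randomly proposed candidate thresholds; the data and function names are illustrative.

import numpy as np

def H(labels):
    """Shannon entropy of the normalized label histogram of a set of examples."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(labels, feature_values, tau):
    """G = H(Q) - sum_s |Q_s|/|Q| * H(Q_s) for the split 'feature < tau'."""
    left = feature_values < tau
    right = ~left
    if left.all() or right.all():
        return 0.0
    return H(labels) - (left.mean() * H(labels[left]) + right.mean() * H(labels[right]))

def best_split(labels, feature_values, n_candidates=20, rng=np.random.default_rng(0)):
    """Greedy choice of the threshold maximizing the information gain,
    among randomly proposed candidates (a single feature, for brevity)."""
    candidates = rng.choice(feature_values, size=n_candidates)
    gains = [gain(labels, feature_values, tau) for tau in candidates]
    return candidates[int(np.argmax(gains))], max(gains)

# Toy example: one scalar feature, two classes that separate around 0.
rng = np.random.default_rng(5)
f = np.concatenate([rng.normal(-1, 0.3, 100), rng.normal(+1, 0.3, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
print(best_split(y, f))   # a threshold between the two clusters, gain near 1 bit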
Application: human pose estimation (MS Kinect)

[J. Shotton et al., 2011] (Microsoft Research Cambridge & Xbox Incubation). Overview: from a single input depth image, a per-pixel classification into body parts, from which 3D joint proposals are derived.

(For the gesture-recognition application above, training data are collected interactively: the user first indicates which hand performs the gesture, « R » right, « L » left or « B » both, then holds a key while performing it: 1 Stop, 2 Scroll left-to-right, 3 Scroll right-to-left, 4 Scroll up-to-down, 5 Scroll down-to-up, 6 Phone pick up, 7 Phone hang up, 8 Hi. Every frame recorded while the key is pressed is stored and automatically labelled, giving ground-truth vectors for supervised training; since all recorded data are gestures, only the general gesture classifier, not the gesture/no-gesture filter network, can be updated in this way.)

Training data
• 31 human body parts (with separate left and right parts so that the classifier can disambiguate the two sides of the body)
• 1 million synthetic depth images produced by a rendering pipeline driven by human motion capture
• 2000 pixels sampled from each image = 2,000,000,000 training vectors
• 3 trees, each 20 levels deep
• each leaf stores a histogram over the 31 parts: 2²⁰ leaves × 31 = 32,505,856 values
• 2²⁰ − 1 = 1,048,575 split nodes (thresholds)

Depth image features

Simple depth-comparison features are used: at a given pixel x,

f_θ(I, x) = d_I( x + u / d_I(x) ) − d_I( x + v / d_I(x) )

where
  x … position of the pixel
  u, v … offsets (the parameters θ = (u, v))
  d_I … the depth in image I

The normalization of the offsets by 1/d_I(x) makes the features depth invariant. Illustration (Shotton et al., Fig. 3): two example features give a large depth-difference response at one image location and a much smaller response at another.

[J. Shotton et al., 2011]
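A sketch of the depth-comparison feature on a toy depth image, with offsets scaled by 1/d_I(x) and out-of-image probes set to a large constant depth, as in the paper's convention; the image size, offsets and depth values are illustrative.

import numpy as np

def depth_feature(d, x, u, v, background=1e6):
    """f_theta(I, x) = d(x + u/d(x)) - d(x + v/d(x)).
    d: 2D depth image; x, u, v: (row, col) pairs. Offsets are scaled by
    1/d(x) so the feature is invariant to the distance of the body from the
    camera; probes falling outside the image get a large 'background' depth."""
    def probe(offset):
        p = np.round(x + np.asarray(offset) / d[tuple(x)]).astype(int)
        inside = (0 <= p[0] < d.shape[0]) and (0 <= p[1] < d.shape[1])
        return d[tuple(p)] if inside else background
    return probe(u) - probe(v)

# Toy depth image: a "person" at 2 m in front of a background at 8 m.
d = np.full((120, 160), 8.0)
d[30:100, 60:100] = 2.0

x = np.array([60, 80])                    # a pixel on the person
# One probe lands on the body, the other on the background: large response.
print(depth_feature(d, x, u=(0.0, 20.0), v=(0.0, -60.0)))   # -> -6.0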
Depth invariance: at a given point on the body, a fixed world-space offset is probed whether the pixel is close to or far from the camera, so the features are also invariant to 3D translation (modulo perspective effects). They are very cheap, requiring only a few pixel reads and a handful of arithmetic operations per feature, which makes per-pixel classification extremely fast.

Supervised classification (recap): k-NN, linear generative (Bayesian) classifiers, linear discriminative classifiers, neural networks, decision trees and random forests, and finally SVM.
SVM (« Support Vector Machines »)

The starting hypothesis is that the data are linearly separable (this hypothesis will later be partially relaxed). For a set of linearly separable points, there exist infinitely many separating hyperplanes. [Wikipedia]

SVM: the margin

Among all separating hyperplanes, we seek the one with maximal margin, i.e. the largest distance between the hyperplane and the closest training samples of the two classes. Objective: reduce the error on unseen data.

SVM: the « kernel trick »

Some data are not linearly separable. The trick: project the data into a (higher-dimensional) space in which they become linearly separable, and find the maximum-margin hyperplane there.

SVM: applications

SVMs have been used for nearly every application in image analysis and computer vision. A few applications developed at LIRIS:
• recognition of individual actions [Wolf et Taylor, 2010]
• recognition of collective actions [Baccouche/Mamalet/Wolf/Garcia/Baskurt, 2012]
• text recognition [Wolf et Jolion, 2003]

Popularity of the classifiers

Google N-gram viewer, query on 24.7.2014; Google fight (24.7.2014).
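To illustrate the kernel trick, a small scikit-learn example on a toy data set that is not linearly separable in the plane: a linear SVM performs poorly, while an RBF-kernel SVM separates the classes (the kernel implicitly maps the points into a space where they become linearly separable). The data and parameters are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)

# Toy data: one class inside a disc, the other on a surrounding ring.
n = 200
r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
a = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.c_[r * np.cos(a), r * np.sin(a)]
y = np.concatenate([np.zeros(n), np.ones(n)])

# Compare a linear SVM with an RBF-kernel SVM (the "kernel trick").
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
    print(kernel, "number of support vectors per class:", clf.n_support_)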