
Regression: Linear Regression with One and Multiple Variables

Regression
 Linear Regression with one variable
◦ Model representation
◦ Cost function
◦ Gradient descent
 Linear Regression with multiple variables
◦ Multiple features
◦ Feature scaling
◦ Learning rate
Model Representation
 Consider the house pricing example:
◦ Supervised learning: the “right answer” is given
for each example in the data.
◦ Regression problem: Predict real-valued output.
Model Representation
 Given this data set, predict the price of a house
(example house of size 1250 square feet).
 To do this estimation, fit a model: it may be a
straight line.
 Based on the fitted line, the price of the
1250-square-foot house is:
Model Representation
 In supervised learning, there is a data set
called a training set.
 For the example of the housing prices:

   Size in feet2 (x)    Price ($) in 1000's (y)
   2104                 460
   1416                 232
   1534                 315
   852                  178
   …                    …

 Goal of ML: learn from this data how to
predict the prices of houses.
Mathematical Notation
 m = Number of training examples
 x’s = “input” variable / features
 y’s = “output” variable / “target” variable
 X: the space of input values.
 Y: the space of output values.
 (x,y): one training example
 (x(i),y(i)): the ith training example.
 A single row in this table corresponds to a
single training example.
 Example: in the above table: y(3) = 315.
How does the supervised learning
algorithm work?
 We “feed” the training set to our learning
algorithm.
 Job of the learning algorithm: to output a
function denoted h (stands for hypothesis).
 Job of the hypothesis in the example of
house pricing: it is a function that takes as
input the size of a house (the value of x) and
outputs the estimated value of the price y.
 So h is a function that maps from x's to y's.
How does the supervised learning
algorithm work?
 More formally: given a training set, learn a
function h: X → Y so that h(x) is a "good"
predictor for the corresponding value of y.
Why linear model?
 We will first fit linear functions, and we will
build on this to eventually have more complex
models (non-linear models) and more complex
learning algorithms.
Linear model hypothesis
 Try to fit a straight line to the training set
using:
   hθ(x) = θ0 + θ1x
 Short-hand: we use h(x) instead of hθ(x).
 θi's: parameters.
 One feature ⇒ one x variable ⇒ linear
regression with one variable, or univariate
linear regression.
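As a minimal sketch of the hypothesis hθ(x) = θ0 + θ1x (function and parameter names here are illustrative, not from the lecture):

```python
# Minimal sketch of the univariate hypothesis h_theta(x) = theta0 + theta1 * x.
# Function and parameter names are illustrative choices.

def h(theta0, theta1, x):
    """Predict y for a given x with a straight-line model."""
    return theta0 + theta1 * x

# For instance, with theta0 = 1 and theta1 = 0.5, the input x = 2 is
# mapped to 1 + 0.5 * 2 = 2.0.
prediction = h(1.0, 0.5, 2.0)
```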
Cost function
 Different choices of the parameters θ0 and θ1
→ different hypothesis functions
hθ(x) = θ0 + θ1x, for example:
   hθ(x) = 1.5
   hθ(x) = 0.5x
   hθ(x) = 1 + 0.5x
Cost function
 Regression: choose θ0 and θ1 so that hθ(x) is
close to y for the training examples (x, y).
 Cost function (squared error function):
   J(θ0, θ1) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
 The problem becomes:
   minimize over θ0, θ1:  J(θ0, θ1)
 Ideally: J(θ0, θ1) = 0
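The squared-error cost can be transcribed directly into code. A minimal sketch; the function name and the tiny data set (three points lying on y = x, as in the worked example on the following slides) are my own choices:

```python
# Squared-error cost J(theta0, theta1) = (1/(2m)) * sum_i (h(x_i) - y_i)^2,
# transcribed from the formula above. Names and data are illustrative.

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

# Three training points lying exactly on y = x.
xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
perfect = cost(0.0, 1.0, xs, ys)   # a perfect fit gives J = 0
```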
Cost function
 Training set example with one parameter 𝜃1
(we take here 𝜃0 = 0 to simplify the study)
 If we choose θ1 = 1 ⇒ J(θ1) = 0 (right figure)
Cost function
 Training set example with one parameter θ1 (we take here θ0 = 0 to simplify
the study)
 If we choose θ1 = 0.5:
   J(θ1) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
         = (1/(2×3)) Σ_{i=1}^3 (θ1x^(i) − y^(i))²
         = (1/6) ((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²) ≈ 0.58
Cost function
 Plot J(θ1) vs θ1: ➔ take different values of θ1
◦ θ1 = 1 ➔ hθ(x) = θ1x = x
◦ J(θ1) = J(1) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
        = (1/(2×3)) ((1 − 1)² + (2 − 2)² + (3 − 3)²) = 0
Cost function
◦ θ1 = 0.5 ➔ hθ(x) = 0.5x
◦ J(θ1) = J(0.5) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
        = (1/(2×3)) ((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²) ≈ 0.58
Cost function
◦ θ1 = 0 ➔ hθ(x) = 0
◦ J(θ1) = J(0) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
        = (1/(2×3)) ((0 − 1)² + (0 − 2)² + (0 − 3)²) ≈ 2.33
Cost function
◦ θ1 = 1.5 ➔ hθ(x) = 1.5x
◦ J(θ1) = J(1.5) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
        = (1/(2×3)) ((1.5 − 1)² + (3 − 2)² + (4.5 − 3)²) ≈ 0.58
Cost function
 Plot J(θ1) vs θ1:
 Form of the curve: bowl shape.
 Looking at the curve, the value of θ1 that
minimizes J(θ1) is: 1
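The plot of J(θ1) can be reproduced numerically by sweeping θ1. A sketch; the grid of candidate values is my own choice:

```python
# Numerically reproducing the J(theta1) plot: evaluate the cost at several
# theta1 values (theta0 fixed at 0) and take the minimizer.

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]

def J(theta1):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

grid = [0.0, 0.5, 1.0, 1.5, 2.0]
best = min(grid, key=J)   # bottom of the bowl
```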
Cost function
 If the 2 parameters are different from zero,
the cost function should be drawn in 3D, but
it will still have the bowl shape:
Cost function
 To make it easier, we can instead draw the
contour plots:
◦ Axis: 𝜃0 and 𝜃1
◦ Each of the ellipses shows a set of points that
takes on the same value for 𝐽 𝜃0 , 𝜃1 .
◦ The minimum is at the middle of these concentric
ellipses.
Cost function
 Use an algorithm for automatically finding
the values of θ0 and θ1 that minimize the
cost function J:
   J(θ0, θ1) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
Gradient descent
 Algorithm used to minimize the cost function.
 Context:
◦ Have some function J(θ0, θ1) (in a more general
form: J(θ0, θ1, …, θn))
◦ Goal: minimize over θ0, θ1: J(θ0, θ1) (in a more
general form: minimize over θ0, θ1, …, θn: J(θ0, θ1, …, θn))
 Algorithm:
◦ Initialize θ0, θ1 (in general we take θ0 = 0 and θ1 = 0)
◦ Keep changing θ0, θ1 in small steps to reduce
J(θ0, θ1) until we end up at a local minimum.
Gradient descent
 Example:
◦ 1st initialization point:
◦ Another initialization point:
Gradient descent formulation
 Simultaneously update θ0 and θ1 such that:
   θj := θj − α ∂/∂θj J(θ0, θ1)   (j = 0 and j = 1)
 Repeat until convergence.
 α: learning rate.
Gradient descent: implementation note
 Simultaneous update ⇒
   temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1)
   temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1)
   θ0 := temp0
   θ1 := temp1
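The temp0/temp1 pattern can be sketched as one update step. The two derivative arguments below are placeholders standing in for ∂/∂θj J(θ0, θ1); the toy function used in the check is my own:

```python
# One simultaneous-update step: both temporaries are computed from the OLD
# (theta0, theta1) before either parameter is overwritten.

def step(theta0, theta1, alpha, dJ0, dJ1):
    temp0 = theta0 - alpha * dJ0(theta0, theta1)
    temp1 = theta1 - alpha * dJ1(theta0, theta1)
    return temp0, temp1   # assign only after both temps are computed

# Toy check with J(a, b) = a^2 + b^2, whose partials are 2a and 2b.
t0, t1 = step(1.0, 2.0, 0.1, lambda a, b: 2 * a, lambda a, b: 2 * b)
```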
Gradient descent formulation
 Example (we take θ0 = 0 for simplicity), with
this starting point:
   θ1 := θ1 − α d/dθ1 J(θ1)
 The tangent at this point has a positive
slope ⇒ θ1 will decrease (the correct behavior).
Gradient descent formulation
 Example: another starting point:
 The tangent at this point has a negative
slope ⇒ θ1 will increase:
   θ1 := θ1 − α d/dθ1 J(θ1)
Gradient descent formulation
 If α is too small, gradient descent can be
slow.
 If α is too large, gradient descent can
overshoot the minimum. It may fail to
converge, or even diverge.
Gradient descent formulation
 Suppose 𝜃1 is at a local optimum of 𝐽 𝜃1 .
What will one step of gradient descent do?
 It will remain unchanged since the slope of
the tangent at this point is zero.
Gradient descent formulation
 Gradient descent can converge to a local
minimum, even with the learning rate α fixed.
 As we approach a local minimum, gradient
descent will automatically take smaller steps
⇒ no need to decrease α over time.
 Gradient descent will automatically take
smaller steps because the slopes become
less steep.
Gradient descent for linear regression
Gradient descent algorithm:
   Repeat until convergence {
      θj := θj − α ∂/∂θj J(θ0, θ1)   (j = 0 and j = 1)
   }
 Linear regression model:
   hθ(x) = θ0 + θ1x
   J(θ0, θ1) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
Gradient descent for linear regression ⇒ calculate the partial
derivatives:
   ∂/∂θj J(θ0, θ1) = ∂/∂θj (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
                   = ∂/∂θj (1/(2m)) Σ_{i=1}^m (θ0 + θ1x^(i) − y^(i))²
 j = 0: ∂/∂θ0 J(θ0, θ1) = (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))
 j = 1: ∂/∂θ1 J(θ0, θ1) = (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) ∙ x^(i)
Gradient descent for linear regression
 Thus, the gradient descent algorithm becomes:
 Repeat until convergence {
   temp0 := θ0 − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))
   temp1 := θ1 − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) ∙ x^(i)
   θ0 := temp0
   θ1 := temp1
 }
 Update θ0 and θ1 simultaneously.
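The full algorithm can be sketched end to end. The learning rate, iteration count, and toy data below are my own choices:

```python
# Batch gradient descent for h_theta(x) = theta0 + theta1 * x with
# simultaneous updates, following the algorithm above.

def gradient_descent(xs, ys, alpha=0.1, iters=2000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        temp0 = theta0 - alpha * sum(errs) / m
        temp1 = theta1 - alpha * sum(e * x for e, x in zip(errs, xs)) / m
        theta0, theta1 = temp0, temp1   # simultaneous update
    return theta0, theta1

# On points lying on y = x, the fit should approach theta0 = 0, theta1 = 1.
theta0, theta1 = gradient_descent([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```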
Gradient descent for linear regression
 One of the issues with gradient descent: it can be
susceptible to local optima.
 But the cost function for linear regression is always
going to be a bowl shaped function ⇒ a convex
function ⇒ this function doesn't have any local optima
except for the one global optimum.
Illustration
◦ Initialization: θ0 = 900 and θ1 = −0.1
◦ One step of gradient descent:
◦ At convergence:
Question 1
 Which of the following are true statements?
Select all that apply.
a. To make gradient descent converge, we must
slowly decrease 𝛼 over time.
b. Gradient descent is guaranteed to find the global
minimum for any function 𝐽 𝜃0 , 𝜃1 .
c. Gradient descent can converge even if α is kept
fixed. (But α cannot be too large, or else it may fail
to converge.)
d. For the specific choice of cost function J(θ0, θ1)
used in linear regression, there are no local optima
(other than the global optimum).
Multiple features
 Instead of having only one feature to predict
the output, we can have multiple features.
 Example:
Multiple features: notation
 n: number of features (above example: n = 4)
 x^(i): input (features) of the ith training
example. It's a vector ∈ ℝⁿ
(above example: x^(2) = [1416; 3; 2; 40]).
 x_j^(i): value of feature j in the ith training
example (above example: x_3^(2) = 2)
Multiple features: notation
 In multivariate linear regression, the
hypothesis becomes:
   hθ(x) = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn
 Example:
   hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 − 2x4
Multiple features: notation
 For convenience of notation, define x0 = 1 ⇒
 We can work with vectors and matrices:
   x = [x0; x1; ⋯; xn] ∈ ℝ^(n+1)  and  θ = [θ0; θ1; ⋯; θn] ∈ ℝ^(n+1)
   ⇒ hθ(x) = θᵀx = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn
Gradient descent for multiple variables
 Cost function using the vector θ:
   J(θ) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
 New gradient descent algorithm (n ≥ 1):
   Repeat {
      θj := θj − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) ∙ x_j^(i)
      Simultaneously update θj for j = 0, ⋯, n
   }
 For n = 1, we will have the same equations for
θ0 and θ1 as before because x_0^(i) = 1.
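With the θᵀx notation, the update can be written in vectorized form. A NumPy sketch; the data and hyperparameters below are my own illustrative choices:

```python
# Vectorized multivariate gradient descent: the whole gradient
# (1/m) * X^T (X theta - y) is computed at once, which updates every
# theta_j simultaneously. X already contains the x0 = 1 column.
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=3000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # all partial derivatives at once
        theta = theta - alpha * grad
    return theta

# Noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 1.0]])
y = X @ np.array([1.0, 2.0, 3.0])
theta = gradient_descent(X, y)
```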
Feature scaling
 Feature scaling: make features on similar scale
◦ then gradient descent can converge more quickly.
 Example:
◦ x1 = size (0 – 2000 feet2) ;
◦ x2 = number of bedrooms (0-5).
◦ Ignore θ0 for simplicity. J(θ) is then a function of θ1 and θ2.
◦ The contours of the cost function J(θ) can take on a very
skewed elliptical shape: with the 2000-to-5 ratio, they can be
even taller and skinnier ellipses.
Feature scaling
 Without feature scaling:
Running gradient descent on this cost function can oscillate
back and forth and take a long time before finally finding its
way to the global minimum. Gradient descent will look like this:
Feature scaling: general form
 Get every feature into approximately a
−1 ≤ xi ≤ 1 range.
 x0 is always equal to 1.
 The limits don't need to be exactly −1 and 1,
but close to these values, for instance:
◦ −2 ≤ xi ≤ 3: acceptable
◦ −100 ≤ xi ≤ 100: too big
◦ −0.0001 ≤ xi ≤ 0.0001: too small
Feature scaling
 Feature scaling:
   x1 = size (feet²) / 2000 and x2 = number of bedrooms / 5
   ⇒ 0 ≤ x1 ≤ 1 and 0 ≤ x2 ≤ 1
 The contours will become: (circular form)
Mean normalization
 Replace xi with xi − μi to make features have
approximately zero mean
◦ μi: average value of xi in the training set
◦ Do not apply to x0 = 1
 Mean normalization with feature scaling:
◦ Replace xi with (xi − μi) / si
◦ where si is the range of values of the feature:
 si = max − min,
 or si is the standard deviation.
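Mean normalization with feature scaling can be sketched in a few lines, using s = max − min (the first option above). The function name and sample data are my own:

```python
# Mean normalization combined with feature scaling: x := (x - mu) / s.

def normalize(values):
    mu = sum(values) / len(values)
    s = max(values) - min(values)   # range; the standard deviation also works
    return [(v - mu) / s for v in values]

# House sizes spanning 0..2000 land in roughly [-0.5, 0.5] after normalizing.
scaled = normalize([0.0, 500.0, 1000.0, 1500.0, 2000.0])
```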
Mean normalization
 Previous example:
◦ x1 = size (0 – 2000 feet²) ; x2 = number of bedrooms (0–5).
◦ x1 = (size − 1000) / 2000 and x2 = (number of bedrooms − 2.5) / 5
◦ ⇒ −0.5 ≤ x1 ≤ 0.5 and −0.5 ≤ x2 ≤ 0.5 (approximately)
Learning rate
 Question: How to make sure gradient
descent is working correctly?
 Answer: Plot J(θ) as a function of the
number of iterations. It should be
decreasing with every iteration.
Learning rate
 Question: At which iteration should we
stop?
 Answer: it depends on the application
◦ Could be 30, 3000, or even 3,000,000.
 What to do: automatic convergence test
◦ Declare convergence if J(θ) decreases by less than 10⁻³
in one iteration.
Learning rate
 Making sure gradient descent is working correctly:
◦ If we obtain graphs like this:
◦ ⇒ Gradient descent is not working ⇒ use a smaller α.
 Conclusion:
◦ For sufficiently small α, J(θ) should decrease on every
iteration.
◦ But if α is too small, gradient descent can be slow to
converge.
◦ If α is too large: J(θ) may not decrease on every iteration
and may not even converge.
Learning rate
 Question: So how to choose 𝛼?
 Answer: We should try!
◦ 𝛼 = 0.001; 0.003; 0.01; 0.03; 0.1; 0.3; 1; …
◦ For every value, plot 𝐽 𝜃 as a function of the
number of iterations
Question 2
 Suppose we run gradient descent three
times, with α = 0.01, α = 0.1 and α = 1,
and got the following three plots (labeled A,
B, and C). Which plot corresponds to
which value of α?
   α = 0.1   α = 0.01   α = 1
Polynomial regression
 Sometimes, a straight line cannot fit the
training data very well. Some non-linear
function can be a better fit.
 Example:
Maybe a quadratic function will work: θ0 + θ1x + θ2x², but
eventually it will come back down, so it doesn't seem right.
A cubic function seems more appropriate: θ0 + θ1x + θ2x² + θ3x³
Polynomial regression
 Question: How do we fit a non-linear model to
the data?
 Answer: by a simple modification to the
algorithm of multivariate linear regression:
   hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3
   hθ(x) = θ0 + θ1(size) + θ2(size)² + θ3(size)³
 So we choose:
   x1 = size ; x2 = (size)² and x3 = (size)³.
 ATTENTION: FEATURE SCALING BECOMES
INCREASINGLY IMPORTANT.
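The feature construction plus the scaling it calls for can be sketched together. The divisor (an assumed maximum size of 1000) is my own illustrative choice:

```python
# Cubic model as multivariate linear regression: build x1 = size,
# x2 = size^2, x3 = size^3, then scale each feature so their very
# different ranges don't hurt gradient descent.

def poly_features(size, max_size=1000.0):
    return [size / max_size,
            size ** 2 / max_size ** 2,
            size ** 3 / max_size ** 3]

features = poly_features(500.0)   # every entry now lies in [0, 1]
```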
Question 3
 Suppose you want to predict a house's price as a
function of its size. Your model is hθ(x) = θ0 +
θ1(size) + θ2√(size). Suppose size ranges from 1 to
1000 (feet²). Suppose you want to use feature scaling
(without mean normalization). Which of the following
choices for x1 and x2 should you use?
a. x1 = size and x2 = 32√(size)
b. x1 = 32(size) and x2 = √(size)
c. x1 = size/1000 and x2 = √(size)/32
d. x1 = size/32 and x2 = √(size)
Question 3’
1. The gradient of a function f will tell me the direction of steepest
descent
a) True
b) False
2. Gradient descent with arbitrary stepsize will always converge
a) True
b) False
Question 4
 Suppose you have the training set given in the
table below:

   Age (x1)   Height in cm (x2)   Weight in kg (y)
   4          89                  16
   9          124                 28
   5          103                 20

We would like to predict a child's weight as a
function of his age and height with the model:
   y = θ0 + θ1x1 + θ2x2
What are X and y?

   X = [1 4 89; 1 9 124; 1 5 103]  and  y = [16; 28; 20]
Logistic regression: Hypothesis representation
 We want hθ(x) = g(θᵀx) where 0 ≤ hθ(x) ≤ 1.
 We define the sigmoid function (or logistic function):
   g(z) = 1 / (1 + e^(−z)) ⇒ hθ(x) = 1 / (1 + e^(−θᵀx))
 The problem becomes: fitting the parameters θ.
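The sigmoid and the logistic hypothesis built on it fit in a few lines. A minimal sketch; the function names are my own:

```python
# Sigmoid g(z) = 1 / (1 + e^(-z)) and logistic hypothesis
# h_theta(x) = g(theta^T x).
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic hypothesis: theta and x are plain lists of equal length."""
    return g(sum(t * xi for t, xi in zip(theta, x)))

p = h([0.0, 1.0], [1.0, 0.0])   # theta^T x = 0, so the output is exactly 0.5
```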
Cost function
 Goal: fit the parameters θ for the logistic hypothesis.
 Situation:
◦ Training set with m examples: (x^(1), y^(1)), ⋯, (x^(m), y^(m))
◦ n features: x^(i) = [x_0^(i); x_1^(i); ⋯; x_n^(i)] ∈ ℝ^(n+1)
◦ x0 = 1, y ∈ {0, 1}
◦ Logistic regression: hθ(x) = 1 / (1 + e^(−θᵀx))
 Question: how to choose parameters θ?
Cost function
 Answer: Recall the cost function in the linear
regression case:
   J(θ) = (1/(2m)) Σ_{i=1}^m (hθ(x^(i)) − y^(i))²
 We can re-write it as:
   J(θ) = (1/m) Σ_{i=1}^m cost(hθ(x^(i)), y^(i))
 Where cost(hθ(x^(i)), y^(i)) = (1/2) (hθ(x^(i)) − y^(i))²
⇒ the cost function is a sum over the training set of
the cost term.
Cost function
 We cannot use the same cost function for
logistic regression since hθ(x) is nonlinear:
◦ J(θ) will be non-convex
◦ ⇒ it will present many local minima
◦ ⇒ no guarantee of reaching the global minimum.
Cost function
 The cost function used for logistic regression:
   cost(hθ(x), y) = −log(hθ(x))       if y = 1
                    −log(1 − hθ(x))   if y = 0
 If y = 1, graph of −log(hθ(x)) vs hθ(x):
Cost function
 If y = 0, graph of −log(1 − hθ(x)):
Question 5
 For the cost function used for logistic
regression, which of the following are true?
a. If hθ(x) = y, then cost(hθ(x), y) = 0 ∀y (y = 0 or
y = 1)
b. If y = 0, then cost(hθ(x), y) → ∞ as hθ(x) → 1
c. If y = 0, then cost(hθ(x), y) → ∞ as hθ(x) → 0
d. ∀y, if hθ(x) = 0.5 then cost(hθ(x), y) > 0
Simplified cost function and gradient descent
 Since y = 0 or y = 1 always, cost can be written as:
   cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
 Logistic regression cost function:
   J(θ) = (1/m) Σ_{i=1}^m cost(hθ(x^(i)), y^(i))
        = −(1/m) Σ_{i=1}^m [ y^(i) log(hθ(x^(i))) + (1 − y^(i)) log(1 − hθ(x^(i))) ]
 This cost function derives from statistics using the
principle of maximum likelihood estimation. It is convex.
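The simplified cost can be transcribed directly. A sketch; the predictions are passed in as plain numbers to keep it self-contained:

```python
# Simplified logistic cost:
# J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ].
import math

def logistic_cost(preds, ys):
    m = len(ys)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, ys)) / m

confident = logistic_cost([0.99, 0.01], [1, 0])   # near-correct: small cost
unsure = logistic_cost([0.5], [1])                # h = 0.5 costs log(2)
```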
Simplified cost function and gradient descent
 To fit parameters θ: minimize J(θ) over θ.
 To make a prediction given a new x:
   Output hθ(x) = 1 / (1 + e^(−θᵀx)), which is p(y = 1 | x; θ)
Simplified cost function and gradient descent
 Gradient descent to minimize the cost function:
◦ Repeat {
     θj := θj − α ∂/∂θj J(θ)   (simultaneously update all θj)
  }
◦ With ∂/∂θj J(θ) = (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) ∙ x_j^(i)
 Thus the gradient descent algorithm becomes:
◦ Repeat {
     θj := θj − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) ∙ x_j^(i)
     (simultaneously update all θj)
  }
Simplified cost function and gradient descent
 Question: how to make sure that the
algorithm is converging?
 Answer: as for linear regression, plot J(θ)
as a function of the iterations and make sure
it is decreasing with every iteration.
Exercise on regression
Exercise:
 Apply the gradient descent algorithm using
logistic regression for one epoch to solve
the logical OR problem.
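A sketch of this exercise in code. The exercise asks for one epoch by hand; below the loop runs many epochs so the learned model actually reproduces OR. The learning rate and iteration count are my own choices:

```python
# Logistic regression trained by batch gradient descent on the logical OR
# truth table.
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs with x0 = 1 prepended, and the OR targets.
X = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
Y = [0, 1, 1, 1]

theta = [0.0, 0.0, 0.0]
alpha, m = 0.5, len(X)
for _ in range(5000):
    # errors h(x_i) - y_i under the current theta, so every theta_j is
    # updated simultaneously from the same gradient
    errs = [g(sum(t * xi for t, xi in zip(theta, x))) - y
            for x, y in zip(X, Y)]
    theta = [t - alpha * sum(e * x[j] for e, x in zip(errs, X)) / m
             for j, t in enumerate(theta)]

# Thresholding h at 0.5 should recover the OR truth table.
preds = [1 if g(sum(t * xi for t, xi in zip(theta, x))) >= 0.5 else 0
         for x in X]
```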