Regression
Linear Regression with one variable
◦ Model representation
◦ Cost function
◦ Gradient descent
Linear Regression with multiple variables
◦ Multiple features
◦ Feature scaling
◦ Learning rate
Model Representation
Consider the house pricing example:
◦ Supervised learning: the “right answer” is given
for each example in the data.
◦ Regression problem: Predict real-valued output.
Given this data set, predict the price of a house (for example, a house of size 1250 square feet).
To do this estimation, fit a model: it may be a straight line.
Based on the fitted line, we can read off the estimated price of the 1250-square-foot house.
In supervised learning, there is a data set called the training set.
For the example of the housing prices:
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Goal of ML: learn from this data how to predict the prices of houses.
Mathematical Notation
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable
X: the space of input values.
Y: the space of output values.
(x,y): one training example.
(x(i),y(i)): the ith training example.
A single row in this table corresponds to a
single training example.
Example: in the above table: y(3) = 315.
How does the supervised learning
algorithm work?
We “feed” the training set to our learning algorithm.
Job of the learning algorithm: to output a function denoted h (which stands for hypothesis).
Job of the hypothesis in the house pricing example: it is a function that takes as input the size of a house (the value of x) and outputs the estimated price y.
So h is a function that maps from x's to y's.
More formally: given a training set, learn a function h: X → Y so that h(x) is a “good” predictor for the corresponding value of y.
Why linear model?
We will start first with fitting linear functions, and we will build on this to eventually have more complex (non-linear) models and more complex learning algorithms.
Linear model hypothesis
Try to fit a straight line to the training set using:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
Short-hand: we use $h(x)$ instead of $h_\theta(x)$.
$\theta_i$'s: the parameters.
One feature ⇒ one x variable ⇒ linear regression with one variable, or univariate linear regression.
Cost function
Different choices of the parameters $\theta_0$ and $\theta_1$ give different hypothesis functions $h_\theta(x) = \theta_0 + \theta_1 x$, for example:
◦ $h_\theta(x) = 1.5$
◦ $h_\theta(x) = 0.5x$
◦ $h_\theta(x) = 1 + 0.5x$
Regression: choose $\theta_0$ and $\theta_1$ so that $h_\theta(x)$ is close to $y$ for the training examples $(x, y)$.
Cost function (squared error function):
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
The problem becomes:
$$\operatorname{minimize}_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$
Ideally: $J(\theta_0, \theta_1) = 0$.
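The cost above is easy to evaluate directly. Here is a minimal Python sketch, using the three-point training set (1, 1), (2, 2), (3, 3) from the worked examples on these slides (the function name and layout are illustrative choices):

```python
# Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2),
# evaluated on the slides' three-point training set.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
print(cost(0.0, 1.0, xs, ys))   # 0.0: the line y = x fits this data perfectly
print(cost(0.0, 0.5, xs, ys))   # ~0.58, matching the slides' worked value
```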
Training set example with one parameter $\theta_1$ (we take $\theta_0 = 0$ here to simplify the study).
If we choose $\theta_1 = 1$ ⇒ $J(\theta_1) = 0$ (right figure).
If instead we choose $\theta_1 = 0.5$:
$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$$
Plot $J(\theta_1)$ vs $\theta_1$ by taking different values of $\theta_1$:
◦ $\theta_1 = 1$ ➔ $h_\theta(x) = \theta_1 x = x$
◦ $J(\theta_1) = J(1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2 \times 3} \left( (1-1)^2 + (2-2)^2 + (3-3)^2 \right) = 0$
◦ $\theta_1 = 0.5$ ➔ $h_\theta(x) = 0.5x$
◦ $J(\theta_1) = J(0.5) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2 \times 3} \left( (0.5-1)^2 + (1-2)^2 + (1.5-3)^2 \right) \approx 0.58$
◦ $\theta_1 = 0$ ➔ $h_\theta(x) = 0$
◦ $J(\theta_1) = J(0) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2 \times 3} \left( (0-1)^2 + (0-2)^2 + (0-3)^2 \right) \approx 2.33$
◦ $\theta_1 = 1.5$ ➔ $h_\theta(x) = 1.5x$
◦ $J(\theta_1) = J(1.5) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2 \times 3} \left( (1.5-1)^2 + (3-2)^2 + (4.5-3)^2 \right) \approx 0.58$
Plotting $J(\theta_1)$ vs $\theta_1$, the curve has a bowl shape.
Looking at the curve, the value of $\theta_1$ that minimizes $J(\theta_1)$ is 1.
If both parameters are allowed to vary, the cost function must be drawn in 3D, but it will still have the bowl shape:
To make it easier, we can instead draw contour plots:
◦ Axes: $\theta_0$ and $\theta_1$
◦ Each ellipse shows a set of points on which $J(\theta_0, \theta_1)$ takes the same value.
◦ The minimum is at the middle of these concentric ellipses.
We need an algorithm for automatically finding the values of $\theta_0$ and $\theta_1$ that minimize the cost function:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Gradient descent
Algorithm used to minimize the cost function.
Context:
◦ Have some function $J(\theta_0, \theta_1)$ (in a more general form: $J(\theta_0, \theta_1, \ldots, \theta_n)$)
◦ Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$ (in a more general form: $\min_{\theta_0, \ldots, \theta_n} J(\theta_0, \ldots, \theta_n)$)
Algorithm:
◦ Initialize $\theta_0, \theta_1$ (in general we take $\theta_0 = 0$ and $\theta_1 = 0$)
◦ Keep changing $\theta_0, \theta_1$ in small steps to reduce $J(\theta_0, \theta_1)$, until we end up at a local minimum.
Example (figures): starting gradient descent from a first initialization point, it walks downhill to one local minimum; starting from another initialization point, it can end up at a different local minimum.
Gradient descent formulation
Simultaneously update $\theta_0$ and $\theta_1$ as:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad (j = 0 \text{ and } j = 1)$$
Repeat until convergence.
$\alpha$: learning rate.
Gradient descent: implementation note
Simultaneous update ⇒
$$\text{temp}_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
$$\text{temp}_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$
$$\theta_0 := \text{temp}_0$$
$$\theta_1 := \text{temp}_1$$
Example (taking $\theta_0 = 0$ for simplicity), with this starting point:
$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1)$$
The tangent at this point has a positive slope ⇒ $\theta_1$ will decrease (the expected behavior).
Example with another starting point: the tangent at this point has a negative slope ⇒ $\theta_1$ will increase:
$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1)$$
If $\alpha$ is too small, gradient descent can be slow.
If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
Suppose $\theta_1$ is at a local optimum of $J(\theta_1)$. What will one step of gradient descent do? $\theta_1$ will remain unchanged, since the slope of the tangent at this point is zero.
Gradient descent can converge to a local minimum even with the learning rate $\alpha$ fixed. As we approach a local minimum, gradient descent automatically takes smaller steps, because the slopes become less steep ⇒ no need to decrease $\alpha$ over time.
Gradient descent for linear regression
Gradient descent algorithm:
Repeat until convergence {
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad (j = 0 \text{ and } j = 1)$$
}
Linear regression model:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Gradient descent for linear regression ⇒ calculate the partial derivatives:
$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$$
$$j = 0: \quad \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$j = 1: \quad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$
Thus, the gradient descent algorithm becomes:
Repeat until convergence {
$$\text{temp}_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\text{temp}_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$$
$$\theta_0 := \text{temp}_0$$
$$\theta_1 := \text{temp}_1$$
}
Update $\theta_0$ and $\theta_1$ simultaneously.
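These update rules can be sketched in a few lines of Python. The toy data (1, 1), (2, 2), (3, 3) is reused from the earlier slides; the learning rate and iteration count are arbitrary illustrative choices:

```python
# A minimal sketch of batch gradient descent for univariate linear
# regression, following the temp0/temp1 update rules above.
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        # Compute both gradients before touching the parameters,
        # so that the update is simultaneous.
        err = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(err) / m
        grad1 = sum(e * x for e, x in zip(err, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(theta0, theta1)   # close to (0, 1): the line y = x
```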
One potential issue with gradient descent is that it can be susceptible to local optima.
But the cost function for linear regression is always a bowl-shaped, convex function ⇒ it has no local optima other than the single global optimum.
Illustration (figures):
◦ Initialization: $\theta_0 = 900$ and $\theta_1 = -0.1$
◦ After one step of gradient descent
◦ At convergence
Question 1
Which of the following are true statements?
Select all that apply.
a. To make gradient descent converge, we must slowly decrease $\alpha$ over time.
b. Gradient descent is guaranteed to find the global minimum for any function $J(\theta_0, \theta_1)$.
c. Gradient descent can converge even if $\alpha$ is kept fixed. (But $\alpha$ cannot be too large, or else it may fail to converge.)
d. For the specific choice of cost function $J(\theta_0, \theta_1)$ used in linear regression, there are no local optima (other than the global optimum).
Multiple features
Instead of having only one feature to predict
the output, we can have multiple features.
Example:
Multiple features: notation
n: number of features (above example: n = 4).
$x^{(i)}$: input (features) of the $i$th training example; it is a vector $\in \mathbb{R}^n$
(above example: $x^{(2)} = \begin{bmatrix} 1416 \\ 3 \\ 2 \\ 40 \end{bmatrix}$).
$x_j^{(i)}$: value of feature $j$ in the $i$th training example (above example: $x_3^{(2)} = 2$).
In multivariate linear regression, the hypothesis becomes:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
Example:
$$h_\theta(x) = 80 + 0.1 x_1 + 0.01 x_2 + 3 x_3 - 2 x_4$$
For convenience of notation, define $x_0 = 1$. We can then work with vectors and matrices:
$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}; \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} \in \mathbb{R}^{n+1}$$
$$\Rightarrow h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
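The vectorized hypothesis can be checked numerically. This sketch reuses the example hypothesis $h_\theta(x) = 80 + 0.1x_1 + 0.01x_2 + 3x_3 - 2x_4$ and the example house $x^{(2)}$ from the slides above:

```python
# h_theta(x) = theta^T x, with x0 = 1 prepended to the feature vector.
import numpy as np

theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])
features = np.array([1416.0, 3.0, 2.0, 40.0])   # x1..x4 for one house
x = np.concatenate(([1.0], features))            # prepend x0 = 1

h = theta @ x                                    # theta^T x
print(h)                                         # 80 + 141.6 + 0.03 + 6 - 80
```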
Gradient descent for multiple variables
Cost function using the vector $\theta$:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
New gradient descent algorithm ($n \geq 1$):
Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$$
(simultaneously update $\theta_j$ for $j = 0, \ldots, n$)
}
For $n = 1$, we get the same equations for $\theta_0$ and $\theta_1$ as before, because $x_0^{(i)} = 1$.
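Stacking all examples into an $m \times (n+1)$ design matrix $X$ (with a leading column of ones for $x_0 = 1$), the per-component rule above becomes a single vectorized update, $\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$. The toy data and learning rate in this sketch are made up for illustration:

```python
# Vectorized batch gradient descent for multivariate linear regression:
# all n+1 components of theta are updated at once.
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # vector of all n+1 partial derivatives
        theta -= alpha * grad              # simultaneous update
    return theta

# Toy data generated from y = 1 + 2*x1, so theta should approach [1, 2].
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))   # close to [1, 2]
```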
Feature scaling
Feature scaling: put the features on a similar scale
◦ then gradient descent can converge more quickly.
Example:
◦ $x_1$ = size (0–2000 feet²); $x_2$ = number of bedrooms (0–5)
◦ Ignore $\theta_0$ for simplicity; $J(\theta)$ is then a function of $\theta_1$ and $\theta_2$.
◦ The contours of the cost function $J(\theta)$ can then take on a very skewed elliptical shape: with the 2000-to-5 ratio, the ellipses become even taller and skinnier.
Without feature scaling, running gradient descent on this cost function may take a long time: it can oscillate back and forth before finally finding its way to the global minimum, as the figure shows:
Feature scaling: general form
Get every feature into approximately a $-1 \leq x_i \leq 1$ range.
($x_0$ is always equal to 1.)
The limits don't need to be exactly $-1$ and $1$, just close to these values. For instance:
◦ a range like $-2$ to $3$: acceptable
◦ a range up to $100$: too big
◦ a range up to $0.0001$: too small
Feature scaling:
$$x_1 = \frac{\text{size (feet}^2)}{2000} \quad \text{and} \quad x_2 = \frac{\text{number of bedrooms}}{5}$$
$$\Rightarrow 0 \leq x_1 \leq 1 \text{ and } 0 \leq x_2 \leq 1$$
The contours then take a much more circular form:
Mean normalization
Replace $x_i$ with $x_i - \mu_i$ to make the features have approximately zero mean:
◦ $\mu_i$: average value of $x_i$ in the training set
◦ Do not apply this to $x_0 = 1$
Mean normalization combined with feature scaling:
◦ Replace $x_i$ with $\frac{x_i - \mu_i}{s_i}$
◦ where $s_i$ is the range of values of the feature ($s_i = \max - \min$), or its standard deviation.
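A minimal sketch of mean normalization with feature scaling, reusing the size and bedroom columns from the slides' examples (the function name and data layout are illustrative choices):

```python
# Each column of X is replaced by (x - mean) / range, as described above.
import numpy as np

def mean_normalize(X):
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)   # range; X.std(axis=0) also works
    return (X - mu) / s

X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
Xn = mean_normalize(X)
print(Xn.mean(axis=0))   # each column now has mean ~0
```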
Previous example:
◦ $x_1$ = size (0–2000 feet²); $x_2$ = number of bedrooms (0–5)
◦ $x_1 = \frac{\text{size} - 1000}{2000}$ and $x_2 = \frac{\text{number of bedrooms} - 2.5}{5}$
◦ $\Rightarrow -0.5 \leq x_1 \leq 0.5$ and $-0.5 \leq x_2 \leq 0.5$ (approximately)
Learning rate
Question: how can we make sure gradient descent is working correctly?
Answer: plot $J(\theta)$ as a function of the number of iterations. It should decrease with every iteration.
Question: at which iteration should we stop?
Answer: it depends on the application.
◦ It could be 30, 3000, or even 3,000,000 iterations.
What to do: an automatic convergence test.
◦ Declare convergence if $J(\theta)$ decreases by less than $10^{-3}$ in one iteration.
Making sure gradient descent is working correctly:
◦ If we obtain graphs where $J(\theta)$ increases or oscillates ⇒ gradient descent is not working ⇒ use a smaller $\alpha$.
Conclusion:
◦ For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration.
◦ But if $\alpha$ is too small, gradient descent can be slow to converge.
◦ If $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration, and may even fail to converge.
Question: so how do we choose $\alpha$?
Answer: we should try!
◦ $\alpha$ = 0.001; 0.003; 0.01; 0.03; 0.1; 0.3; 1; …
◦ For every value, plot $J(\theta)$ as a function of the number of iterations.
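This recipe can be sketched directly: run a fixed number of iterations for each candidate $\alpha$ and record $J(\theta)$ per iteration (normally these curves would be plotted; here only the final values are printed). The data, candidate values, and iteration count are illustrative choices:

```python
# For each alpha, run gradient descent and record the cost history.
import numpy as np

def cost_history(X, y, alpha, iters=50):
    m = len(y)
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(iters):
        err = X @ theta - y
        costs.append(float(err @ err) / (2 * m))   # J(theta)
        theta -= alpha * X.T @ err / m
    return costs

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    print(alpha, cost_history(X, y, alpha)[-1])   # larger (stable) alpha converges faster
```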
Question 2
Suppose we run gradient descent three times, with $\alpha = 0.01$, $\alpha = 0.1$, and $\alpha = 1$, and get the following three plots (labeled A, B, and C). Which plot corresponds to which value of $\alpha$?
($\alpha = 0.1$; $\alpha = 0.01$; $\alpha = 1$)
Polynomial regression
Sometimes a straight line cannot fit the training data very well; some non-linear function can be a better fit.
Example:
Maybe a quadratic function will work: $\theta_0 + \theta_1 x + \theta_2 x^2$. But a quadratic eventually comes back down, which doesn't seem right for housing prices.
A cubic function seems more appropriate: $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
Question: how do we fit a non-linear model to the data?
Answer: by a simple modification of the multivariate linear regression algorithm:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$$
$$h_\theta(x) = \theta_0 + \theta_1 (\text{size}) + \theta_2 (\text{size})^2 + \theta_3 (\text{size})^3$$
So we choose:
$$x_1 = \text{size}; \quad x_2 = (\text{size})^2; \quad x_3 = (\text{size})^3$$
ATTENTION: FEATURE SCALING BECOMES INCREASINGLY IMPORTANT.
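A sketch of this feature construction, with made-up sizes; each polynomial column is scaled (here simply by its maximum) so the powers stay on comparable scales, as the warning above suggests:

```python
# Build x1 = size, x2 = size^2, x3 = size^3, then scale each column.
import numpy as np

sizes = np.array([100.0, 400.0, 700.0, 1000.0])

# Construct the polynomial features...
X = np.column_stack([sizes, sizes**2, sizes**3])
# ...and scale them; without this, size^3 (up to 1e9) would dwarf size.
X_scaled = X / X.max(axis=0)

print(X_scaled.max(axis=0))   # every column now tops out at 1.0
```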
Question 3
Suppose you want to predict a house's price as a function of its size. Your model is $h_\theta(x) = \theta_0 + \theta_1 \cdot \text{size} + \theta_2 \cdot \sqrt{\text{size}}$. Suppose size ranges from 1 to 1000 (feet²), and you want to use feature scaling (without mean normalization). Which of the following choices for $x_1$ and $x_2$ should you use?
a. $x_1 = \text{size}$; $x_2 = 32\sqrt{\text{size}}$
b. $x_1 = 32 \cdot \text{size}$; $x_2 = \sqrt{\text{size}}$
c. $x_1 = \frac{\text{size}}{1000}$; $x_2 = \frac{\sqrt{\text{size}}}{32}$
d. $x_1 = \frac{\text{size}}{32}$; $x_2 = \sqrt{\text{size}}$
Question 3’
1. The gradient of a function f will tell me the direction of steepest
descent
a) True
b) False
2. Gradient descent with arbitrary stepsize will always converge
a) True
b) False
Question 4
Suppose you have the training set given in the table below:

Age (x1)   Height in cm (x2)   Weight in kg (y)
4          89                  16
9          124                 28
5          103                 20

We would like to predict a child's weight as a function of his age and height with the model:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
What are X and y?
$$X = \begin{bmatrix} 1 & 4 & 89 \\ 1 & 9 & 124 \\ 1 & 5 & 103 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 16 \\ 28 \\ 20 \end{bmatrix}$$
Logistic regression: Hypothesis representation
We want $h_\theta(x) = g(\theta^T x)$ where $0 \leq h_\theta(x) \leq 1$.
We define the sigmoid function (or logistic function):
$$g(z) = \frac{1}{1 + e^{-z}} \Rightarrow h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
The problem becomes: fitting the parameters $\theta$.
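A minimal sketch of the sigmoid and the resulting hypothesis (the $\theta$ values are arbitrary illustrative choices):

```python
# The sigmoid g(z) squashes theta^T x into (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):                      # x includes x0 = 1
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0.0))                   # 0.5: the decision-boundary value
print(h([-1.0, 2.0], [1.0, 3.0]))     # sigmoid(5), close to 1
```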
Cost function
Goal: fit the parameters $\theta$ for the logistic hypothesis.
Situation:
◦ Training set with m examples: $\left( x^{(1)}, y^{(1)} \right), \ldots, \left( x^{(m)}, y^{(m)} \right)$
◦ n features: $x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1}$
◦ $x_0 = 1$; $y \in \{0, 1\}$
◦ Logistic regression: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
Question: how to choose the parameters $\theta$?
Answer: recall the cost function in the linear regression case:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
We can re-write it as:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)$$
where $\text{cost}\left( h_\theta(x^{(i)}), y^{(i)} \right) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
⇒ the cost function is a sum over the training set of a per-example cost term.
We cannot use the same cost function for logistic regression, since $h_\theta(x)$ is non-linear:
◦ $J(\theta)$ would be non-convex
◦ ⇒ it would present many local minima
◦ ⇒ no guarantee of reaching the global minimum.
The cost function used for logistic regression:
$$\text{cost}\left( h_\theta(x), y \right) = \begin{cases} -\log\left( h_\theta(x) \right) & \text{if } y = 1 \\ -\log\left( 1 - h_\theta(x) \right) & \text{if } y = 0 \end{cases}$$
If $y = 1$, the graph of $-\log\left( h_\theta(x) \right)$ vs $h_\theta(x)$: (figure)
If $y = 0$, the graph of $-\log\left( 1 - h_\theta(x) \right)$ vs $h_\theta(x)$: (figure)
Question 5
For the cost function used for logistic regression, which of the following are true?
a. If $h_\theta(x) = y$, then $\text{cost}\left( h_\theta(x), y \right) = 0$ for both $y = 0$ and $y = 1$.
b. If $y = 0$, then $\text{cost}\left( h_\theta(x), y \right) \to \infty$ as $h_\theta(x) \to 1$.
c. If $y = 0$, then $\text{cost}\left( h_\theta(x), y \right) \to \infty$ as $h_\theta(x) \to 0$.
d. For all $y$, if $h_\theta(x) = 0.5$ then $\text{cost}\left( h_\theta(x), y \right) > 0$.
Simplified cost function and gradient descent
Since $y = 0$ or $y = 1$ always, the cost term can be written as:
$$\text{cost}\left( h_\theta(x), y \right) = -y \log\left( h_\theta(x) \right) - (1 - y) \log\left( 1 - h_\theta(x) \right)$$
Logistic regression cost function:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{cost}\left( h_\theta(x^{(i)}), y^{(i)} \right) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( h_\theta(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]$$
This cost function derives from the statistical principle of maximum likelihood estimation. It is convex.
To fit the parameters $\theta$: find $\min_\theta J(\theta)$.
To make a prediction given a new $x$, output:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
which is interpreted as $p(y = 1 \mid x; \theta)$.
Gradient descent to minimize the cost function:
◦ Repeat {
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
(simultaneously update all $\theta_j$)
}
◦ With:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$$
Thus the gradient descent algorithm becomes:
◦ Repeat {
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$$
(simultaneously update all $\theta_j$)
}
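A sketch of this algorithm on a tiny made-up one-feature data set (label 1 iff x > 0); $\alpha$ and the iteration count are illustrative choices:

```python
# Logistic regression trained by the gradient descent rule above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.5, iters=5000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        err = sigmoid(X @ theta) - y          # h_theta(x^(i)) - y^(i)
        theta -= alpha * X.T @ err / m        # simultaneous update of all theta_j
    return theta

# Column of ones (x0 = 1) plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)   # matches y: the four points are classified correctly
```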
Question: how can we make sure that the algorithm is converging?
Answer: as for linear regression, plot $J(\theta)$ as a function of the iterations and make sure it decreases with every iteration.
Exercise on regression
Exercise: apply the gradient descent algorithm using logistic regression, for one epoch, to solve the logical OR problem.