CSC2515 Fall 2020
Midterm Test
Start your scan at this page.
1. True/False [16 points] For each statement below, say whether it is true or false, and
give a one or two sentence justification of your answer.
(a) [2 points] As we increase the number of parameters in a model, both the bias
and variance of our predictions should decrease since we can better fit the data.
False. Although adding more parameters helps the model capture more complex structure in the data, which reduces the bias, the model will overfit: the variance of our predictions will increase overall as the model gets more complex.
(b) [2 points] We can always improve the performance of clustering algorithms like
k-means by removing features and reducing the dimensionality of our data.
False. Unless the data is intrinsically low-dimensional, the removed features can be important factors of variation, so removing features causes a loss of information and can hurt the clustering.
(c) [2 points] Bagging and boosting are both ways of reducing the variance of our
models.
True. Bagging reduces the variance of models because it averages predictions over models trained on resampled examples; boosting reduces variance since it takes a weighted average over weak learners.
(d) [2 points] Machine learning models work automatically without the need for the
user to manually set hyperparameters.
False. The user needs to set hyperparameters manually at the beginning, e.g., k in kNN or the batch size in mini-batch stochastic gradient descent; the best hyperparameter value is selected by comparing validation errors over a user-defined range, not automatically.
(e) [2 points] Most machine learning algorithms, in general, can learn a model with
a Bayes optimal error rate.
False. The Bayes optimal error rate is the best error rate we can ever hope to achieve; hardly any learning algorithm covered in this course achieves it.
(f) [2 points] The k-means algorithm is guaranteed to converge to the global minimum
False. The k-means objective is not convex, and the algorithm is a coordinate descent on that objective, so it is only guaranteed to converge to a local minimum; nothing prevents it from getting stuck in local minima instead of the global minimum.
(g) [2 points] When optimizing a regression problem where we know that only some
of our features are useful, we should use L1 regularization.
True. If you believe only a small fraction of the features is relevant, L1 regularization encourages many weights to be exactly zero and gives you a sparse optimal weight vector.
(h) [2 points] For a given binary classification problem, a linear SVM model and a
logistic regression model will converge to the same decision boundary.
False. Logistic regression minimizes the cross-entropy loss, while a linear SVM minimizes the hinge loss with L2 regularization. The two losses have similar asymptotic behavior, suggesting the decision boundaries will be similar, but they are not the same: the losses and their gradients differ, so gradient descent converges to different weights and thus to different boundaries.
2. Probability and Bayes Rule [14 points]
You have just purchased a two-sided die, which can come up either 1 or 2. You want to use your crazy die in some betting games with friends later this evening, but first you want to know the probability that it will roll a 1. You know it came either from factory 0 or factory 1, but not which.
Factory 0 produces dice that roll a 1 with probability θ0. Factory 1 produces dice that roll a 1 with probability θ1. You believe initially that your die came from factory 1 with probability η1.
(a) [2 points] Without seeing any rolls of this die, what would be your predicted probability that it would roll a 1?
Let F ∈ {0, 1} denote the factory the die came from. By the law of total probability,
Pr(roll = 1) = Pr(roll = 1 | F = 1) Pr(F = 1) + Pr(roll = 1 | F = 0) Pr(F = 0)
             = θ1 η1 + θ0 (1 − η1).
(b) [2 points] If we roll the die and observe the outcome, what can we infer about where the die was manufactured?
Using Bayes rule, the observed roll x updates our belief about the factory. The posterior is
Pr(F = 1 | x) = Pr(x | F = 1) η1 / [ Pr(x | F = 1) η1 + Pr(x | F = 0)(1 − η1) ],
and similarly for factory 0, so each roll shifts the posterior toward the factory whose dice make that outcome more likely.
(c) [2 points] More concretely, let's assume that:
• θ0 = 1: dice from factory 0 always roll a 1
• θ1 = 0.5: dice from factory 1 are fair (roll a 1 with probability 0.5)
• η1 = 0.7: we think with probability 0.7 that this die came from factory 1
What is your predicted probability that the die rolls a 1?
Pr(roll = 1) = θ1 η1 + θ0 (1 − η1) = 0.5 × 0.7 + 1 × 0.3 = 0.65.
(d) [2 points] Now we roll it, and it comes up 1! What is your posterior distribution
on which factory it came from? What is your predictive distribution on the value
of the next roll?
Pr(F = 1 | 1) = (0.5 × 0.7) / (0.5 × 0.7 + 1 × 0.3) = 0.35 / 0.65 = 7/13 ≈ 0.538, so Pr(F = 0 | 1) ≈ 0.462.
Predictive distribution for the next roll:
Pr(next = 1 | 1) = 0.5 × 0.538 + 1 × 0.462 ≈ 0.731, and Pr(next = 2 | 1) ≈ 0.269.
(e) [2 points] You roll it again, and it comes up 1 again.
Now, what is your posterior distribution on which factory it came from? What is
your predictive distribution on the value of the next roll?
Pr(F = 1 | 1, 1) = (0.5² × 0.7) / (0.5² × 0.7 + 1² × 0.3) = 0.175 / 0.475 = 7/19 ≈ 0.368, so Pr(F = 0 | 1, 1) ≈ 0.632.
Predictive distribution for the next roll:
Pr(next = 1 | 1, 1) = 0.5 × 0.368 + 1 × 0.632 ≈ 0.816, and Pr(next = 2 | 1, 1) ≈ 0.184.
(f) [2 points] Instead, what if it rolls a 2 on the second roll?
Dice from factory 0 never roll a 2 (θ0 = 1), so Pr(1, 2 | F = 0) = 0 and the posterior puts all of its mass on factory 1: Pr(F = 1 | 1, 2) = 1. The predictive distribution for the next roll is then Pr(next = 1) = θ1 = 0.5 and Pr(next = 2) = 0.5.
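A minimal Python sketch of the updates in (c)–(f), assuming the factory parameters above (the helper names `update` and `predictive` are ours, not the exam's):

```python
# Hypothetical check of the die posteriors; not part of the original exam.
theta = {0: 1.0, 1: 0.5}   # Pr(roll = 1 | factory)
prior = {0: 0.3, 1: 0.7}   # initial belief over factories

def update(belief, roll):
    """One Bayes update: posterior over factories given a single roll."""
    like = {f: theta[f] if roll == 1 else 1 - theta[f] for f in belief}
    Z = sum(like[f] * belief[f] for f in belief)
    return {f: like[f] * belief[f] / Z for f in belief}

def predictive(belief):
    """Probability the next roll is a 1 under the current belief."""
    return sum(theta[f] * belief[f] for f in belief)

after_one = update(prior, 1)
after_two = update(after_one, 1)
print(after_one[1], predictive(after_one))   # ~0.538, ~0.731  (part d)
print(after_two[1], predictive(after_two))   # ~0.368, ~0.816  (part e)
print(update(after_one, 2)[1])               # 1.0             (part f)
```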
(g) [2 points] In the general case (not using the numerical values we have been using)
prove that if you have two observations, and you use them to update your prior
in two steps (first conditioning on one observation and then conditioning on the
second), that no matter which order you do the updates in you will get the same
result.
Let x1 and x2 be the two observations and F the factory, with the rolls conditionally independent given F. Updating on x1 first and then on x2,
Pr(F | x1, x2) ∝ Pr(x2 | F) Pr(F | x1) ∝ Pr(x2 | F) Pr(x1 | F) Pr(F).
Since multiplication is commutative, this is the same unnormalized posterior we get by conditioning on x2 first and then on x1, and the normalizing constant (the sum over F) is also the same; hence the posterior is identical in either order.
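Reusing `update` from the sketch above, the order-independence in (g) can be checked numerically:

```python
# Order of Bayes updates does not matter (illustration, assumed setup above).
a = update(update(prior, 1), 2)
b = update(update(prior, 2), 1)
assert all(abs(a[f] - b[f]) < 1e-12 for f in prior)
```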
3. Decision Trees & Information Gain [10 points]
[Figure: a decision tree. The root node contains (16, 16); it splits into children containing (12, 4) and (4, 12). The (12, 4) node splits into a leaf containing (x, y) and its complement; the (4, 12) node splits into a leaf containing (a, b) and its complement.]
The figure above shows a decision tree. The numbers in each node represent the number
of examples in each class in that node. For example, in the root node, the numbers
16, 16 represent 16 examples in class 0 and 16 examples in class 1. To save space,
we have not written the number of examples in two right leaf nodes as they can be
deduced based on the values in the left leaf nodes.
(a) [2 points] What is the information gain of the split made at the root node?
IG(split) = H(root) − Σ_children (n_child / n) H(child).
H(root) = 1 bit, since the root is a 16:16 split. Each child has 16 examples with a 12:4 split, so
H(child) = −(3/4) log2(3/4) − (1/4) log2(1/4) ≈ 0.811.
IG = 1 − (1/2)(0.811) − (1/2)(0.811) ≈ 0.19.
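A small Python helper (the names are illustrative, not from the exam) that reproduces this computation and can be reused for parts (b) and (c):

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """Information gain of splitting `parent` counts into `children` counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

print(info_gain([16, 16], [[12, 4], [4, 12]]))  # ~0.189
```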
(b) [2 points] What values of a and b will give the smallest possible value of the
information gain at this split, and what is the value?
The information gain is minimized at IG = 0, which happens when both children of the split have the same class proportions as their parent. The parent of the (a, b) leaf contains (4, 12), a 1:3 ratio, so any (a, b) with b = 3a works, e.g., a = 1, b = 3; the information gain is then 0.
(c) [2 points] What values of x and y will give the largest possible value of the
information gain at this split, and what is the value?
The information gain is maximized when the split makes both children pure. The parent of the (x, y) leaf contains (12, 4), so take (x, y) = (12, 0), leaving (0, 4) in the other leaf. Both children then have zero entropy, and
IG = H(parent) = −(3/4) log2(3/4) − (1/4) log2(1/4) ≈ 0.811.
(d) [4 points] Bayes rule for conditional entropy states that H(Y | X) = H(X | Y) − H(X) + H(Y). Using the definition of conditional and joint entropy, show that this is the case.
Hint: First show that H(Y | X) = H(X, Y) − H(X).
By definition,
H(Y | X) = −Σ_{x,y} p(x, y) log p(y | x)
         = −Σ_{x,y} p(x, y) log [ p(x, y) / p(x) ]
         = −Σ_{x,y} p(x, y) log p(x, y) + Σ_{x,y} p(x, y) log p(x)
         = H(X, Y) + Σ_x p(x) log p(x)      (summing p(x, y) over y gives p(x))
         = H(X, Y) − H(X).
By the same proof with the roles of X and Y switched, H(X | Y) = H(X, Y) − H(Y). Subtracting the two identities gives
H(Y | X) − H(X | Y) = H(Y) − H(X), i.e., H(Y | X) = H(X | Y) − H(X) + H(Y).
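A numeric sanity check of the identity on an assumed toy joint distribution:

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables,
# used only to check H(Y|X) = H(X|Y) - H(X) + H(Y).
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

px = [sum(v for (x, _), v in p.items() if x == b) for b in (0, 1)]
py = [sum(v for (_, y), v in p.items() if y == b) for b in (0, 1)]
Hxy = H(p.values())
H_y_given_x = Hxy - H(px)   # H(Y|X) = H(X,Y) - H(X)
H_x_given_y = Hxy - H(py)   # H(X|Y) = H(X,Y) - H(Y)
assert abs(H_y_given_x - (H_x_given_y - H(px) + H(py))) < 1e-12
```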
4. Bayes Optimality [10 points]
(a) [2 points] Consider a regression task with the following data generating process:
x ~ N(µx, Σx)
t | x ~ w^T x + α x^T x + N(µ1, σ1²) + N(µ2, σ2²)
where x, µx ∈ R^D, Σx ∈ R^{D×D}; w ∈ R^D and α ∈ R are parameters.
At a particular query point x* ∈ R^D, what is the output of the Bayes optimal regressor y* = f*(x*)?
Under squared error, the Bayes optimal regressor is the conditional mean of t:
y* = E[t | x*] = w^T x* + α x*^T x* + µ1 + µ2,
since expectation is linear and the two noise terms contribute their means µ1 and µ2.
(b) [2 points] For the data generating process in part (a), at a particular query point x* ∈ R^D, what is the expected squared error (defined as E[(t − y*)²]) of the Bayes optimal regressor?
The expected squared error of the Bayes optimal regressor is the conditional variance of t:
E[(t − y*)² | x*] = Var[t | x*] = σ1² + σ2²,
since the deterministic part of t has no variance and the two independent Gaussian noise terms contribute the sum of their variances.
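A Monte Carlo sketch, with assumed parameter values, checking both the conditional mean in (a) and the variance in (b):

```python
import numpy as np

# Monte Carlo sanity check of (a) and (b) with assumed parameter values.
rng = np.random.default_rng(0)
D = 3
w = rng.normal(size=D); alpha = 0.5
mu1, s1 = 1.0, 0.3
mu2, s2 = -2.0, 0.4
x_star = rng.normal(size=D)

t = (w @ x_star + alpha * x_star @ x_star
     + rng.normal(mu1, s1, size=1_000_000)
     + rng.normal(mu2, s2, size=1_000_000))
y_star = w @ x_star + alpha * x_star @ x_star + mu1 + mu2
print(t.mean() - y_star)          # ~0: conditional mean matches (a)
print(t.var(), s1**2 + s2**2)     # ~0.25 both: variance matches (b)
```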
(c) [2 points] Consider a 3-class classification problem with the following data generating process:
x ~ unif(0, 1)
t | x = 0 with probability 0.2; 1 with probability 0.5x; 2 with probability 0.8 − 0.5x
where x ∈ R and t ∈ {0, 1, 2}.
What is the expression for the Bayes optimal classifier?
The Bayes optimal classifier predicts the most probable class:
y*(x) = argmax_t p(t | x) = argmax { 0.2, 0.5x, 0.8 − 0.5x }.
On [0, 1] we have 0.8 − 0.5x ≥ 0.3 > 0.2, so class 0 is never optimal, and 0.5x > 0.8 − 0.5x iff x > 0.8. Hence
y*(x) = 2 if x ≤ 0.8, and y*(x) = 1 if x > 0.8.
(d) [4 points] What is the misclassification rate of the Bayes optimal classifier in
part (c)? Your answer should be a scalar.
The misclassification rate is E_x[1 − p(y*(x) | x)]:
∫_0^0.8 (1 − (0.8 − 0.5x)) dx + ∫_0.8^1 (1 − 0.5x) dx
= ∫_0^0.8 (0.2 + 0.5x) dx + ∫_0.8^1 (1 − 0.5x) dx
= (0.16 + 0.16) + (0.75 − 0.64)
= 0.32 + 0.11 = 0.43.
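A quick numeric check of this integral:

```python
import numpy as np

# Numeric check of the Bayes misclassification rate in (d).
x = np.linspace(0, 1, 1_000_001)
p = np.stack([np.full_like(x, 0.2), 0.5 * x, 0.8 - 0.5 * x])
err = 1 - p.max(axis=0)      # pointwise error of the Bayes classifier
print(np.trapz(err, x))      # ~0.43
```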
5. Linear/Logistic Regression [10 points]
(a) [5 points] In lecture, we considered using linear regression for binary classification
on the targets {0, 1}. Here, we use a linear model
y = w^T x + b
and squared error loss L(y, t) = (1/2)(y − t)². We saw this has the problem that it penalizes confident correct predictions. Will it fix this problem if we instead use a modified hinge loss, shown below?
L(y, t) = max(0, y)        if t = 0,
L(y, t) = 1 − min(1, y)    if t = 1.
Justify your answer mathematically.
Yes, this fixes the problem. If t = 0, the loss max(0, y) equals y for y > 0 and 0 for y ≤ 0; if t = 1, the loss 1 − min(1, y) equals 1 − y for y < 1 and 0 for y ≥ 1. A confident correct prediction (y ≤ 0 when t = 0, or y ≥ 1 when t = 1) therefore incurs exactly zero loss no matter how large |y| is, while predictions on the wrong side are penalized linearly. With squared error, by contrast, the loss grows as y moves past the target even in the correct direction. The modified loss behaves like the hinge loss used by SVMs, so confident correct predictions are no longer penalized.
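A short illustration of the difference, assuming a confident correct prediction with t = 1:

```python
# Comparing squared error and the modified hinge loss for a confident
# correct prediction (t = 1, y large). Illustration only.
def sq_loss(y, t):
    return 0.5 * (y - t) ** 2

def mod_hinge(y, t):
    return max(0.0, y) if t == 0 else 1.0 - min(1.0, y)

for y in [1.0, 5.0, 50.0]:
    print(y, sq_loss(y, 1), mod_hinge(y, 1))  # squared loss grows, hinge stays 0
```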
(b) [5 points] Consider the logistic regression model y = g(w^T x), trained using the binary cross entropy loss function, where g(z) = 1 / (1 + e^{−z}) is the sigmoid function. Your friend proposes a modified version of logistic regression, using
g(z) = e^{−z} / (1 + e^{−z})
The model would still be trained using the binary cross entropy loss. How would the learnt model parameters, as well as the model predictions, differ from conventional logistic regression? Justify your answer mathematically. A purely graphical explanation is not sufficient.
Note that
e^{−z} / (1 + e^{−z}) = 1 − 1 / (1 + e^{−z}) = σ(−z),
where σ is the conventional sigmoid. So the modified model computes y = σ(−w^T x), a conventional logistic regression applied to the negated weights. If w* minimizes the binary cross entropy for conventional logistic regression, then −w* achieves exactly the same loss for the modified model, and vice versa: the two objectives are mirror images of each other. The learnt parameters therefore have their signs reversed relative to the conventional model, while the model predictions are identical.
6. L1/L2 Regularization [6 points]
(a) [2 points] The L∞ norm is defined on a vector x as L∞(x) = max({|xi| : i = 1, 2, . . . , n}). As an example, if x = [−6, 2, 4], then L∞(x) = 6 since the magnitude of the largest element in x is 6.
On the loss plot below, draw the L∞ norm and draw a point at the set of weights the regularization term will select.
[Figure: contour plot of the loss over weights, with w1 on the horizontal axis and w2 on the vertical axis, on which the norm ball and the selected weights are to be drawn.]
(b) [4 points] What kind of weights does the L∞ norm encourage our models to learn? In which situations might this be useful?
The L∞ norm penalizes only the largest-magnitude weight, so it encourages weight vectors whose maximum magnitude is bounded; reducing the largest weight has no effect on the rest, so the weights tend toward similar magnitudes. This is useful when we need to constrain an upper bound on the weights, for example to bound the effect of any single weight in a neural network so that no one weight dominates.
7. Classification Decision Boundaries [10 points]
[Figure: a 2-D plot of labeled training points, on which the boundaries f1–f4 and their margins are to be sketched.]
(a) [1 point] Assume this is the first step in the boosting algorithm. In the figure above, sketch the decision boundary for the decision stump, optimizing for
weighted misclassification error. Label this boundary f1 .
(b) [2 points] Select an arbitrary point which f1 incorrectly classifies. Draw a circle
around the point you select. What is its weight in the next boosting iteration?
Assume the weights initially sum to 1.
Using the AdaBoost update: with the n points initially weighted w_i = 1/n, the stump's weighted error is err1 = Σ_{i misclassified} w_i and its coefficient is α1 = (1/2) log((1 − err1)/err1). The circled misclassified point's weight is multiplied by exp(α1) and all weights are renormalized, so its new weight is w_i exp(α1) / Z, strictly larger than its old weight 1/n.
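A sketch of this AdaBoost reweighting, assuming n = 8 uniformly weighted points of which f1 misclassifies one (the actual counts in the figure may differ):

```python
import numpy as np

# AdaBoost weight update (assumed setup: uniform initial weights,
# `miss` marks the points f1 misclassifies).
def adaboost_reweight(w, miss):
    err = w[miss].sum()
    alpha = 0.5 * np.log((1 - err) / err)
    w = w * np.exp(np.where(miss, alpha, -alpha))
    return w / w.sum(), alpha

n = 8
w0 = np.full(n, 1 / n)
miss = np.zeros(n, dtype=bool); miss[0] = True   # one misclassified point
w1, alpha1 = adaboost_reweight(w0, miss)
print(alpha1, w1[0])   # the misclassified point's weight grows (here to 0.5)
```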
(c) [1 point] In the figure above, sketch the decision boundary for a 1-layer decision
tree, optimizing for information gain. Label this boundary f2 . You do not have
to show that the boundary is optimal.
(d) [3 points] Compute the information gain of the split f2 .
IG(f2) = H(root) − Σ_children (n_child / n) H(child). With the class counts induced by f2 (one pure child, and one child holding 2/3 of the points with a 1:3 class ratio),
IG = 1 − (2/3)(0.811) ≈ 0.46.
(e) [1 point] In the figure above, sketch the decision boundary for a decision tree
with unlimited depth. Label this boundary f3 .
(f) [2 points] In the figure above, sketch the decision boundary for a hard-margin
SVM. Label this boundary f4 . Is the boundary unique? You do not have to
justify your answer.
No, the boundary is not unique.
8. SVM [9 points]
Here are plots of w^T x + b = 0 for different training methods along with the support vectors. Points labeled +1 are in blue, points labeled −1 are in red. The line w^T x + b = 0 is shown in bold; in addition we show the lines w^T x + b = 1 and w^T x + b = −1 in non-bold. Support vectors have bold circles surrounding them.
[Figure: four panels, labeled 1) through 4), each plotting a boundary w^T x + b = 0 with its margin lines over the range −3 to 3 on both axes.]
For each method, list all figures that show a possible solution for that method and give
a brief explanation. Note that some methods may have more than one possible figure.
(a) [3 points]
min_{w,b} (1/2)||w||² + Σ_{t=1}^n ξ_t   s.t.  ξ_t ≥ 0,  t^(i)(w^T x^(i) + b) ≥ 1 − ξ_t,  t = 1, . . . , n
This is the standard soft-margin SVM: it maximizes the margin while penalizing the slacks, and the bias b lets the boundary sit anywhere in the plane. It matches the panels whose bold line is a max-margin boundary that need not pass through the origin.
(b) [3 points]
min_w (1/2)||w||² + Σ_{t=1}^n ξ_t   s.t.  ξ_t ≥ 0,  t^(i)(w^T x^(i)) ≥ 1 − ξ_t,  t = 1, . . . , n
The same soft-margin objective, but with the bias b eliminated, so the boundary must pass through the origin; subject to that constraint the margin is still maximized. It matches the panels whose bold line passes through the origin with a large margin.
(c) [3 points]
min_w Σ_{t=1}^n ξ_t   s.t.  ξ_t ≥ 0,  t^(i)(w^T x^(i)) ≥ 1 − ξ_t,  t = 1, . . . , n
Here the (1/2)||w||² term is dropped, so nothing encourages a large margin: ||w|| can grow until the slacks are eliminated and the margin lines collapse toward the boundary. It matches the panels with a very narrow margin (again through the origin, since there is no b).
9. Gradient Descent [8 points]
For a homework assignment, you have implemented a logistic regression model
y^(i) = σ(W x^(i) + b)
You decide to train your model on a multi-class problem using gradient descent. However, before you can turn your assignment in, your friends stop by to give you some
suggestions.
(a) [2 points] Pat sees your regressor implementation and says there’s a much simpler
way! Just leave out sigmoids, and let f (x) = W x + b. The derivatives are a lot
less hassle and it runs faster.
What’s wrong with Pat’s approach?
Without the sigmoid, the outputs are not squashed into (0, 1). A correct prediction made with large magnitude then leaves a large residual against the 0/1 target, so the loss function heavily penalizes confident correct predictions; the squashing function is exactly what prevents this.
(b) [2 points] Chris comes in and says that your regressor is too complicated, but for a different reason. The sigmoids are confusing and basically the same answer would be computed if we used step functions instead.
What's wrong with Chris's approach?
A step function has zero gradient everywhere it is defined (and no gradient at the step itself), so gradient descent has nothing to follow: changing the weights by a small amount has no effect on the outputs or the loss, and every weight update is zero. The sigmoid is what makes small weight changes produce small, informative changes in the loss.
(c) [2 points] Jody sees that you are handling a multi-class problem with 4 classes
by using 4 output values, where each target y (i) is actually a length-4 vector with
three 0 values and a single 1 value, indicating which class the associated x(i)
belongs to.
Jody suggests that you just encode the target y (i) values using integers 1, 2, 3,
and 4.
What’s wrong with Jody’s approach?
Integer targets impose an artificial ordering and distance on the classes: with targets 1, 2, 3, and 4, the loss penalizes confusing classes 1 and 4 more than confusing classes 1 and 2, even though the classes are unordered. One-hot vectors treat the classes symmetrically, letting the model assign weights to each class independently by toggling one entry on and the others off.
(d) [2 points] Evelyn overhears your conversation with Jody and suggests that you
use as a target value y (i) a vector of two output values, each of which can be 0 or
1, to encode which one of four classes it belongs to.
What is good or bad about Evelyn’s approach?
Good: binary coding is compact; it needs only log2(4) = 2 outputs instead of 4, reducing the output dimension. Bad: each output bit is shared between two classes, so classes whose codes share a bit are treated as similar even when they are not (e.g., the codes for the 3rd and 4th classes may differ in a single bit), and it is hard to define per-class probabilities and to implement an optimization algorithm that minimizes a loss such as cross entropy over this encoding.
10. Maximum Likelihood and Maximum A Posteriori [11 points]
Your company has developed a test for COVID-19. The test has a false positive rate
of α, and a false negative rate of β.
(a) [2 points] Assume that COVID-19 is evenly distributed through the population,
and that the prevalence of the disease is γ. What is the accuracy of your test on
the general population?
Accuracy = Pr(test correct) = Pr(positive | diseased) γ + Pr(negative | healthy)(1 − γ) = (1 − β) γ + (1 − α)(1 − γ),
since a diseased person is detected unless the test produces a false negative (rate β), and a healthy person is cleared unless it produces a false positive (rate α).
(b) [2 points] Your company conducts a clinical trial. They find n people, all of
whom they know have COVID. What is the likelihood that your test makes n+
correct predictions?
All n people have COVID, so a prediction is correct exactly when the test is positive, which happens with probability 1 − β per person. The likelihood of n+ correct predictions is binomial:
L(β) = C(n, n+) (1 − β)^{n+} β^{n − n+}.
(c) [5 points] Derive the maximum likelihood estimate for β. You may assume all other parameters are fixed.
The log-likelihood is
ℓ(β) = const + n+ log(1 − β) + (n − n+) log β.
Setting the derivative to zero:
dℓ/dβ = −n+ / (1 − β) + (n − n+) / β = 0 ⟹ (n − n+)(1 − β) = n+ β ⟹ β̂_ML = (n − n+) / n,
the observed fraction of false negatives in the trial.
(d) [4 points] Derive the MAP estimate for β, assuming it has a prior P(β) = Beta(a, b). You may assume all other parameters are fixed.
β̂_MAP = argmax_β [ log P(β) + log L(β) ]. With P(β) = Beta(a, b) ∝ β^{a−1}(1 − β)^{b−1}, the log posterior is
const + (n − n+ + a − 1) log β + (n+ + b − 1) log(1 − β).
Setting the derivative to zero:
(n − n+ + a − 1) / β = (n+ + b − 1) / (1 − β) ⟹ β̂_MAP = (n − n+ + a − 1) / (n + a + b − 2).
The prior acts as a − 1 pseudo false negatives and b − 1 pseudo true positives added to the counts.
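A numeric check of both estimates against a generic optimizer, with assumed trial numbers:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Check the closed-form MLE and MAP against a numeric optimizer
# (assumed trial outcome: n = 50 diseased people, n_plus = 40 test positive).
n, n_plus, a, b = 50, 40, 3.0, 2.0

def neg_log_post(beta, a=1.0, b=1.0):   # a = b = 1 gives the plain likelihood
    return -((n - n_plus + a - 1) * np.log(beta)
             + (n_plus + b - 1) * np.log(1 - beta))

mle = minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6),
                      method="bounded").x
map_ = minimize_scalar(lambda t: neg_log_post(t, a, b),
                       bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(mle, (n - n_plus) / n)                         # ~0.200 both
print(map_, (n - n_plus + a - 1) / (n + a + b - 2))  # ~0.226 both
```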
11. Principal Component Analysis [11 points]
Suppose that you are given a dataset of N samples, i.e., x^(i) ∈ R^D for i = 1, 2, ..., N. We say that the data is centered if the mean of samples is equal to 0, i.e., (1/N) Σ_{i=1}^N x^(i) = 0.
For a given unit direction u such that ||u||_2 = 1, we denote by P_u(x) the Euclidean projection of x on u. Recall that projection is given by
P_u(x) = (u^T x) u
(a) [3 points] Mean of data after projecting on u: Show that the projected data with samples P_u(x^(i)) in any unit direction u is still centered. That is, show
(1/N) Σ_{i=1}^N P_u(x^(i)) = 0.
(1/N) Σ_i P_u(x^(i)) = (1/N) Σ_i (u^T x^(i)) u = ( u^T [ (1/N) Σ_i x^(i) ] ) u = (u^T 0) u = 0,
since u does not change with i and can be pulled out of the sum, and the data is centered.
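A quick numpy check of (a) on assumed random data:

```python
import numpy as np

# Projections of centered data onto any unit direction are centered.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)                      # center the data
u = rng.normal(size=4); u /= np.linalg.norm(u)
P = np.outer(X @ u, u)                   # rows are P_u(x^(i)) = (u^T x) u
print(np.abs(P.mean(axis=0)).max())      # ~0
```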
(b) [4 points] Maximum variance: Recall that the first principal component u* is given by the largest eigenvector of the sample covariance (the eigenvector associated with the largest eigenvalue). That is,
u* = argmax_{u : ||u||_2 = 1} u^T [ (1/N) Σ_{i=1}^N x^(i) [x^(i)]^T ] u.
Show that the unit direction u that maximizes the variance of the projected data corresponds to the first principal component of the data. That is, show
u* = argmax_{u : ||u||_2 = 1} Σ_{i=1}^N || P_u(x^(i)) − (1/N) Σ_{j=1}^N P_u(x^(j)) ||_2².
By part (a), the projected data has mean 0, so the objective is
Σ_i ||P_u(x^(i))||_2² = Σ_i ||(u^T x^(i)) u||_2² = Σ_i (u^T x^(i))²      (since ||u||_2 = 1)
= Σ_i u^T x^(i) [x^(i)]^T u = N · u^T Σ̂ u,   where Σ̂ = (1/N) Σ_i x^(i) [x^(i)]^T.
The constant factor N does not change the argmax, so maximizing the projected variance over unit u is exactly the problem defining u*. Writing u = Σ_k c_k v_k in the eigenbasis {v_k} of the symmetric matrix Σ̂, with Σ_k c_k² = 1, gives u^T Σ̂ u = Σ_k c_k² λ_k ≤ λ_max, with equality when u is the eigenvector of the largest eigenvalue. Hence the maximizer is the first principal component u*.
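A numeric illustration of (b): the top eigenvector beats randomly drawn unit directions.

```python
import numpy as np

# The top eigenvector of the sample covariance maximizes the variance of
# the projections (compared against many random unit directions).
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
X -= X.mean(axis=0)
S = X.T @ X / len(X)
vals, vecs = np.linalg.eigh(S)
u_star = vecs[:, -1]                     # eigenvector of the largest eigenvalue

def proj_var(u):
    return ((X @ u) ** 2).mean()

best_random = max(proj_var(u / np.linalg.norm(u))
                  for u in rng.normal(size=(2000, 5)))
print(proj_var(u_star) >= best_random)   # True
```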
(c) [4 points] Minimum error: Show that the unit direction u that minimizes the mean squared error between projected data points P_u(x^(i)) and the original data points x^(i) corresponds to the first principal component u*. That is, show
u* = argmin_{u : ||u||_2 = 1} Σ_{i=1}^N || x^(i) − P_u(x^(i)) ||_2².
P_u(x^(i)) is the orthogonal projection of x^(i) onto u, so x^(i) − P_u(x^(i)) is orthogonal to P_u(x^(i)); by the Pythagorean theorem,
||x^(i) − P_u(x^(i))||_2² = ||x^(i)||_2² − ||P_u(x^(i))||_2².
Summing over i,
argmin_{||u||_2 = 1} Σ_i ||x^(i) − P_u(x^(i))||_2² = argmax_{||u||_2 = 1} Σ_i ||P_u(x^(i))||_2²,
since Σ_i ||x^(i)||_2² is a constant that does not depend on u. By part (b), this maximizer is the first principal component u*.
12. Probabilistic Models [15 points]
We would like to build a model that predicts a disease D in a patient. Assume we have
two classes in our label set t: diseased and healthy. We have a set of binary patient
features x = [x1 , ..., xN ] that describe the various comorbidities of our patient. We
want to write down the joint distribution of our model.
(a) [4 points] What is the problem in calculating this joint distribution? How many
parameters would be required to construct this model? How does the Naive Bayes
model handle this problem?
The joint distribution p(x, t) = p(t) p(x | t) requires specifying p(x | t) directly over all 2^N configurations of the N binary features: 2(2^N − 1) + 1 parameters in total, which is exponential in N and intractable to estimate. The Naive Bayes model handles this by assuming the features are conditionally independent given the class, so that p(x | t) factorizes into N per-feature distributions and the parameter count becomes linear in N.
(b) [3 points] Write down the joint distribution under the Naive Bayes assumption.
How many parameters are specified under the model now?
Under the Naive Bayes assumption the joint distribution factorizes as
p(x, t) = p(t) Π_{j=1}^N p(x_j | t).
We only need the prior probability p(t) (1 parameter) and the conditional probability p(x_j = 1 | t) for each of the N binary features and each of the 2 classes (2N parameters), so 2N + 1 parameters are specified in total.
(c) [4 points] Why does the Naive Bayes assumption allow for the model parameters
to be learned efficiently? Write down the log-likelihood to explain why. Here,
assume that you observed N data points (x(i) , t(i) ) for i = 1, 2, ..., N .
Under the Naive Bayes assumption the log-likelihood decomposes into a sum of separate terms:
ℓ = Σ_{i=1}^N log p(x^(i), t^(i)) = Σ_{i=1}^N [ log p(t^(i)) + Σ_j log p(x_j^(i) | t^(i)) ].
Each parameter (the prior, and each per-feature conditional) appears in its own group of terms, so the objective can be maximized with respect to each parameter independently, in closed form by counting; this is why learning is efficient.
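A minimal sketch of this closed-form counting on assumed toy data; it also illustrates the Bayes-rule prediction of part (d) and the Beta/pseudo-count smoothing of part (e):

```python
import numpy as np

# Naive Bayes fitting by counting, on assumed toy data (rows = patients,
# columns = binary comorbidity features); not from the original exam.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0]])
t = np.array([1, 1, 0, 0])

prior = t.mean()                         # p(t = 1), closed form
theta = {c: (X[t == c].sum(axis=0) + 1) / ((t == c).sum() + 2)
         for c in (0, 1)}                # p(x_j = 1 | t = c), Beta-smoothed

def log_joint(x, c):
    """log p(x, t = c) under the Naive Bayes factorization."""
    p_t = prior if c == 1 else 1 - prior
    p = theta[c]
    return np.log(p_t) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

x_new = np.array([1, 0, 1])
print(max((0, 1), key=lambda c: log_joint(x_new, c)))  # Bayes-rule prediction
```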
(d) [1 point] Show how you can use Bayes rule to predict a class given a data point.
By Bayes rule,
p(t | x) = p(t) p(x | t) / Σ_{t'} p(t') p(x | t'),
and we predict the class t with the largest posterior; since the denominator is the same for both classes, it suffices to compare p(t) p(x | t).
(e) [3 points] Explain how placing a Beta prior on parameters is helpful for the
Naive Bayes model.
The Beta distribution is the conjugate prior for the Bernoulli parameters of the Naive Bayes model, so the posterior over each parameter has the same Beta form and MAP estimation stays in closed form. The prior can be thought of as pseudo-counts added to the observed counts; this smooths the parameter estimates and prevents a feature value that was never seen in training from forcing a zero probability at inference time.
13. K-means [12 points]
In the following plots of k-means clusters, circles with different colors represent points in different clusters, and stars represent cluster centers. For each plot, describe the problem with the data or algorithm, and propose a solution to find a more appropriate clustering.
(a) [2 points]
The algorithm converged to a bad local optimum: one star covers two natural groups while two stars split a single big group. Solution: rerun k-means from several random starts and keep the clustering with the least distortion, or use non-local moves that simultaneously split the big cluster and merge the two nearby centers.
(b) [2 points]
The algorithm has not converged yet: the point assignments do not match the nearest centers. Keep running the assignment and refitting steps until the assignments no longer change, at which point we are at a local minimum. (If it has in fact converged, two centers are stuck sharing nearby clusters and we should restart from a different initialization.)
(c) [2 points]
The hyperparameter k (the number of clusters) is set too large for this data set, which results in poor clusterings that split natural groups. Solution: reduce k, and run a diagnostic over a range of k (for example, checking the distortion curve) to determine a more appropriate number of clusters.
(d) [3 points] The k-means algorithm is designed to reduce the mean squared error between the cluster centers and their assigned points. Will increasing the value of k ever increase this error term? Why or why not?
No. Increasing k always reduces the SSE: adding a center can only decrease each point's distance to its nearest center, and the assignment and refitting steps then only reduce the error further. More clusters essentially always reduce the average within-cluster variance (distortion) of the data.
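An illustration on assumed synthetic data, using scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

# The best-found k-means SSE (inertia) is non-increasing in k on a
# fixed data set (assumed synthetic data with three groups).
rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(m, 0.5, size=(100, 2)) for m in (0, 4, 8)])
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 8)]
print(all(a >= b for a, b in zip(sse, sse[1:])))   # True (monotone decrease)
```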
(e) [3 points] Suppose we achieve a very low error with a large value of k, and a
higher error with a smaller value of k. Should we select the k value with the
lowest error? Why or why not?
No. The SSE decreases monotonically as k grows: in the extreme case where k equals the number of data points, every point gets its own cluster and the error is zero, but this defeats the purpose of clustering. A better rule of thumb is the elbow method: find the k where the error curve abruptly flattens out and use that value, since increasing k further buys little.
14. Expectation-Maximization [8 points]
Here we are estimating a mixture of two Gaussians via the EM algorithm. The mixture
distribution over x is given by
P(x; θ) = P(1) N(x; µ1, σ1²) + P(2) N(x; µ2, σ2²)
Any student in this class could solve this estimation problem easily. Well, one student,
devious as they were, scrambled the order of figures illustrating EM updates. They may
have also slipped in a figure that does not belong. Your task is to extract the figures of
successive updates and explain why your ordering makes sense from the point of view
of how the EM algorithm works. All the figures plot P(1) N(x; µ1, σ1²) as a function of x with a solid line and P(2) N(x; µ2, σ2²) with a dashed line.
(a) [2 points] (True/False) In the mixture model, we can identify the most likely posterior assignment, i.e., j that maximizes P(j | x), by comparing the values of P(1) N(x; µ1, σ1²) and P(2) N(x; µ2, σ2²)
TRUE. The posterior is P(j | x) = P(j) N(x; µj, σj²) / P(x), and the denominator P(x) is the same for both components, so comparing the two numerator values identifies exactly the j that maximizes the posterior.
(b) [2 points] Assign two figures to the correct steps in the EM algorithm.
- Step 0: (a) initial mixture distribution
- Step 1: (c) after one EM-iteration
(c) [4 points] Briefly explain how the mixture you chose for “step 1” follows from
the mixture you have in “step 0”.
In the E-step, each data point is assigned a responsibility (its posterior probability) for each Gaussian component, computed with the current parameters. In the M-step, each component's mean, standard deviation, and mixing probability are updated to fit the responsibility-weighted data. In step 0 the solid-line component is far to the left of the data, so it takes high responsibility only for the leftmost points; after one EM iteration its mean shifts closer to the data and its standard deviation becomes much larger, so the solid curve is broad, while the dashed component simultaneously moves to account for the data it is responsible for. This is exactly the mixture shown in step 1.
Bonus [+1 Point] Guess the percentage grade that you will receive on this examination
out of 100% (excluding this bonus question and any formatting penalty). You will receive 1
point if your guess is within 5% of your actual mark.
80%
Survey Questions
The following survey questions are completely optional, and will not impact your grade
in any way. By answering these questions, you give permission for your data to be used in
aggregate descriptive statistics. Results from these analyses will only be disseminated to
students in the class.
A1. I have previously taken an “Introduction to machine learning” or equivalent course (at
any university).
Yes ✓
No
A2. My home department for graduate studies is:
CS
ECE
Other:
MIE
A3. I would describe the overall difficulty of the course so far as:
Very easy
Easy
Average
Hard
Very hard ✓