CSC2515 Fall 2020 Midterm Test

Start your scan at this page.

1. True/False [16 points]

For each statement below, say whether it is true or false, and give a one or two sentence justification of your answer.

(a) [2 points] As we increase the number of parameters in a model, both the bias and variance of our predictions should decrease since we can better fit the data.

False. Although the added parameters help the model capture more structure in the data and reduce the bias, a more complex model will overfit, so the variance of its predictions increases.

(b) [2 points] We can always improve the performance of clustering algorithms like k-means by removing features and reducing the dimensionality of our data.

False. Unless the removed features are uninformative, reducing the dimensionality causes a loss of information; a low-dimensional set of factors of variation may not capture what is intrinsically important in the data.

(c) [2 points] Bagging and boosting are both ways of reducing the variance of our models.

True. Bagging reduces the variance of models because it averages models trained on different samples of the data; boosting reduces variance because the final model is a weighted average over weak learners.

(d) [2 points] Machine learning models work automatically without the need for the user to manually set hyperparameters.

False. The user needs to set hyperparameters manually from the beginning, e.g., k in k-NN or the learning rate and batch size in mini-batch stochastic gradient descent. The best hyperparameter value for a given model is usually selected by comparing validation errors over a user-defined range, not automatically.

(e) [2 points] Most machine learning algorithms, in general, can learn a model with a Bayes optimal error rate.

False. The Bayes optimal error rate is the best achievable error, corresponding to zero bias and only the irreducible variance.
Hardly any learning algorithm covered in this course can ever hope to achieve it.

(f) [2 points] The k-means algorithm is guaranteed to converge to the global minimum.

False. The k-means objective is not convex, so the alternating (coordinate-descent-style) algorithm can get stuck in local minima; nothing prevents it from converging to a local rather than the global minimum.

(g) [2 points] When optimizing a regression problem where we know that only some of our features are useful, we should use L1 regularization.

True. If you believe that only a small fraction of the features is likely to be relevant, L1 regularization encourages many weights to be exactly zero and gives you a sparse optimal weight vector.

(h) [2 points] For a given binary classification problem, a linear SVM model and a logistic regression model will converge to the same decision boundary.

False. Logistic regression minimizes the cross-entropy loss while a linear SVM minimizes the hinge loss with L2 regularization. The two losses have similar asymptotic behavior, so the decision boundaries will be similar, but the gradients, and hence the learned weights, are not the same, and thus neither are the boundaries.

2. Probability and Bayes Rule [14 points]

You have just purchased a two-sided die, which can come up either 1 or 2. You want to use your crazy die in some betting games with friends later this evening, but first you want to know the probability that it will roll a 1. You know it came either from factory 0 or factory 1, but not which. Factory 0 produces dice that roll a 1 with probability θ0. Factory 1 produces dice that roll a 1 with probability θ1. You believe initially that your die came from factory 1 with probability η1.
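Before working through the parts below, here is an illustrative numerical sketch (not part of the exam) of the Bayes updates for this die; the part-(c) values θ0 = 1, θ1 = 0.5, η1 = 0.7 are hard-coded, and the function names are my own:

```python
# Hypothetical worked example of the two-factory die model.
# theta0, theta1: each factory's probability of rolling a 1;
# eta1: current belief that the die came from factory 1.

def predictive_p1(eta1, theta0, theta1):
    """Predictive probability of rolling a 1 under the current belief."""
    return eta1 * theta1 + (1.0 - eta1) * theta0

def posterior_factory1(eta1, theta0, theta1, roll):
    """Posterior P(factory = 1 | observed roll) via Bayes rule."""
    like1 = theta1 if roll == 1 else 1.0 - theta1
    like0 = theta0 if roll == 1 else 1.0 - theta0
    num = like1 * eta1
    return num / (num + like0 * (1.0 - eta1))

# Values from part (c): factory 0 always rolls 1, factory 1 is fair,
# prior belief 0.7 that the die is from factory 1.
eta1, theta0, theta1 = 0.7, 1.0, 0.5
print(predictive_p1(eta1, theta0, theta1))           # ~0.65
eta_after_one = posterior_factory1(eta1, theta0, theta1, 1)
print(eta_after_one)                                  # ~0.538
print(predictive_p1(eta_after_one, theta0, theta1))   # ~0.731
```

Calling `posterior_factory1` again on the updated belief reproduces the two-roll posteriors derived below.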
(a) [2 points] Without seeing any rolls of this die, what would be your predicted probability that it would roll a 1?

Let F ∈ {0, 1} denote the factory. By the law of total probability,
P(roll = 1) = P(roll = 1 | F = 1) η1 + P(roll = 1 | F = 0)(1 − η1) = θ1 η1 + θ0 (1 − η1).

(b) [2 points] If we roll the die and observe the outcome, what can we infer about where the die was manufactured?

Let x denote the observed roll. By Bayes rule, the posterior over the factory is
P(F = 1 | x) = P(x | F = 1) η1 / [ P(x | F = 1) η1 + P(x | F = 0)(1 − η1) ],
so observing the outcome lets us update our belief about which factory the die came from.

(c) [2 points] More concretely, let's assume that:
• θ0 = 1: dice from factory 0 always roll a 1
• θ1 = 0.5: dice from factory 1 are fair (roll a 1 with probability 0.5)
• η1 = 0.7: we think with probability 0.7 that this die came from factory 1

With these values, the predicted probability of rolling a 1 is
P(roll = 1) = 0.7 × 0.5 + 0.3 × 1 = 0.65.

(d) [2 points] Now we roll it, and it comes up 1! What is your posterior distribution on which factory it came from? What is your predictive distribution on the value of the next roll?

Posterior: P(F = 1 | 1) = (0.5 × 0.7) / 0.65 = 0.35 / 0.65 ≈ 0.538, and P(F = 0 | 1) = 0.3 / 0.65 ≈ 0.462.
Predictive: P(next = 1) = 0.538 × 0.5 + 0.462 × 1 ≈ 0.731, so P(next = 2) ≈ 0.269.

(e) [2 points] You roll it again, and it comes up 1 again. Now, what is your posterior distribution on which factory it came from? What is your predictive distribution on the value of the next roll?

Posterior: P(F = 1 | 1, 1) = (0.5 × 0.538) / (0.5 × 0.538 + 1 × 0.462) = 0.269 / 0.731 ≈ 0.368, and P(F = 0 | 1, 1) ≈ 0.632.
Predictive: P(next = 1) = 0.368 × 0.5 + 0.632 × 1 ≈ 0.816.

(f) [2 points] Instead, what if it rolls a 2 on the second roll?
Since dice from factory 0 never roll a 2, P(2 | F = 0) = 0, so the posterior puts all of its mass on factory 1: P(F = 1 | 1, 2) = 1. The die must be from factory 1, and the predictive probability of rolling a 1 on the next roll is θ1 = 0.5.

(g) [2 points] In the general case (not using the numerical values we have been using) prove that if you have two observations, and you use them to update your prior in two steps (first conditioning on one observation and then conditioning on the second), that no matter which order you do the updates in you will get the same result.

With observations x1, x2 conditionally independent given the factory F, updating on x1 and then on x2 gives
P(F | x1, x2) ∝ P(x2 | F) P(F | x1) ∝ P(x2 | F) P(x1 | F) P(F).
Because multiplication is commutative, the product P(x1 | F) P(x2 | F) P(F) is the same regardless of the order in which the two likelihood factors are applied, and the normalizing constant P(x1, x2) does not depend on the order either. Hence both update orders yield the same posterior.

3. Decision Trees & Information Gain [10 points]

[Figure: a depth-2 decision tree. The root node contains (16, 16) examples and splits into children (12, 4) and (4, 12); the (12, 4) child has a left leaf with (x, y) examples, and the (4, 12) child has a left leaf with (a, b) examples.]

The figure above shows a decision tree. The numbers in each node represent the number of examples in each class in that node. For example, in the root node, the numbers 16, 16 represent 16 examples in class 0 and 16 examples in class 1. To save space, we have not written the number of examples in two right leaf nodes as they can be deduced based on the values in the left leaf nodes.

(a) [2 points] What is the information gain of the split made at the root node?

IG = H(root) − [ (16/32) H(12/16, 4/16) + (16/32) H(4/16, 12/16) ]
= 1 − H(3/4, 1/4) = 1 − (0.75 log2(4/3) + 0.25 log2 4) ≈ 1 − 0.811 = 0.19.

(b) [2 points] What values of a and b will give the smallest possible value of the information gain at this split, and what is the value?

The information gain is smallest, IG = 0, when the split preserves the parent's class proportions, i.e., a : b = 4 : 12, for example a = 1, b = 3; then both children have entropy H(1/4, 3/4), the same as the parent (4, 12) node.

(c) [2 points] What values of x and y will give the largest possible value of the information gain at this split, and what is the value?

The gain is largest when the split separates the classes completely, e.g., x = 12, y = 0 (or x = 0, y = 4), so that both leaves are pure. Then both children have zero entropy and IG = H(12/16, 4/16) = H(3/4, 1/4) ≈ 0.811.

(d) [4 points] Bayes rule for conditional entropy states that H(Y|X) = H(X|Y) − H(X) + H(Y).
Using the definition of conditional and joint entropy, show that this is the case. Hint: First show that H(Y|X) = H(X, Y) − H(X).

By definition,
H(Y|X) = −Σ_{x,y} p(x, y) log p(y|x) = −Σ_{x,y} p(x, y) log [ p(x, y) / p(x) ]
= −Σ_{x,y} p(x, y) log p(x, y) + Σ_{x,y} p(x, y) log p(x)
= H(X, Y) − H(X),
where the last step uses Σ_y p(x, y) = p(x). By a symmetric argument (switching the roles of X and Y),
H(X|Y) = H(X, Y) − H(Y).
Subtracting the two identities gives H(Y|X) − H(X|Y) = H(Y) − H(X), i.e., H(Y|X) = H(X|Y) − H(X) + H(Y), as required.

4. Bayes Optimality [10 points]

(a) [2 points] Consider a regression task with the following data generating process:
x ~ N(μx, Σx)
t | x ~ wᵀx + α xᵀx + N(μ1, σ1²) + N(μ2, σ2²)
where x, μx ∈ R^D, Σx ∈ R^{D×D}; w ∈ R^D and α ∈ R are parameters. At a particular query point x* ∈ R^D, what is the output of the Bayes optimal regressor y* = f*(x*)?

The Bayes optimal regressor outputs the conditional mean:
y* = E[t | x*] = wᵀx* + α x*ᵀx* + μ1 + μ2.

(b) [2 points] For the data generating process in part (a), at a particular query point x* ∈ R^D, what is the expected squared error (defined as (t − y*)²) of the Bayes optimal regressor?

Since y* = E[t | x*], the expected squared error is the conditional variance of t, which comes entirely from the two independent noise terms:
E[(t − y*)² | x*] = Var(t | x*) = σ1² + σ2².

(c) [2 points] Consider a 3-class classification problem with the following data generating process:
x ~ unif(0, 1)
t | x: t = 0 with p = 0.2; t = 1 with p = 0.5x; t = 2 with p = 0.8 − 0.5x
where x ∈ R and t ∈ {0, 1, 2}.
What is the expression for the Bayes optimal classifier?

The Bayes optimal classifier picks the most probable class at each x: y*(x) = argmax_t p(t | x). Class 0 has constant probability 0.2, which is always below p(t = 2 | x) = 0.8 − 0.5x ≥ 0.3 on (0, 1). Comparing classes 1 and 2: 0.5x > 0.8 − 0.5x exactly when x > 0.8. Hence
y*(x) = 2 if x ≤ 0.8, and y*(x) = 1 if x > 0.8.

(d) [4 points] What is the misclassification rate of the Bayes optimal classifier in part (c)? Your answer should be a scalar.

Misclassification rate = E_x[ 1 − max_t p(t | x) ]
= ∫ from 0 to 0.8 of (1 − (0.8 − 0.5x)) dx + ∫ from 0.8 to 1 of (1 − 0.5x) dx
= ∫ from 0 to 0.8 of (0.2 + 0.5x) dx + ∫ from 0.8 to 1 of (1 − 0.5x) dx
= (0.16 + 0.16) + (0.2 − 0.09) = 0.32 + 0.11 = 0.43.

5. Linear/Logistic Regression [10 points]

(a) [5 points] In lecture, we considered using linear regression for binary classification on the targets {0, 1}. Here, we use a linear model y = wᵀx + b and squared error loss L(y, t) = ½(y − t)². We saw this has the problem that it penalizes confident correct predictions. Will it fix this problem if we instead use a modified hinge loss, shown below?
L(y, t) = max(0, y) if t = 0; 1 − min(1, y) if t = 1
Justify your answer mathematically.

Yes, this fixes the problem. If t = 1 and the prediction is confidently correct (y ≥ 1), then L = 1 − min(1, y) = 1 − 1 = 0; if t = 0 and y ≤ 0, then L = max(0, y) = 0. So confident correct predictions incur zero loss instead of a growing penalty. Incorrect predictions are still penalized: for t = 1 the loss increases linearly as y falls below 1, and for t = 0 it increases linearly as y rises above 0. This modified hinge loss therefore behaves like the hinge loss used for SVMs, and on confident predictions it behaves similarly to cross-entropy applied to squashed (sigmoid) outputs.

(b) [5 points] Consider the logistic regression model y = g(wᵀx), trained using the binary cross entropy loss function, where g(z) = 1 / (1 + e^{−z}) is the sigmoid function.
Your friend proposes a modified version of logistic regression, using g(z) = e^{−z} / (1 + e^{−z}). The model would still be trained using the binary cross entropy loss. How would the learnt model parameters, as well as the model predictions, differ from conventional logistic regression? Justify your answer mathematically. A purely graphical explanation is not sufficient.

Note that e^{−z} / (1 + e^{−z}) = 1 / (1 + e^{z}) = σ(−z), where σ is the conventional sigmoid. So the proposed model computes y = σ(−wᵀx): the conventional model with the sign of the activation reversed. Training with binary cross entropy therefore learns weights whose signs are flipped relative to conventional logistic regression: if w* is optimal for the conventional model, then −w* is optimal for the modified one, since σ(−(−w*)ᵀx) = σ(w*ᵀx) gives identical outputs and hence identical loss. The learnt parameters are negated, and the model predictions are exactly the same.

6. L1/L2 Regularization [6 points]

(a) [2 points] The L∞ (max) norm is defined on a vector x as ‖x‖∞ = max({|xi| : i = 1, 2, . . . , n}). As an example, if x = [−6, 2, 4], then ‖x‖∞ = 6 since the magnitude of the largest element in x is 6. On the loss plot below, draw the L∞ norm and draw a point at the set of weights the regularization term will select.

[Figure: loss contours over weights (w1, w2); the L∞ ball (an axis-aligned square) is drawn with a point marked where the smallest loss contour touches it.]

(b) [4 points] What kind of weights does the L∞ norm encourage our models to learn? In which situations might this be useful?

The L∞ penalty acts only on the largest-magnitude weight, so it encourages weight vectors whose entries all stay within a bounded magnitude, while having no effect on the rest of the weights once they are below the maximum. This is useful when we need to constrain an upper bound on the weights, for example to keep the weights bounded in neural networks.

7. Classification Decision Boundaries [10 points]

[Figure: a 2-D training set on which the boundaries f1–f4 and their margins are to be sketched.]

(a) [1 point] Assume this is the first step in the boosting algorithm. In the figure above, sketch the decision boundary for the decision stump, optimizing for weighted misclassification error.
Label this boundary f1.

(b) [2 points] Select an arbitrary point which f1 incorrectly classifies. Draw a circle around the point you select. What is its weight in the next boosting iteration? Assume the weights initially sum to 1.

If f1 has weighted error err, AdaBoost sets α = ½ log((1 − err)/err) and multiplies the weight of each misclassified point by exp(α) before renormalizing. After normalization the misclassified points together carry half of the total weight, so if f1 misclassifies m points, the circled point's new weight is 1/(2m); the numeric value follows from counting the points misclassified by f1 in the figure.

(c) [1 point] In the figure above, sketch the decision boundary for a 1-layer decision tree, optimizing for information gain. Label this boundary f2. You do not have to show that the boundary is optimal.

(d) [3 points] Compute the information gain of the split f2.

IG = H(parent) − Σ over children of (n_child / n) H(child). With the class counts on either side of f2 in the figure, the impure side has entropy ≈ 0.811 and the gain evaluates to approximately 0.46.

(e) [1 point] In the figure above, sketch the decision boundary for a decision tree with unlimited depth. Label this boundary f3.

(f) [2 points] In the figure above, sketch the decision boundary for a hard-margin SVM. Label this boundary f4. Is the boundary unique? You do not have to justify your answer.

No, the boundary is not unique.

8. SVM [9 points]

Here are plots of wᵀx + b = 0 for different training methods along with the support vectors. Points labeled +1 are in blue, points labeled −1 are in red. The line wᵀx + b = 0 is shown in bold; in addition we show the lines wᵀx + b = 1 and wᵀx + b = −1 in non-bold. Support vectors have bold circles surrounding them.

[Figure: four panels, 1)–4), each plotting the training points on axes from −3 to 3 with a candidate boundary in bold, its margin lines in non-bold, and circled support vectors.]

For each method, list all figures that show a possible solution for that method and give a brief explanation. Note that some methods may have more than one possible figure.

(a) [3 points]
min over w, b of ½‖w‖² + Σ_{t=1}^n ξ_t s.t. ξ_t ≥ 0, t(i)(wᵀx(i) + b) ≥ 1 − ξ_t, t = 1, . . . , n
This is the standard soft-margin SVM trained over w and b: the margin is maximized while the slack penalty allows some points to violate the margin, and the bias term lets the boundary sit anywhere, so it need not pass through the origin. Any figure showing a maximized margin, with support vectors on or inside the margin lines and a boundary with an offset, is a possible solution.

(b) [3 points]
min over w of ½‖w‖² + Σ_{t=1}^n ξ_t s.t. ξ_t ≥ 0, t(i)(wᵀx(i)) ≥ 1 − ξ_t, t = 1, . . . , n

Without the bias term b, the boundary must pass through the origin; otherwise this is the same soft-margin objective, maximizing the margin while allowing slack. Only figures whose boundary passes through the origin are possible solutions.

(c) [3 points]
min over w of ½‖w‖² + Σ_{t=1}^n ξ_t s.t. ξ_t ≥ 0, t(i)(wᵀx(i)) ≥ 1 − ξ_t, t = 1, . . . , n

As in (b), there is no bias term, so again only figures whose boundary passes through the origin, with the margin maximized subject to the slack penalty, are possible solutions.

9. Gradient Descent [8 points]

For a homework assignment, you have implemented a logistic regression model y(i) = σ(W x(i) + b). You decide to train your model on a multi-class problem using gradient descent. However, before you can turn your assignment in, your friends stop by to give you some suggestions.

(a) [2 points] Pat sees your regressor implementation and says there's a much simpler way! Just leave out sigmoids, and let f(x) = W x + b. The derivatives are a lot less hassle and it runs faster. What's wrong with Pat's approach?

Without the sigmoid, the outputs are not squashed, so correctly classified points far from the boundary produce predictions of large magnitude and therefore large residuals. The loss penalizes these residuals, i.e., the model is penalized for confident correct predictions, and those large residuals can dominate training.

(b) [2 points] Chris comes in and says that your regressor is too complicated, but for a different reason. The sigmoids are confusing and basically the same answer would be computed if we used step functions instead. What's wrong with Chris's approach?

The step function's derivative is zero everywhere it is defined (and undefined at the threshold), so the gradient carries no information: small changes to the weights have no effect on the loss,
and gradient descent therefore never updates the weights, so the model cannot be trained.

(c) [2 points] Jody sees that you are handling a multi-class problem with 4 classes by using 4 output values, where each target y(i) is actually a length-4 vector with three 0 values and a single 1 value, indicating which class the associated x(i) belongs to. Jody suggests that you just encode the target y(i) values using integers 1, 2, 3, and 4. What's wrong with Jody's approach?

Integer targets impose an artificial ordering and distance on the classes: the model is penalized as if class 4 were "closer" to class 3 than to class 1, even though the classes are unordered. With one-hot vectors, each class has its own output entry, so the model can learn each class independently (toggling entries on and off) instead of regressing toward arbitrary label values.

(d) [2 points] Evelyn overhears your conversation with Jody and suggests that you use as a target value y(i) a vector of two output values, each of which can be 0 or 1, to encode which one of four classes it belongs to. What is good or bad about Evelyn's approach?

Good: the binary coding is compact, needing only log2(4) = 2 outputs instead of 4, a factor-of-2 reduction in output dimension. Bad: each output bit must represent an arbitrary union of classes (e.g., one bit must separate classes {1, 2} from {3, 4}), so unrelated classes are forced to share output structure, which makes the functions harder to learn and the optimization harder.

10. Maximum Likelihood and Maximum A Posteriori [11 points]

Your company has developed a test for COVID-19. The test has a false positive rate of α, and a false negative rate of β.

(a) [2 points] Assume that COVID-19 is evenly distributed through the population, and that the prevalence of the disease is γ. What is the accuracy of your test on the general population?

A diseased person (probability γ) is correctly detected with probability 1 − β (true positive), and a healthy person (probability 1 − γ) is correctly cleared with probability 1 − α (true negative). So
accuracy = γ(1 − β) + (1 − γ)(1 − α).

(b) [2 points] Your company conducts a clinical trial. They find n people, all of whom they know have COVID.
What is the likelihood that your test makes n+ correct predictions?

Each of the n people has the disease, so a correct (positive) prediction occurs with probability 1 − β and a false negative with probability β. The number of correct predictions is binomial:
P(n+ | β) = C(n, n+) (1 − β)^{n+} β^{n − n+}.

(c) [5 points] Derive the maximum likelihood estimate for β. You may assume all other parameters are fixed.

The log-likelihood is ℓ(β) = n+ log(1 − β) + (n − n+) log β + const. Setting the derivative to zero,
dℓ/dβ = −n+ / (1 − β) + (n − n+) / β = 0
gives (n − n+)(1 − β) = n+ β, i.e., n − n+ = nβ, so
β̂_ML = (n − n+) / n.

(d) [4 points] Derive the MAP estimate for β, assuming it has a prior P(β) = Beta(a, b). You may assume all other parameters are fixed.

β̂_MAP = argmax over β of [ log P(n+ | β) + log P(β) ], where log P(β) = (a − 1) log β + (b − 1) log(1 − β) + const. So we maximize
(n − n+ + a − 1) log β + (n+ + b − 1) log(1 − β).
Setting the derivative to zero, (n − n+ + a − 1)/β = (n+ + b − 1)/(1 − β), which gives
β̂_MAP = (n − n+ + a − 1) / (n + a + b − 2).

11. Principal Component Analysis [11 points]

Suppose that you are given a dataset of N samples, i.e., x(i) ∈ R^D for i = 1, 2, ..., N. We say that the data is centered if the mean of the samples is equal to 0, i.e., (1/N) Σ_{i=1}^N x(i) = 0. For a given unit direction u such that ‖u‖2 = 1, we denote by Pu(x) the Euclidean projection of x on u. Recall that the projection is given by Pu(x) = uᵀx u.

(a) [3 points] Mean of data after projecting on u: Show that the projected data with samples Pu(x(i)) in any unit direction u is still centered. That is, show
(1/N) Σ_{i=1}^N Pu(x(i)) = 0.

Since uᵀx(i) is a scalar and u does not change with i,
(1/N) Σ_i Pu(x(i)) = (1/N) Σ_i (uᵀx(i)) u = ( uᵀ (1/N) Σ_i x(i) ) u = (uᵀ 0) u = 0,
because the data is centered.

(b) [4 points] Maximum variance: Recall that the first principal component u* is given by the largest eigenvector of the sample covariance (eigenvector associated to the largest eigenvalue). That is,
u* = argmax over u : ‖u‖2 = 1 of uᵀ ( (1/N) Σ_{i=1}^N x(i) [x(i)]ᵀ ) u.
Show that the unit direction u that maximizes the variance of the projected data corresponds to the first principal component of the data. That is, show
u* = argmax over u : ‖u‖2 = 1 of Σ_{i=1}^N ‖ Pu(x(i)) − (1/N) Σ_{j=1}^N Pu(x(j)) ‖2².

By part (a) the projected mean is 0, so the objective reduces to Σ_i ‖Pu(x(i))‖2². Since Pu(x(i)) = (uᵀx(i)) u with ‖u‖2 = 1,
‖Pu(x(i))‖2² = (uᵀx(i))² ‖u‖2² = uᵀ x(i) [x(i)]ᵀ u.
Summing over i,
Σ_i ‖Pu(x(i))‖2² = uᵀ ( Σ_i x(i) [x(i)]ᵀ ) u = N uᵀ S u,
where S is the sample covariance matrix. Maximizing the quadratic form uᵀSu over unit vectors u yields the eigenvector of S with the largest eigenvalue: writing u in the eigenbasis of S, the objective is Σ_k λ_k a_k² with Σ_k a_k² = 1, which is maximized by putting all the weight on the largest eigenvalue. This is exactly the first principal component u*.

(c) [4 points] Minimum error: Show that the unit direction u that minimizes the mean squared error between projected data points Pu(x(i)) and the original data points x(i) corresponds to the first principal component u*. That is show,
u* = argmin over u : ‖u‖2 = 1 of Σ_{i=1}^N ‖ x(i) − Pu(x(i)) ‖2².

The residual x(i) − Pu(x(i)) is orthogonal to the projection Pu(x(i)), so by the Pythagorean theorem
‖x(i) − Pu(x(i))‖2² = ‖x(i)‖2² − ‖Pu(x(i))‖2².
Summing over i, the term Σ_i ‖x(i)‖2² is a constant independent of u, so
argmin over u of Σ_i ‖x(i) − Pu(x(i))‖2² = argmax over u of Σ_i ‖Pu(x(i))‖2²,
which by part (b) is the first principal component u*.

12.
Probabilistic Models [15 points]

We would like to build a model that predicts a disease D in a patient. Assume we have two classes in our label set t: diseased and healthy. We have a set of binary patient features x = [x1, ..., xN] that describe the various comorbidities of our patient. We want to write down the joint distribution of our model.

(a) [4 points] What is the problem in calculating this joint distribution? How many parameters would be required to construct this model? How does the Naive Bayes model handle this problem?

The joint distribution p(x, t) over N binary features and a binary label has exponentially many entries: specifying it directly requires 2 · 2^N − 1 parameters, which is intractable to store and to learn from realistic amounts of data. Naive Bayes handles this by assuming the features are conditionally independent given the class, so the joint factorizes compactly into the class prior and per-feature conditionals.

(b) [3 points] Write down the joint distribution under the Naive Bayes assumption. How many parameters are specified under the model now?

p(t, x) = p(t) Π_{j=1}^N p(xj | t).
The model needs 1 parameter for the prior p(t) and 2N for the conditionals p(xj | t) (one per feature per class), for 2N + 1 in total — linear rather than exponential in N.

(c) [4 points] Why does the Naive Bayes assumption allow for the model parameters to be learned efficiently? Write down the log-likelihood to explain why. Here, assume that you observed N data points (x(i), t(i)) for i = 1, 2, ..., N.
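As a numerical aside (not part of the original solution), the decomposition that makes this learning efficient can be seen on a tiny made-up dataset, where every maximum-likelihood parameter reduces to a count:

```python
# Illustrative sketch: with the Naive Bayes factorization, maximizing the
# log-likelihood splits into independent counting problems, one per parameter.
# Toy data (made up): 4 patients, 2 binary features, binary label t.
data = [
    ([1, 0], 1),
    ([1, 1], 1),
    ([0, 1], 0),
    ([0, 0], 0),
]

n = len(data)
# MLE of the class prior P(t = 1) is just a count over labels.
prior_t1 = sum(t for _, t in data) / n

# MLE of each conditional P(x_j = 1 | t) is an independent count, because
# the log-likelihood is a sum of separate per-feature, per-class terms.
def conditional(j, t_val):
    rows = [x for x, t in data if t == t_val]
    return sum(x[j] for x in rows) / len(rows)

print(prior_t1)            # 0.5
print(conditional(0, 1))   # 1.0: both diseased patients have feature 0
print(conditional(1, 0))   # 0.5
```

No joint table over all 2^N feature configurations is ever built; each parameter is fit from its own slice of the data.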
Because the Naive Bayes assumption factorizes the joint distribution, the log-likelihood decomposes into a sum of terms:
ℓ = Σ_i log p(x(i), t(i)) = Σ_i [ log p(t(i)) + Σ_j log p(xj(i) | t(i)) ].
Each term involves a single parameter (the prior, or one feature's conditional for one class), so each parameter can be maximized independently, and each maximization reduces to simple counting. Learning is therefore efficient.

(d) [1 points] Show how you can use Bayes rule to predict a class given a data point.

By Bayes rule, p(t | x) = p(t) Π_j p(xj | t) / p(x) ∝ p(t) Π_j p(xj | t). To predict, choose the class that maximizes this posterior; the denominator p(x) is the same for both classes and can be ignored:
ŷ = argmax over t of p(t) Π_j p(xj | t).

(e) [3 points] Explain how placing a Beta prior on parameters is helpful for the Naive Bayes model.

The Beta distribution is the conjugate prior for the model's Bernoulli parameters: the posterior has the same Beta form, with the prior's a and b acting as pseudo-counts added to the observed counts. This smooths the estimates, preventing zero probabilities and overfitting when a feature value is rarely (or never) observed for a class.

13. K-means [12 points]

In the following plots of k-means clusters, circles with different colors represent points in different clusters, and stars represent cluster centers. For each plot, describe the problem with the data or algorithm, and propose a solution to find a more appropriate clustering.

(a) [2 points]

[Figure]

The algorithm has converged to a bad local optimum: one natural group of points has been split between the red and green stars while other groups are merged. Solution: rerun k-means from different random initializations and keep the run with the least cost, or use moves that simultaneously split the worst cluster and merge two nearby ones to escape the local optimum.
#".,% .%$+#( E*#+(5*3 (* 53 = $%( 2"(" ()% 53 %9"-/#% *1 3+-4%. 2%(%.-53% = X%2+0% & 75## @ .%$+#($ 0#+$(%.53, 25",3*$(50 (** 5$ 0#+$(%.$ /**. .+3 S 6 */(5-"# 3*( 6 B 6 0)%%@ 0#+$(%.$ B 53 & (d) [3 points] The k-means algorithm is designed to reduce the mean squared error between the cluster centers and their assigned points. Will increasing the value of k ever increase this error term? Why or why not? "#7";$ B 530.%"$% C* 6 EE! >)% *1 3+-4%. 25$(*.(5*3$ .%2+0%$ EEA 75()53 0#+$(%. %$$%3(5"##; 5$ J30.%"$53, .%2+0%$ "32 ()+$ 6 8".5"30% & "#7";$ "8%.",% (* 0#+$(%.$ 6 *12"(" 8".5"30% (e) [3 points] Suppose we achieve a very low error with a large value of k, and a higher error with a smaller value of k. Should we select the k value with the lowest error? Why or why not? Cn 6 ()"( #*7 *1 .+#% EEA %..*. 0#+$(%. (5## ()+-4 2%0.%"$%$ 4+( 6 = = (* = B /+./*$% *1 0#+$(%.53, 1#"((%3$ (* */(5-"# *8%. /*53( /%. ,.*+/ %#4*7 +$% "4.+/(#; "32 530.%"$53, *3% 5$ 0#+$(%. $5-5#". 26 53 -%()*2 *+( 8"#+% %9(.%-% 2"(" & N".,% 1+.()%. 0"$% 1"5#%2 & 1532 (* B ,+"."3(%%$ 25852%$ 7)%.% @ ()% ()% ()5$ & CSC2515 Fall 2020 Midterm Test 14. Expectation-Maximization [8 points] Here we are estimating a mixture of two Gaussians via the EM algorithm. The mixture distribution over x is given by P (x; ✓) = P (1)N (x; µ1 , 2 1) + P (2)N (x; µ2 , 2 2) Any student in this class could solve this estimation problem easily. Well, one student, devious as they were, scrambled the order of figures illustrating EM updates. They may have also slipped in a figure that does not belong. Your task is to extract the figures of successive updates and explain why your ordering makes sense from the point of view of how the EM algorithm works. All the figures plot P (1)N (x; µ1 , 12 ) as a function of x with a solid line and P (2)N (x; µ2 , 22 ) with a dashed line. 
(a) [2 points] (True/False) In the mixture model, we can identify the most likely posterior assignment, i.e., the j that maximizes P(j | x), by comparing the values of P(1) N(x; μ1, σ1²) and P(2) N(x; μ2, σ2²).

TRUE. The posterior is P(j | x) = P(j) N(x; μj, σj²) / P(x), and the normalizer P(x) is shared by both components, so the j that maximizes the posterior is exactly the one with the larger value of P(j) N(x; μj, σj²).

(b) [2 points] Assign two figures to the correct steps in the EM algorithm.
- Step 0: (a) initial mixture distribution
- Step 1: (c) after one EM-iteration

(c) [4 points] Briefly explain how the mixture you chose for "step 1" follows from the mixture you have in "step 0".

Starting from the step-0 initialization (in practice a good initialization, e.g., random assignment, lets each component take responsibility for part of the data), the E-step assigns each data point a responsibility for each Gaussian component using the posterior probability of that component under the current parameters. The M-step then updates each component's mixing probability, mean, and variance as responsibility-weighted averages over the data it is responsible for. As a result, the solid component's mean shifts toward the points for which it has high responsibility, moving closer to the left span of the data, and its standard deviation broadens to account for them, while the dashed component simultaneously moves toward and tightens around the points it is responsible for. This is exactly the change from the step-0 figure to the step-1 figure.

Bonus [+1 Point]

Guess the percentage grade that you will receive on this examination out of 100% (excluding this bonus question and any formatting penalty). You will receive 1 point if your guess is within 5% of your actual mark.

80%

Survey Questions

The following survey questions are completely optional, and will not impact your grade in any way. By answering these questions, you give permission for your data to be used in aggregate descriptive statistics. Results from these analyses will only be disseminated to students in the class.

A1. I have previously taken an "Introduction to machine learning" or equivalent course (at any university).
[x] Yes   [ ] No

A2. My home department for graduate studies is:
[ ] CS   [ ] ECE   [x] Other: MIE

A3. I would describe the overall difficulty of the course so far as:
[ ] Very easy   [ ] Easy   [ ] Average   [ ] Hard   [x] Very hard