Plan
1. Fixed point algorithms
2. The proximal operator
3. The primal-dual problem
4. The Alternating Direction Method of Multipliers

Fixed point algorithms

Recall (fixed point): Let $f : X \to \mathbb{R}$ be a function continuous on $X$. We say that $\alpha$ is a fixed point of $f$ if $f(\alpha) = \alpha$.

Recall (Fermat's rule): Let $f$ be a proper function and $x \in X$. We have the equivalence $x \in \operatorname{argmin} f \Leftrightarrow 0 \in \partial f(x)$. In other words, $\operatorname{argmin} f = \operatorname{zer} \partial f = \{x \in X : 0 \in \partial f(x)\}$.

Objective: we want to compute numerically a minimizer of a convex function $f : X \to \,]-\infty, +\infty]$. By Fermat's rule, one has to find a point $x$ such that $0 \in \partial f(x)$. In the case where $f$ is differentiable, this amounts to finding a zero of the gradient, $0 = \nabla f(x)$. A popular algorithm in this case is the gradient algorithm, which consists of generating a sequence $(x^k : k = 0, 1, \dots)$ defined by
$$x^{k+1} = x^k - \gamma \nabla f(x^k).$$
Writing $T(x) = x - \gamma \nabla f(x)$, the algorithm takes the form $x^{k+1} = T(x^k)$, and any fixed point of $T$ is a minimizer of $f$. Many optimization algorithms are written in this form, with $T$ a mapping chosen so that its fixed points are solutions of the problem. It is therefore necessary to exhibit conditions on $T$ that guarantee the convergence of the algorithm to a fixed point.

α-averaged applications

Notation: Let $L > 0$ and $R, T : X \to X$ be two applications. The image of $x$ by $R$ is denoted $R(x)$ or $Rx$. The composition $T \circ R$ is also denoted more compactly $TR$. The identity application ($I(x) = x$) is denoted $I$.

Definition 1: The application $R$ is said to be $L$-Lipschitz if $\forall (x, y) \in X^2$: $\|Rx - Ry\| \le L \|x - y\|$. If $L < 1$, we say that $R$ is a contraction, and if $L = 1$, $R$ is said to be non-expansive.

Definition 2: Let $\alpha \in [0, 1]$. The application $T$ is said to be $\alpha$-averaged if there exists a non-expansive application $R$ such that $T = \alpha R + (1 - \alpha) I$. A $\tfrac{1}{2}$-averaged application is said to be firmly non-expansive.

Proposition: Let $\alpha \in \,]0, 1[$. The following assertions are equivalent:
1. $T$ is $\alpha$-averaged.
2. $\forall (x, y) \in X^2$: $\|Tx - Ty\|^2 \le \|x - y\|^2 - \frac{1 - \alpha}{\alpha} \|(I - T)x - (I - T)y\|^2$.

Proposition: Let $T : X \to X$ be an application such that $\forall (x, y)$: $\langle Tx - Ty, x - y \rangle \ge \|Tx - Ty\|^2$. Then $T$ is firmly non-expansive.

Lemma: Let $T$ and $S$ be two applications $X \to X$, respectively $\alpha$-averaged and $\beta$-averaged with $0 < \alpha, \beta < 1$. Then there exists $\delta$, $0 < \delta < 1$, such that $TS$ is $\delta$-averaged.

Cocoercive functions

Definition: A function $B : X \to X$ is called $\mu$-cocoercive ($\mu > 0$) if $\mu B$ is firmly non-expansive. The definition means that $\forall (x, y) \in X^2$: $\langle Bx - By, x - y \rangle \ge \mu \|Bx - By\|^2$.
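To make these notions concrete, here is a minimal numerical sketch (an illustration added here, not taken from the slides; it assumes the quadratic $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, and all data are hypothetical). It checks the cocoercivity inequality for $B = \nabla f$ with $\mu = 1/L$, then runs the fixed-point iteration $x^{k+1} = T(x^k)$ with $T = I - \gamma B$, whose convergence is justified by the results that follow.

```python
# Sketch (assumed data): for f(x) = 0.5*||Ax - b||^2, the gradient
# B = grad f is mu-cocoercive with mu = 1/L, where L = ||A^T A||.
# We check <Bx - By, x - y> >= mu*||Bx - By||^2 at random points,
# then iterate the averaged map T = I - gamma*B.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def B(x):  # B = grad f for f(x) = 0.5*||Ax - b||^2
    return A.T @ (A @ x - b)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of B (spectral norm)
mu = 1.0 / L

x, y = rng.standard_normal(10), rng.standard_normal(10)
lhs = (B(x) - B(y)) @ (x - y)
rhs = mu * np.linalg.norm(B(x) - B(y))**2
assert lhs >= rhs - 1e-10        # cocoercivity inequality holds

gamma = 1.0 / L                  # 0 < gamma < 2*mu, so T is averaged
x = np.zeros(10)
for _ in range(500):             # fixed-point iteration x^{k+1} = T(x^k)
    x = x - gamma * B(x)
print(np.linalg.norm(B(x)))      # ~0: x is (near) a zero of B, i.e. a minimizer
```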
Proposition: Let $B$ be a $\mu$-cocoercive function and $0 < \gamma \le 2\mu$. Then $I - \gamma B$ is $\frac{\gamma}{2\mu}$-averaged.

Theorem (Krasnosel'skii-Mann): Let $0 < \alpha < 1$ and let $T$ be an $\alpha$-averaged application such that $\operatorname{Fix}(T) \neq \emptyset$. Any sequence $(x^k)$ satisfying the recursion $x^{k+1} = T(x^k)$ converges to a fixed point of $T$.

Corollary: Let $B$ be a $\mu$-cocoercive function such that $\operatorname{zer} B \neq \emptyset$, and let $0 < \gamma < 2\mu$. Then every sequence $(x^k)_k$ satisfying $x^{k+1} = x^k - \gamma B(x^k)$ converges to a point of $\operatorname{zer} B$.

Example: the gradient algorithm

Hypothesis: $f : X \to \,]-\infty, +\infty]$ is convex and differentiable on $X$, and $\nabla f$ is $L$-Lipschitz.

Theorem (Baillon-Haddad): Under the hypothesis, $\forall (x, y) \in X^2$: $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|^2$. In particular, $L^{-1} \nabla f$ is firmly non-expansive.

Theorem: We assume that the hypothesis is satisfied and that $\operatorname{argmin} f \neq \emptyset$. Let $0 < \gamma < \frac{2}{L}$. Any sequence $(x^k)$ satisfying the recursion $x^{k+1} = x^k - \gamma \nabla f(x^k)$ converges to a minimizer of $f$.

The proximal operator

Definition: Given a function $f : E \to \,]-\infty, +\infty]$, the proximal operator (proximal mapping) of $f$ is the operator given by
$$\operatorname{prox}_f(x) = \operatorname{argmin}_{y \in E} \{ f(y) + \tfrac{1}{2} \|y - x\|^2 \}.$$
The scaled proximal operator is defined by
$$\operatorname{prox}_{\gamma f}(x) = \operatorname{argmin}_{y \in E} \{ f(y) + \tfrac{1}{2\gamma} \|y - x\|^2 \}.$$

Remark 1: Given a vector $x \in E$, the set $\operatorname{argmin} \{ f + \tfrac{1}{2} \|\cdot - x\|^2 \}$ might be empty, a singleton, or contain multiple vectors; this is why we look at the case where the operator is well defined.

Proposition (*): Let $f \in \Gamma_0(E)$. Then:
- $p = \operatorname{prox}_f(x) \Leftrightarrow x \in p + \partial f(p)$;
- $\operatorname{prox}_f$ is well defined as an application from $E$ to $E$;
- for any $x, y \in E$: $\langle x - y, \operatorname{prox}_f(x) - \operatorname{prox}_f(y) \rangle \ge \|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|^2$.

Examples:
1. $\operatorname{prox}_c(x) = \operatorname{argmin}_{y \in E} \{ c + \tfrac{1}{2} \|y - x\|^2 \} = x$ (constant function);
2. $\operatorname{prox}_{\langle \cdot, a \rangle + b}(x) = \operatorname{argmin}_{y \in E} \{ \langle y, a \rangle + b + \tfrac{1}{2} \|y - x\|^2 \} = x - a$;
3. for $f = \tfrac{1}{2} \langle \cdot, A \, \cdot \rangle - \langle b, \cdot \rangle$: $\operatorname{prox}_f(x) = \operatorname{argmin}_{y \in E} \{ \tfrac{1}{2} \langle y, Ay \rangle - \langle b, y \rangle + \tfrac{1}{2} \|y - x\|^2 \} = (A + I)^{-1}(x + b)$.

Proposition: Let $f_1, \dots, f_n$ be functions of $\Gamma_0(E)$. For any $x = (x_1, \dots, x_n)$, set $f(x) = f_1(x_1) + f_2(x_2) + \dots + f_n(x_n)$. Then
$$\operatorname{prox}_f(x) = (\operatorname{prox}_{f_1}(x_1), \operatorname{prox}_{f_2}(x_2), \dots, \operatorname{prox}_{f_n}(x_n)).$$
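The closed forms in the examples above can be sanity-checked numerically. A small sketch (assumed data; it uses scipy.optimize.minimize as a brute-force numerical argmin, which is an implementation choice of this note, not part of the slides):

```python
# Sketch: check two closed-form proximal operators against a numerical
# argmin of y -> f(y) + 0.5*||y - x||^2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 5
x = rng.standard_normal(n)

# prox of the affine function f(y) = <y, a> + b  is  x - a
a, b0 = rng.standard_normal(n), 0.7
num = minimize(lambda y: y @ a + b0 + 0.5 * np.sum((y - x)**2), x).x
assert np.allclose(num, x - a, atol=1e-4)

# prox of the quadratic f(y) = 0.5*<y, Ay> - <b, y>  is  (A + I)^{-1}(x + b)
M = rng.standard_normal((n, n)); A = M.T @ M        # symmetric PSD matrix
b = rng.standard_normal(n)
num = minimize(lambda y: 0.5 * y @ A @ y - b @ y + 0.5 * np.sum((y - x)**2), x).x
assert np.allclose(num, np.linalg.solve(A + np.eye(n), x + b), atol=1e-4)
```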
Theorem: A point $x^* \in \mathbb{R}^n$ is a minimizer of a proper closed convex function $f$ if and only if $x^* = \operatorname{prox}_f(x^*)$.

The proximal gradient algorithm

In this part we are interested in two functions $f$ and $g$ such that $f$ is convex and differentiable with a Lipschitz gradient. According to the above, to minimize the sum $f + g$ is to look for an element $x^*$ such that $0 \in \partial(f + g)(x^*) = \nabla f(x^*) + \partial g(x^*)$, which is equivalent to writing $-\nabla f(x^*) \in \partial g(x^*)$, hence
$$x^* - \nabla f(x^*) \in x^* + \partial g(x^*).$$
By proposition (*) applied to $g$, this inclusion is equivalent to $x^* = \operatorname{prox}_g(x^* - \nabla f(x^*))$. We can extend this remark by observing that the minimizers of $f + g$ are exactly the minimizers of $\gamma f + \gamma g$ for all $\gamma > 0$. In other words we have:

Proposition: Let $f, g \in \Gamma_0(E)$ be two functions such that $f$ is differentiable with a Lipschitz gradient. Then $x^* \in \operatorname{argmin}(f + g)$ if and only if
$$x^* = \operatorname{prox}_{\gamma g}(x^* - \gamma \nabla f(x^*)).$$
This property suggests the following algorithm, called the proximal gradient algorithm:
$$x^{k+1} = \operatorname{prox}_{\gamma g}(x^k - \gamma \nabla f(x^k)).$$

Theorem: Let $f, g \in \Gamma_0(E)$ be two functions such that $f$ is differentiable and $\nabla f$ is $L$-Lipschitz. For $0 < \gamma < \frac{2}{L}$, any sequence $(x^k)$ satisfying the previous recursion converges to a minimizer of $f + g$.

Application

Let $C \subset E$ be a closed convex set. We are interested in the problem
$$\inf_{x \in C} f(x). \qquad (*)$$
We define the indicator function $i_C$ of the set $C$ by
$$i_C(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{otherwise.} \end{cases}$$
We verify immediately that $\operatorname{prox}_{i_C}(x) = P_C(x)$, so the proximal gradient algorithm is given by
$$x^{k+1} = P_C(x^k - \gamma \nabla f(x^k)),$$
because problem $(*)$ is equivalent to $\inf_{x \in E} f(x) + i_C(x)$. Under the hypotheses of the theorem, this algorithm converges to a minimizer of $f + i_C$, that is to say a minimizer of $f$ on $C$.

Iterative soft-thresholding

We take $E = \mathbb{R}^n$ and we are interested in the problem
$$\inf_{x \in E} \{ f(x) + \eta \|x\|_1 \},$$
where $\|x\|_1$ is the $\ell_1$ norm of the vector $x$, defined by $\|x\|_1 = |x_1| + |x_2| + \dots + |x_n|$ for all $x = (x_1, x_2, \dots, x_n)$.

Proposition: The function $\operatorname{prox}_{\eta |\cdot|}$ coincides with the function called soft-thresholding, defined for any $x \in \mathbb{R}$ by
$$S_\eta(x) = \begin{cases} x - \eta & \text{for } x > \eta \\ 0 & \text{for } x \in [-\eta, \eta] \\ x + \eta & \text{for } x < -\eta. \end{cases}$$

Proposition: In this case, the proximal gradient algorithm takes the form $x^{k+1} = \operatorname{prox}_{\gamma \eta \|\cdot\|_1}(x^k - \gamma \nabla f(x^k))$. Setting $y^k = x^k - \gamma \nabla f(x^k)$, this reads $x_i^{k+1} = S_{\gamma\eta}(y_i^k)$ for all $i = 1, \dots, n$. If $f$ is convex and differentiable and $\nabla f$ is $L$-Lipschitz, then for $0 < \gamma < \frac{2}{L}$ this iterative scheme converges to a minimizer of $f + \eta \|\cdot\|_1$.
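Here is a minimal sketch of the iterative soft-thresholding algorithm just described, applied to $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ (the sparse test data are an assumption for illustration, not from the slides):

```python
# Sketch: proximal gradient (ISTA) for f(x) + eta*||x||_1
# with f(x) = 0.5*||Ax - b||^2 and assumed random data.
import numpy as np

def soft_threshold(y, t):
    # S_t(y), applied componentwise: the prox of t*||.||_1
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = rng.standard_normal(5)   # sparse signal
b = A @ x_true

eta = 0.1
L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f
gamma = 1.0 / L                  # step size in (0, 2/L)

x = np.zeros(100)
for _ in range(2000):
    y = x - gamma * A.T @ (A @ x - b)   # gradient step on f
    x = soft_threshold(y, gamma * eta)  # prox step on eta*||.||_1
print("nonzeros:", np.count_nonzero(np.round(x, 4)))
```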
Monotone operators

Remark: In the previous sections we saw that the proximal operator $\operatorname{prox}_f$ of a function $f \in \Gamma_0(X)$ maps $x$ to the point $p$ that satisfies $x \in p + \partial f(p)$. Proposition (*) ensures that this inclusion well defines a single point $p$, and that the resulting application $\operatorname{prox}_f$ is firmly non-expansive. In re-reading the proof of proposition (*), we see that this result comes from the following property of the sub-differential, called monotonicity:
$$\forall (u, v) \in \partial f(x) \times \partial f(y): \quad \langle u - v, x - y \rangle \ge 0.$$
The purpose of this paragraph is to extend the proposition to applications $A : X \to \mathcal{P}(X)$ that are not necessarily written as sub-differentials, but still satisfy the monotonicity property. For such applications, we are able to extend the notion of proximal operator.

Definition: The operator $A : X \to \mathcal{P}(X)$ is monotone if the following holds for all $(x, y) \in X^2$:
$$\forall (u, v) \in A(x) \times A(y): \quad \langle u - v, x - y \rangle \ge 0.$$

Proposition: For any $A : X \to \mathcal{P}(X)$, we denote by $A^{-1}$ the application that associates with each $x \in X$ the set
$$A^{-1}(x) = \{ y \in X : x \in A(y) \}.$$
In other words, we have the equivalence $y \in A^{-1}(x) \Leftrightarrow x \in A(y)$.

Operations on sub-differentials

Proposition: Let $f : X \to \,]-\infty, +\infty]$ and $g : Y \to \,]-\infty, +\infty]$ be convex functions and $M \in \mathcal{L}(X, Y)$. We assume that
$$0 \in \operatorname{ri}(M \operatorname{dom} f - \operatorname{dom} g).$$
Then
$$\partial(f + g \circ M)(x) = \partial f(x) + M^* \partial g(Mx).$$

The primal-dual problem

Position of the problem: Let $X$ and $Y$ be two Euclidean spaces and let $M : X \to Y$ be a linear operator. Given two real convex functions $f$ on $X$ and $g$ on $Y$, we consider the minimization problem
$$\inf_{x \in X} f(x) + g(Mx). \qquad (5)$$
A minimizer of (5) is called an optimal primal point. Under the hypotheses of the preceding proposition, finding an optimal primal point is equivalent to finding $x$ such that $0 \in \partial f(x) + M^* \partial g(Mx)$.

By definition, $x$ is optimal primal if and only if there exists $\lambda \in \partial g(Mx)$ such that $0 \in \partial f(x) + M^* \lambda$. By a simple rewriting, the condition $\lambda \in \partial g(Mx)$ amounts to $Mx \in (\partial g)^{-1}(\lambda)$. Finally, in order to find an optimal primal point $x$, it suffices to find a couple $(x, \lambda) \in X \times Y$ such that
$$0 \in \partial f(x) + M^* \lambda, \qquad 0 \in -Mx + (\partial g)^{-1}(\lambda). \qquad (6)$$
The problem is rewritten
$$0 \in \partial f(x) + M^* \lambda, \qquad 0 \in -Mx + \partial g^*(\lambda), \qquad (7)$$
and is sometimes called the primal-dual problem.

Alternating direction method of multipliers

Dual ascent: Consider the problem
$$\min_x f(x) \quad \text{subject to} \quad Ax = b,$$
where $f$ is strictly convex and closed. Denote the Lagrangian
$$L(x, u) = f(x) + u^T (Ax - b).$$
Dual gradient ascent repeats, for $k = 1, 2, 3, \dots$:
$$x^{(k)} = \operatorname{argmin}_x L(x, u^{(k-1)}), \qquad u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b).$$

The augmented Lagrangian method considers the modified problem, for a parameter $\rho > 0$,
$$\min_x f(x) + \tfrac{\rho}{2} \|Ax - b\|_2^2 \quad \text{subject to} \quad Ax = b,$$
uses the modified Lagrangian
$$L_\rho(x, u) = f(x) + u^T (Ax - b) + \tfrac{\rho}{2} \|Ax - b\|_2^2,$$
and repeats, for $k = 1, 2, 3, \dots$:
$$x^{(k)} = \operatorname{argmin}_x L_\rho(x, u^{(k-1)}), \qquad u^{(k)} = u^{(k-1)} + \rho (A x^{(k)} - b).$$
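Before moving on to ADMM, here is a small sketch of the augmented Lagrangian method above on an assumed equality-constrained quadratic problem (data, dimensions, and the choice of objective are illustrative only); the $x$-update is available in closed form in this case:

```python
# Sketch: augmented Lagrangian method for
#   min 0.5*||x||^2 + c^T x   subject to   Ax = b
# The x-update argmin_x L_rho(x, u) is solved in closed form.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 8))
b = rng.standard_normal(3)
c = rng.standard_normal(8)
rho = 1.0

u = np.zeros(3)
for _ in range(100):
    # First-order condition of the x-update:
    # x + c + A^T u + rho*A^T (Ax - b) = 0
    # i.e. (I + rho*A^T A) x = -c - A^T u + rho*A^T b
    x = np.linalg.solve(np.eye(8) + rho * A.T @ A, -c - A.T @ u + rho * A.T @ b)
    u = u + rho * (A @ x - b)        # dual update
print(np.linalg.norm(A @ x - b))     # ~0: constraint satisfied at convergence
```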
The alternating direction method of multipliers, or ADMM, combines the best of both methods. Consider a problem of the form
$$\min_{x, z} f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c.$$
We define the augmented Lagrangian, for a parameter $\rho > 0$,
$$L_\rho(x, z, u) = f(x) + g(z) + u^T (Ax + Bz - c) + \tfrac{\rho}{2} \|Ax + Bz - c\|_2^2,$$
and we repeat, for $k = 1, 2, 3, \dots$:
$$x^{(k)} = \operatorname{argmin}_x L_\rho(x, z^{(k-1)}, u^{(k-1)}),$$
$$z^{(k)} = \operatorname{argmin}_z L_\rho(x^{(k)}, z, u^{(k-1)}),$$
$$u^{(k)} = u^{(k-1)} + \rho (A x^{(k)} + B z^{(k)} - c).$$

Connection to proximal operators: Consider
$$\min_x f(x) + g(x) \quad \Leftrightarrow \quad \min_{x, z} f(x) + g(z) \ \text{ subject to } \ x = z.$$
In terms of the scaled dual variable $w = u / \rho$, the ADMM steps are
$$x^{(k)} = \operatorname{prox}_{f, 1/\rho}(z^{(k-1)} - w^{(k-1)}),$$
$$z^{(k)} = \operatorname{prox}_{g, 1/\rho}(x^{(k)} + w^{(k-1)}),$$
$$w^{(k)} = w^{(k-1)} + x^{(k)} - z^{(k)},$$
where $\operatorname{prox}_{f, 1/\rho}$ is the proximal operator for $f$ at parameter $1/\rho$, and similarly for $\operatorname{prox}_{g, 1/\rho}$. In general, the update for a block of variables reduces to a prox update whenever the corresponding linear transformation is the identity.
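As an illustration of the prox-form ADMM steps above, here is a minimal sketch for the splitting $f(x) = \tfrac{1}{2}\|Ax - b\|^2$, $g(x) = \eta \|x\|_1$ (a lasso-type problem; the data are an assumption of this note, and both prox operators happen to be closed-form):

```python
# Sketch: prox-form ADMM for min_x f(x) + g(x), with
# f(x) = 0.5*||Ax - b||^2 and g(x) = eta*||x||_1.
import numpy as np

rng = np.random.default_rng(4)
n = 100
A = rng.standard_normal((50, n))
b = A @ np.concatenate([rng.standard_normal(5), np.zeros(n - 5)])
eta, rho = 0.1, 1.0

def prox_f(v):
    # prox_{f,1/rho}(v) = argmin_x f(x) + (rho/2)*||x - v||^2
    return np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b + rho * v)

def prox_g(v):
    # prox_{g,1/rho}(v): soft-thresholding at level eta/rho
    return np.sign(v) * np.maximum(np.abs(v) - eta / rho, 0.0)

x = z = w = np.zeros(n)
for _ in range(200):
    x = prox_f(z - w)
    z = prox_g(x + w)
    w = w + x - z            # scaled dual update
print("nonzeros:", np.count_nonzero(np.round(z, 4)))
```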