Course Notes: Convex Optimization

The following notes are heavily based on material developed by Professor Stephen Boyd for EE364 taught at Stanford University. All typos are my own.

Optimization Problems

Our goal is to solve problems of the following form: \[\begin{equation}\begin{aligned}\text{minimize } &f(x) \\ \text{subject to } &g_i(x) \leq 0, i = 1,\dots, m \\ &h_j(x) =0, j = 1, \dots, n\end{aligned}\end{equation}\tag{OptPrb}\] where $x \in \mathbb{R}^n$ is the optimization variable, $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is the objective function, the $g_i: \mathbb{R}^n \rightarrow \mathbb{R}$ are the inequality-constraint functions, and the $h_j: \mathbb{R}^n \rightarrow \mathbb{R}$ are the equality-constraint functions.

The $g_i$ and $h_j$ are the explicit constraints. However, there is also the implicit constraint that $x$ must be within the intersection of the domains of $f, g_1, \dots, g_m, h_1, \dots, h_n$, i.e., \[\mathcal{D} := \text{dom}(f) \cap \bigcap_{i=1}^{m}\text{dom}(g_i) \cap \bigcap_{j=1}^{n}\text{dom}(h_j).\]
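
As a quick illustration, the problem of finding the point on the line $x_1 + x_2 = 1$ in $\mathbb{R}^2$ that is closest to the origin while remaining in the non-negative orthant can be written in the form (OptPrb) as \[\begin{aligned}\text{minimize } &x_1^2 + x_2^2 \\ \text{subject to } &-x_1 \leq 0, \quad -x_2 \leq 0 \\ &x_1 + x_2 - 1 = 0,\end{aligned}\] i.e., with $f(x) = x_1^2 + x_2^2$, $g_1(x) = -x_1$, $g_2(x) = -x_2$, $h_1(x) = x_1 + x_2 - 1$, and $\mathcal{D} = \mathbb{R}^2$.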

Def. Feasible Point: A point $x \in \mathcal{D}$ is feasible for (OptPrb) if it also satisfies the explicit constraints, i.e., if \[g_i(x) \leq 0, i = 1, \dots, m,\] and \[h_j(x) = 0, j = 1, \dots, n.\]

We will use $\mathcal{X}$ to denote the set of feasible points. Note that $\mathcal{X} \subseteq \mathcal{D}$.

Def. Feasible Direction: Given a feasible point, $x \in \mathcal{X}$, for (OptPrb), a direction $v \in \mathbb{R}^n$ with $\|v\|_2 = 1$ is feasible if $\exists T > 0$ for which $x + tv \in \mathcal{X}$ for all $0 \leq t \leq T$.

We will use $\mathcal{V}(x)$ to denote the set of feasible directions from a feasible point, $x$.

Def. Globally-Optimal Point: A feasible point, $x^* \in \mathcal{X}$, is globally optimal for (OptPrb) if \[f(x^*) = \underset{x \in \mathcal{X}}{\text{inf}}\big\lbrace f(x) \big\rbrace.\]

Def. Locally-Optimal Point: A feasible point, $x^* \in \mathcal{X}$, is locally optimal for (OptPrb) if there exists an $R > 0$ such that $x^*$ is globally optimal for the optimization problem: \[\begin{aligned}\text{minimize } &f(x) \\ \text{subject to } &g_i(x) \leq 0, i = 1,\dots, m \\ &h_j(x) =0, j = 1, \dots, n \\ & \|x- x^*\|_2 \leq R. \end{aligned}\]

A locally optimal point need not be globally optimal in general.
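
For example, consider minimizing $f(x) = x^3 - 3x$ subject to $-3 \leq x \leq 3$. The point $x = 1$ is locally optimal (it minimizes $f$ over the feasible points with $\|x - 1\|_2 \leq 1$), but it is not globally optimal, since $f(1) = -2$ while $f(-3) = -18$.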

Definitions and Theorems from Vector Calculus

Looking at (OptPrb), we need to optimize (minimize) a multi-variable scalar function, i.e., a function of the form $f: \mathbb{R}^n \rightarrow \mathbb{R}$, where $x \mapsto f(x)$. We will use the notation $\text{dom}(f)$ and $\text{cdm}(f)$ to respectively denote the domain and co-domain of $f$. Note that $\text{dom}(f) \subseteq \mathbb{R}^n$ and $\text{cdm}(f) \subseteq \mathbb{R}$.

Def. Directional Derivative: Let $v \in \mathbb{R}^n, \|v\|_2 = 1$. The $v$-directional derivative of a function, $f: \mathbb{R}^n \rightarrow \mathbb{R}$, is a function, denoted $D_{v}f: \mathbb{R}^n \rightarrow \mathbb{R}$, where $D_vf(z)$ gives the rate of change of $f$ at $z$ in the direction $v$, and is defined point-wise as \[D_{v}f(z) := \lim_{t \rightarrow 0}\frac{f(z + tv) - f(z)}{t}.\] If $v = e^{(i)}$, the $i$-th standard basis vector, we call $D_{v}f$ the partial derivative of $f$ w.r.t. $x_i$, and denote it as $\partial f / \partial x_i$.
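
For example, let $f(x) = x_1x_2$, $z = (1,2)$, and $v = \left(1/\sqrt{2}, 1/\sqrt{2}\right)$. Then \[D_vf(z) = \lim_{t \rightarrow 0}\frac{\left(1 + t/\sqrt{2}\right)\left(2 + t/\sqrt{2}\right) - 2}{t} = \lim_{t \rightarrow 0}\left(\frac{3}{\sqrt{2}} + \frac{t}{2}\right) = \frac{3}{\sqrt{2}},\] which, as shown below, is exactly $\nabla f^{\top}(z)v$.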

Using the aforementioned definition to compute derivatives is difficult. However, there are a few theorems that can help:

Thm. Chain Rule: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$, $g: \mathbb{R} \rightarrow \mathbb{R}^n$ and define $h := f \circ g$. Then, \[\frac{dh}{dt} = \sum_{i=1}^{n}\frac{\partial h}{\partial g_i}\frac{dg_i}{dt}.\]

It turns out that we can compute the directional derivative of $f$ in any direction, $v$, using just its partial derivatives.

Thm. Computing Directional Derivatives from Partial Derivatives: If $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then for any $z$, \[D_vf(z) = \nabla f^{\top}(z)v\] where \[\nabla f(z)_i := \left.\frac{\partial f}{\partial x_i}\right\vert_{x=z}.\]

See proof.

Define the function $g(t) = z + tv$ and let $h := f \circ g$. Then, \[\begin{aligned} D_vf(z) = \lim_{t \rightarrow 0}\frac{f(z+tv) - f(z)}{t} &= \lim_{t\rightarrow 0}\frac{f\left(g(t)\right) - f\left(g(0)\right)}{t} \\ &= \lim_{t\rightarrow 0}\frac{h(t) - h(0)}{t} \\ &= \left.\frac{dh}{dt}\right\vert_{t=0}. \end{aligned}\]

Using the chain rule, we have \[\begin{aligned} \left.\frac{dh}{dt}\right\vert_{t=0} &= \sum_{i=1}^{n}\left.\frac{\partial h}{\partial g_i}\right\vert_{t=0}\left.\frac{dg_i}{dt}\right\vert_{t=0} \\ &= \sum_{i=1}^{n}\left.\frac{\partial h}{\partial g_i}\right\vert_{t=0}v_i \\ &= \sum_{i=1}^{n}\left.\frac{\partial f}{\partial x_i}\right\vert_{x=z}v_i \\ &= \nabla f^{\top}(z)v. \end{aligned}\]
$\blacksquare$


We call $\nabla f(z)$ the gradient of $f$ at $z$. We say $f$ is differentiable at $z$ if $\nabla f(z)$ exists.

A vector function, $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ can be interpreted as $m$ scalar functions, $f_1, \dots, f_m: \mathbb{R}^n \rightarrow \mathbb{R}$. Thus, we define its directional derivatives and gradient accordingly, i.e., \[D_vf = \begin{bmatrix} D_vf_1 \\ \vdots \\ D_vf_m \end{bmatrix} \text{ and } \nabla f^{\top} = \begin{bmatrix} \nabla f^{\top}_1 \\ \vdots \\ \nabla f^{\top}_m \end{bmatrix}.\]
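
For example, if $f(x) = Ax$ for some $A \in \mathbb{R}^{m \times n}$, then $f_i(x) = \sum_{k=1}^{n}A_{ik}x_k$, so $\nabla f_i^{\top}$ is the $i$-th row of $A$; stacking these rows gives $\nabla f^{\top} = A$, and consequently $D_vf = Av$.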

We can use the aforementioned definitions to compute higher-order derivatives of a scalar function, $f: \mathbb{R}^n \rightarrow \mathbb{R}$. For example, the 2nd derivative of $f$ is the gradient of the gradient of $f$, i.e., \[\nabla^2 f = \nabla\left(\nabla f\right) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \dots & \frac{\partial^2 f}{\partial x_n^2}\end{bmatrix}.\]
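
As a concrete example, for a quadratic function $f(x) = \frac{1}{2}x^{\top}Px + q^{\top}x + r$ with $P$ symmetric, we have \[\nabla f(x) = Px + q \quad \text{ and } \quad \nabla^2 f(x) = P,\] i.e., the Hessian is constant in $x$.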

Thm. Computing Higher-Order Directional Derivatives from Partial Derivatives: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$. Then, $D_{u}\left(D_vf\right) = u^{\top}\nabla ^2 fv$, i.e., $u^{\top}\nabla ^2fv$ is the $u$-directional derivative of the $v$-directional derivative of $f$.

See proof.

We have \[\begin{aligned} u^{\top}\nabla^2 f(z)v &= \sum_{i=1}^{n}\sum_{j=1}^{n}u_iv_j\frac{\partial^2 f}{\partial x_i \partial x_j} \\ &= \sum_{i=1}^{n}u_i\frac{\partial}{\partial x_i}\left(\sum_{j=1}^{n}v_j\frac{\partial f}{\partial x_j}\right) \\ &= \sum_{i=1}^{n}u_i\frac{\partial}{\partial x_i}\left(\nabla f^{\top}v\right) \\ &= \nabla \left(\nabla f^{\top}v\right)^{\top}u \\ &= D_{u}\left(D_{v}f\right). \end{aligned}\]
$\blacksquare$


We call $\nabla^2f(z)$ the Hessian of $f$ at $z$. We say $f$ is twice-differentiable at $z$ if $\nabla^2f(z)$ exists.

Conditions for Optimality

We now use the aforementioned definitions to derive a few conditions under which a point, $x^*$, is locally optimal.

Thm. 1st-Order Necessary Condition for Optimality: If $x^*$ is a local minimizer of (OptPrb) and $f$ is differentiable at $x^*$, then, \[\nabla f^{\top}(x^*)v \geq 0, \forall v \in \mathcal{V}(x^*).\]

See proof.

If $f$ is differentiable at $x^*$, then \[\lim_{t\rightarrow 0^+}\frac{f\left(x^*+tv\right) - f(x^*)}{t} = \lim_{t\rightarrow 0}\frac{f\left(x^*+tv\right) - f(x^*)}{t} = D_vf\left(x^*\right) = \nabla f^{\top}(x^*)v.\tag{1}\] Since $x^*$ is a local minimizer, $\exists R > 0$ such that $f(z) \geq f(x^*)$ for every feasible $z$ with $\|z - x^*\|_2 \leq R$.

Without loss of generality, we may express any such $z$ as $z = x^* + tv$, where $v \in \mathcal{V}(x^*)$ and $t \geq 0$. Thus, the condition $\|z-x^*\|_2 \leq R$ can be written as $\|tv\|_2 \leq R$, or simply $0 \leq t \leq R$, since $\|v\|_2 = 1$.

Thus, for all $0 \leq t \leq R$ for which $x^* + tv$ is feasible, we have $f\left(x^*+tv\right) \geq f(x^*)$, which means that \[\lim_{t\rightarrow 0^+}\frac{f\left(x^*+tv\right) - f(x^*)}{t} \geq 0. \tag{2}\] Combining (1) with (2) yields the desired result. $\blacksquare$

Cor. 1st-Order Necessary Condition for Optimality: If $x^*$ is a local minimizer of (OptPrb) and an interior point of $\mathcal{X}$, and $f$ is differentiable at $x^*$, then, \[\nabla f(x^*) = 0.\]

See proof.

We already know that for any feasible direction $v \in \mathcal{V}(x^*)$, we require \[\nabla f^{\top}(x^*)v \geq 0.\] Since $x^*$ is an interior point of $\mathcal{X}$, every direction is feasible. In particular, if $v$ is a feasible direction, then so too is $-v$. Thus, we also have \[-\nabla f^{\top}(x^*)v \geq 0 \Leftrightarrow \nabla f^{\top}(x^*)v \leq 0.\] It follows that $\nabla f(x^*) = 0$. $\blacksquare$
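
For example, consider minimizing $f(x) = (x-2)^2$ over $\mathcal{X} = [0,1]$. The minimizer is $x^* = 1$, which lies on the boundary of $\mathcal{X}$; the only feasible direction there is $v = -1$, and indeed $\nabla f(x^*)v = (-2)(-1) = 2 \geq 0$, even though $\nabla f(x^*) \neq 0$. If instead $\mathcal{X} = [0,3]$, the minimizer $x^* = 2$ is an interior point and $\nabla f(x^*) = 0$.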

Duality in Optimization

The conditions derived thus far are fairly weak. However, we can derive stronger conditions in some cases. The key idea is to turn a constrained problem into an unconstrained problem by moving the constraints into the objective. In particular, we consider the problem: \[\text{minimize } f(x) + P(x),\] where \[P(x) = \begin{cases}0 &x \in \mathcal{X} \\ \infty & \text{otherwise.}\end{cases}\] It turns out that \[P(x) = \sup_{\substack{\lambda \geq 0 \\ \mu}}\left\lbrace \sum_{i=1}^{m}\lambda_ig_i(x) + \sum_{j=1}^{n}\mu_jh_j(x)\right\rbrace,\] where $\lambda \in \mathbb{R}^m_{+}$ and $\mu \in \mathbb{R}^{n}$. To see this, we consider two cases:

  • If $x \in \mathcal{X}$, then $g_i(x) \leq 0$ and $h_j(x) = 0$ for every $i$ and $j$, so each term $\lambda_ig_i(x)$ is non-positive and each term $\mu_jh_j(x)$ is zero; the supremum is therefore $0$, attained at $\lambda = 0$.
  • If $x \notin \mathcal{X}$, then either $g_i(x) > 0$ for some $i$, in which case letting $\lambda_i \rightarrow \infty$ drives the expression to $\infty$, or $h_j(x) \neq 0$ for some $j$, in which case letting $\mu_j \rightarrow \pm\infty$ (with the appropriate sign) does the same.

Thus, the solution to (OptPrb) can be written as \[\inf_{x}\sup_{\substack{\lambda \geq 0 \\ \mu}}\left\lbrace f(x) + \sum_{i=1}^{m}\lambda_ig_i(x) + \sum_{j=1}^{n}\mu_jh_j(x) \right\rbrace.\] We define the quantity \[L(x,\lambda,\mu) := f(x) + \sum_{i=1}^{m}\lambda_ig_i(x) + \sum_{j=1}^{n}\mu_jh_j(x),\] as the Lagrangian of (OptPrb). It follows that \[\sup_{\substack{\lambda \geq 0 \\ \mu}}L(x,\lambda,\mu) = \begin{cases} f(x) & x \in \mathcal{X} \\ \infty & \text{ otherwise.} \end{cases}\] Now suppose we define \[\zeta(\lambda,\mu) = \inf_{x}L(x,\lambda,\mu).\] Then, we have \[\inf_{x \in \mathcal{X}}f(x) = \inf_{x}\sup_{\substack{\lambda \geq 0 \\ \mu}}L(x,\lambda,\mu) \geq \sup_{\substack{\lambda \geq 0 \\ \mu}}\inf_{x}L(x,\lambda,\mu) = \sup_{\substack{\lambda \geq 0 \\ \mu}}\zeta(\lambda,\mu).\] We now have two related problems:

  • the primal problem, $\inf_{x}\sup_{\lambda \geq 0, \mu}L(x,\lambda,\mu)$, whose optimal value is the primal optimum, $f(x^*) = \inf_{x \in \mathcal{X}}f(x)$; and
  • the dual problem, $\sup_{\lambda \geq 0, \mu}\zeta(\lambda,\mu)$, whose optimal value is the dual optimum, $\zeta(\lambda^*,\mu^*)$.

The duality gap is $f(x^*) - \zeta(\lambda^*, \mu^*)$, which is always non-negative as per the previous argument. If the duality gap is zero, i.e., $f(x^*) = \zeta(\lambda^*, \mu^*)$, then we say that the duality is strong; otherwise, we say it is weak. Assuming the duality is strong, we have that \[\inf_{x \in \mathcal{X}}f(x) = \sup_{\substack{\lambda \geq 0 \\ \mu}}\inf_{x}L(x,\lambda,\mu) = f(x^*),\] which implies that $\lambda_i^*g_i(x^*) = 0, \forall i$.

See proof.

We have \[\begin{aligned} f(x^*) = \zeta(\lambda^*, \mu^*) &= \inf_{x}L(x,\lambda^*,\mu^*) \\ &\leq L(x^*,\lambda^*,\mu^*) \\ &= f(x^*) + \sum_{i=1}^{m}\lambda_i^*g_i(x^*) + \sum_{j=1}^{n}\mu_j^*h_j(x^*) \\ &= f(x^*) + \sum_{i=1}^{m}\lambda_i^*g_i(x^*) \\ &\leq f(x^*), \end{aligned}\] where the last inequality holds because $\lambda_i^* \geq 0$ and $g_i(x^*) \leq 0$. Thus, all of the inequalities above hold with equality, and in particular $\sum_{i=1}^{m}\lambda_i^*g_i(x^*) = 0$; since every term in this sum is non-positive, each term must individually be zero, i.e., $\lambda_i^*g_i(x^*) = 0, \forall i$. $\blacksquare$

Putting everything together, we have the following conditions:

Karush–Kuhn–Tucker Conditions: Assuming strong duality holds, if $x^*$ is a primal optimizer and $(\lambda^*, \mu^*)$ is a dual optimizer, then the following conditions hold (a worked example follows the list):

  • primal feasibility: \[g(x^*) \leq 0 \text{ and } h(x^*) = 0\]
  • dual feasibility: \[\lambda_i^* \geq 0, i = 1, \dots, m\]
  • stationarity: \[\nabla_x L(x^*,\lambda^*, \mu^*) = 0\]
  • complementary slackness: \[\lambda_i^*g_i(x^*) = 0, i = 1, \dots, m\]
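
As a concrete worked example (not taken from the course material), consider minimizing $f(x) = x^2$ subject to $g(x) = 1 - x \leq 0$. The Lagrangian and dual function are \[L(x,\lambda) = x^2 + \lambda(1-x) \quad \text{ and } \quad \zeta(\lambda) = \inf_{x}L(x,\lambda) = \lambda - \frac{\lambda^2}{4},\] where the infimum is attained at $x = \lambda/2$. Maximizing $\zeta$ over $\lambda \geq 0$ gives $\lambda^* = 2$ and $\zeta(\lambda^*) = 1$, which equals the primal optimal value $f(x^*) = 1$ at $x^* = 1$, so strong duality holds. The KKT conditions can be verified directly: $g(x^*) = 0$ (primal feasibility), $\lambda^* = 2 \geq 0$ (dual feasibility), $\nabla_x L(x^*, \lambda^*) = 2x^* - \lambda^* = 0$ (stationarity), and $\lambda^*g(x^*) = 0$ (complementary slackness).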

We still need to understand the conditions under which strong duality holds. For this, it can be helpful to develop a geometric intuition of duality. To do this, writing the equality constraints as $h(x) = Ax - b$, we define the set \[\mathcal{S} = \lbrace (v,u,t) \text{ s.t. } \exists x \in \mathcal{D}: g(x) \leq v, Ax-b = u, f(x) \leq t \rbrace.\] It is easy to see that $(0,0,f^*) \in \mathcal{S}$. Thus, we can frame the original problem in terms of $\mathcal{S}$: \[\begin{equation}\begin{aligned}\text{minimize } &t \\ \text{subject to } &\begin{bmatrix} 0 & 0 & t \end{bmatrix}^{\top} \in \mathcal{S}.\end{aligned}\end{equation}\] Since every $x \in \mathcal{D}$ corresponds to at least one $(v,u,t) \in \mathcal{S}$, for $\lambda \geq 0$ we may write \[\begin{aligned} \zeta(\lambda,\mu) = \inf_{x}\left\lbrace f(x) + \lambda^{\top}g(x) + \mu^{\top}h(x) \right\rbrace &= \inf_{(v,u,t) \in \mathcal{S}}\left\lbrace t + \lambda^{\top}v + \mu^{\top}u \right\rbrace \\ &= \inf_{(v,u,t) \in \mathcal{S}}\begin{bmatrix} \lambda^{\top} & \mu^{\top} & 1 \end{bmatrix}\begin{bmatrix} v \\ u \\ t \end{bmatrix}. \end{aligned}\]

Since $(0,0,f^*) \in \mathcal{S}$, it follows that \[\zeta(\lambda, \mu) \leq \begin{bmatrix} \lambda^{\top} & \mu^{\top} & 1 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ f^* \end{bmatrix} = f^*.\] From a geometric perspective, for a fixed $k$, the set \[\left\lbrace (v,u,t) \text{ s.t. } \begin{bmatrix} \lambda^{\top} & \mu^{\top} & 1 \end{bmatrix}\begin{bmatrix} v \\ u \\ t \end{bmatrix} = k\right\rbrace\] represents a hyper-plane normal to $\begin{bmatrix} \lambda^{\top} & \mu^{\top} & 1 \end{bmatrix}^{\top}$ whose $t$-intercept is $k$. Moreover, $\zeta(\lambda,\mu)$ is the smallest $t$-intercept of those hyper-planes that intersect $\mathcal{S}$. Thus, the dual optimal value, $\sup_{\lambda \geq 0, \mu}\zeta(\lambda,\mu)$, is the largest $t$-intercept over all hyper-planes of this form that support $\mathcal{S}$, i.e., that touch $\mathcal{S}$ without cutting through it. (Figure: an example of $\mathcal{S}$ when $n = 1$ and $m = 0$.)

Convex Optimization Problems

Our only hope of solving (OptPrb) is to make some assumptions on $f$, $g_1,\dots,g_m$ and $h_1,\dots,h_n$. To state these assumptions, we will require a few definitions.

Convex Sets

Def. Convex Set: A convex combination of the points $x_1, \dots, x_n$ is of the form \[\sum_{i=1}^{n}\theta_ix_i\] where $\theta_i \geq 0$, and $\theta_1 + \dots + \theta_n = 1$. A set is convex if it contains all convex combinations of its points. The convex hull of a (possibly non-convex) set, $S$, denoted $\text{conv}(S)$, is the set of all convex combinations of its points; $\text{conv}(S) \supseteq S$ and for any convex set $A \supseteq S$, we have $A \supseteq \text{conv}(S)$.
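
For example, the convex hull of two points, $\text{conv}\left(\lbrace x_1, x_2 \rbrace\right) = \lbrace \theta x_1 + (1-\theta)x_2 \text{ s.t. } \theta \in [0,1] \rbrace$, is the line segment joining them, and the convex hull of three non-collinear points in $\mathbb{R}^2$ is the filled triangle with those points as vertices.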

Examples of Convex Sets

We now consider some examples of convex sets:

Def. Linear Set: A linear combination of the points $x_1, \dots, x_n$ is of the form \[\sum_{i=1}^{n}\theta_ix_i.\] A set is linear if it contains all linear combinations of its points. The linear hull of a (possibly non-linear) set, $S$, denoted $\text{lin}(S)$, is the set of all linear combinations of its points; $\text{lin}(S) \supseteq S$ and for any linear set $A \supseteq S$, we have $A \supseteq \text{lin}(S)$.

Def. Affine Set: An affine combination of the points $x_1, \dots, x_n$ is of the form \[\sum_{i=1}^{n}\theta_ix_i,\] where $\theta_1 + \dots + \theta_n = 1$.

A set is affine if it contains all affine combinations of its points. The affine hull of a (possibly non-affine) set, $S$, denoted $\text{aff}(S)$, is the set of all affine combinations of its points; $\text{aff}(S) \supseteq S$ and for any affine set $A \supseteq S$, we have $A \supseteq \text{aff}(S)$.
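
For example, the solution set of a linear system, $\lbrace x \text{ s.t. } Ax = b \rbrace$, is affine: if $Ax_1 = b$, $Ax_2 = b$, and $\theta_1 + \theta_2 = 1$, then \[A\left(\theta_1 x_1 + \theta_2 x_2\right) = \theta_1 b + \theta_2 b = b.\]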

Def. Conic Set: A conic combination of the points $x_1, \dots, x_n$ is of the form \[\sum_{i=1}^{n}\theta_ix_i, \theta_i \geq 0.\] A set is conic if it contains all conic combinations of its points. The conic hull of a (possibly non-conic) set, $S$, denoted $\text{cone}(S)$, is the set of all conic combinations of its points; $\text{cone}(S) \supseteq S$ and for any conic set $A \supseteq S$, we have $A \supseteq \text{cone}(S)$.

Def. Hyperplane: A hyperplane is a set of the form $\left\lbrace x \text{ s.t. } a^{\top}x = b \right\rbrace$, where $a \neq 0$; $a$ is the vector normal to the plane, and $b/\|a\|_2$ is the distance between the plane and the origin in the direction of $a$. (Figure: a hyper-plane in $\mathbb{R}^2$ defined by $a^{\top}x = b$.) To understand the geometry, we rewrite $a^{\top}x = b$ as $a^{\top}(x-c) = 0$, where $c$ is some point on the hyper-plane, i.e., $a^{\top}c = b$. It follows that $a^{\top}c/\|a\|_2 = b/\|a\|_2$. But $a^{\top}c/\|a\|_2$ is the magnitude of the projection of $c$ onto $a$, i.e., $b/\|a\|_2$ is the distance between the origin and the hyper-plane in the direction of $a$.

Two disjoint convex sets can be separated by a hyperplane.

Thm. Separating Hyperplanes: Let $P, Q \subseteq \mathbb{R}^n$ be two convex sets such that $P \cap Q = \emptyset$. Then, there exists an $a \in \mathbb{R}^n$, $a \neq 0$, and $b \in \mathbb{R}$ such that for all $x \in P$, we have $a^{\top}x + b \geq 0$, and for all $x \in Q$, we have $a^{\top}x + b \leq 0$. The hyper-plane $\lbrace x \text{ s.t. } a^{\top}x + b = 0 \rbrace$ is called a separating hyper-plane of $P$ and $Q$.

See proof.

Let $p^* \in P$ and $q^* \in Q$ be solutions to the program \[\text{minimize } \|p-q\|_2^2\] (we assume here that the minimum is attained, which holds, e.g., when $P$ and $Q$ are closed and one of them is bounded). Then for any $q \in Q$, we have \[\|p^* - q^*\|_2^2 \leq \|p^* - q\|_2^2.\] First, let $x \in Q$. Then for any $t \in [0,1]$, the point $tx + (1-t)q^* \in Q$ and so \[\begin{aligned} \|p^* - q^*\|_2^2 &\leq \|p^* - (tx + (1-t)q^*)\|_2^2 \\ &= \|(p^* - q^*) - t(x - q^*)\|_2^2 \\ &= \|p^*-q^*\|_2^2 - 2t(p^*-q^*)^{\top}(x-q^*) + t^2\|x-q^*\|_2^2. \end{aligned}\]

It follows that, for $t \in (0,1]$, \[\begin{aligned} -2t(p^*-q^*)^{\top}(x-q^*) + t^2\|x-q^*\|_2^2 \geq 0 &\Rightarrow -2(p^*-q^*)^{\top}(x-q^*) + t\|x-q^*\|_2^2 \geq 0 \\ &\Rightarrow \lim_{t\rightarrow 0^+}\left\lbrace -2(p^*-q^*)^{\top}(x-q^*) + t\|x-q^*\|_2^2 \right\rbrace \geq 0 \\ &\Rightarrow -2(p^*-q^*)^{\top}(x-q^*) \geq 0 \\ &\Rightarrow (p^*-q^*)^{\top}(x-q^*) \leq 0 \\ &\Rightarrow (p^*-q^*)^{\top}x - (p^* - q^*)^{\top}q^* \leq 0, \quad \forall x \in Q. \end{aligned}\]

Now let $x \in P$. Then for any $t \in [0,1]$, the point $tx + (1-t)p^* \in P$ and so \[\begin{aligned} \|p^* - q^*\|_2^2 &\leq \|tx + (1-t)p^*- q^*\|_2^2 \\ &= \|(p^* - q^*) +t(x-p^*)\|_2^2 \\ &= \|p^*-q^*\|_2^2 + 2t(p^*-q^*)^{\top}(x-p^*) + t^2\|x-p^*\|_2^2. \end{aligned}\]

It follows that, for $t \in (0,1]$, \[\begin{aligned} 2t(p^*-q^*)^{\top}(x-p^*) + t^2\|x-p^*\|_2^2 \geq 0 &\Rightarrow 2(p^*-q^*)^{\top}(x-p^*) + t\|x-p^*\|_2^2 \geq 0 \\ &\Rightarrow \lim_{t\rightarrow 0^+}\left\lbrace 2(p^*-q^*)^{\top}(x-p^*) + t\|x-p^*\|_2^2 \right\rbrace \geq 0 \\ &\Rightarrow (p^*-q^*)^{\top}(x-p^*) \geq 0 \\ &\Rightarrow (p^*-q^*)^{\top}x - (p^*-q^*)^{\top}p^* \geq 0, \quad \forall x \in P. \end{aligned}\]

Thus, for any $\Delta \geq 0$, we have \[a^{\top}x - a^{\top}q^* - \Delta \leq 0, \forall x \in Q,\] and \[a^{\top}x - a^{\top}p^* + \Delta \geq 0, \forall x \in P,\] where $a = p^* - q^*$.

If we can find a $\Delta \geq 0$ and a $b$ for which $-a^{\top}q^* - \Delta = -a^{\top}p^* + \Delta = b$, the proof is complete, since then $a^{\top}x + b \leq 0$ for all $x \in Q$ and $a^{\top}x + b \geq 0$ for all $x \in P$. We have \[\begin{aligned} 2\Delta = -a^{\top}q^* + a^{\top}p^* &\Rightarrow 2\Delta = a^{\top}a \\ &\Rightarrow \Delta = \frac{\|a\|_2^2}{2} \geq 0. \end{aligned}\] $\blacksquare$

Thm. Supporting Hyperplanes: Let $P \subseteq \mathbb{R}^n$ be a convex set. Then, for each $x^{(0)} \in \partial P$, there exists an $a \in \mathbb{R}^n$ such that for all $x \in P$, we have $a^{\top}x \leq a^{\top}x^{(0)}$. The hyper-plane $\lbrace x \text{ s.t. } a^{\top}x = a^{\top}x^{(0)} \rbrace$ is called a supporting hyper-plane of $P$ at $x^{(0)}$.

See proof.

We consider two cases:

  1. if $\text{int}(P) \neq \emptyset$, then $\text{int}(P)$ and $\lbrace x^{(0)} \rbrace$ are disjoint convex sets, so we can find a hyperplane separating them; since $P \subseteq \text{cl}\left(\text{int}(P)\right)$, the same hyperplane also supports $P$ at $x^{(0)}$.
  2. if $\text{int}(P) = \emptyset$, then $P$ is contained in an affine set of dimension less than $n$, and any hyperplane containing that affine set supports $P$ at $x^{(0)}$. $\blacksquare$
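
For example, let $P = \lbrace x \text{ s.t. } \|x\|_2 \leq 1 \rbrace$ and let $x^{(0)}$ be any boundary point, i.e., $\|x^{(0)}\|_2 = 1$. Taking $a = x^{(0)}$, the Cauchy–Schwarz inequality gives, for every $x \in P$, \[a^{\top}x \leq \|a\|_2\|x\|_2 \leq 1 = a^{\top}x^{(0)},\] so $\lbrace x \text{ s.t. } (x^{(0)})^{\top}x = 1 \rbrace$ is a supporting hyper-plane of $P$ at $x^{(0)}$.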

Def. Half-Space: A half-space is a set of the form $\left\lbrace x \text{ s.t. } a^{\top}x \leq b \right\rbrace$, where $a \neq 0$. (Figure: a half-space in $\mathbb{R}^2$ defined by $a^{\top}x \leq b$.)

Def. Polyhedron: A polyhedron is the solution set of finitely many linear equalities and/or inequalities, i.e., \[\left\lbrace x \text{ s.t. } Ax \preceq b, Cx = d\right\rbrace.\] Equivalently, a polyhedron is the intersection of finitely many half-spaces and hyper-planes. (Figure: a polyhedron in $\mathbb{R}^2$ defined by the half-spaces $(a^{(i)})^{\top}x \leq b^{(i)}, i = 1, 2, 3$.)

Def. Norm-Ball: A norm ball of radius, $r$, with center $x_c$, is a set of the form \[\left\lbrace x \text{ s.t. } \|x - x_c\| \leq r\right\rbrace.\]

Def. Norm-Cone: A norm cone is a set of the form \[\left\lbrace (x,t) \text{ s.t. } \|x\| \leq t \right\rbrace.\]
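
To see that these sets are convex, take the norm ball as an example: if $\|x - x_c\| \leq r$, $\|y - x_c\| \leq r$, and $\theta \in [0,1]$, then by the triangle inequality \[\|\theta x + (1-\theta)y - x_c\| = \|\theta(x - x_c) + (1-\theta)(y - x_c)\| \leq \theta r + (1-\theta)r = r.\] An analogous argument shows that the norm cone is convex.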

Operations that Preserve Convexity

To show that a set, $S$, is convex, one can resort to the definition, i.e., show that \[x_1, x_2 \in S, t \in [0, 1] \Rightarrow tx_1 +(1-t)x_2 \in S.\] However, this is often difficult to do in practice.

Another option is to show that $S$ can be formed from convex sets using operations that preserve convexity. Such operations include, but are not limited to:

  • intersection: if $S_1$ and $S_2$ are convex, then so is $S_1 \cap S_2$ (and, more generally, the intersection of any family of convex sets is convex);
  • affine transformation: if $S$ is convex and $f(x) = Ax + b$, then the image $f(S)$ and the pre-image $f^{-1}(S)$ are convex.

Convex Functions

Def. Convex Function
A function, $f: \mathbb{R}^n \rightarrow \mathbb{R}$, is convex iff $\text{dom}(f)$ is a convex set and \[f(tx + (1-t)y) \leq tf(x)+(1-t)f(y)\] for all $x, y \in \text{dom}(f)$ and $t \in [0,1]$. We say that $f$ is strictly convex if it is convex and \[f(tx+(1-t)y) < tf(x) +(1-t)f(y)\] for all $x, y \in \text{dom}(f)$ with $x \neq y$ and $t \in (0,1)$.
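
As a quick check of the definition, take $f(x) = x^2$ on $\mathbb{R}$: for any $x, y$ and $t \in [0,1]$, \[tf(x) + (1-t)f(y) - f\left(tx + (1-t)y\right) = tx^2 + (1-t)y^2 - \left(tx + (1-t)y\right)^2 = t(1-t)(x-y)^2 \geq 0,\] so $f$ is convex (and strictly convex, since the right-hand side is strictly positive when $x \neq y$ and $t \in (0,1)$).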

Necessary and Sufficient Conditions for Convexity

To check whether a function, $f$, is convex or not, one can use the aforementioned definition. However, depending on $f$, this may be very tedious. Thus, it is useful to derive some other conditions that are both necessary and sufficient for $f$ to be convex.

Thm. Characterization w.r.t. the Epigraph
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$. We define its epi-graph as the set of all points above $f$, i.e., \[\text{epi}(f) = \left\lbrace (x,t) \in \mathbb{R}^{n+1} \text{ s.t. } f(x) \leq t \right\rbrace.\] $f$ is a convex function if and only if $\text{epi}(f)$ is a convex set.

Thm. Necessary Condition w.r.t. Sub-Level Sets: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$. We define the $\alpha$ sub-level set of $f$ as the set of all points which map to values that are less than or equal to $\alpha$ under $f$, i.e., \[\text{lvl}_{\leq \alpha}(f) = \left\lbrace x \in \mathbb{R}^n \text{ s.t. } f(x) \leq \alpha \right\rbrace.\] If $f$ is convex then $\text{lvl}_{\leq \alpha}(f)$ is also convex.
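
Note that the converse is false: for example, $f(x) = -e^{x}$ on $\mathbb{R}$ is not convex, yet every sub-level set $\text{lvl}_{\leq \alpha}(f) = \lbrace x \text{ s.t. } e^{x} \geq -\alpha \rbrace$ is an interval (possibly all of $\mathbb{R}$), hence convex.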

Thm. Convexity under Linear Restrictions
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ and define $g(t) := f(x_0 + tv)$, with $\text{dom}(g) = \lbrace t \text{ s.t. } x_0 + tv \in \text{dom}(f) \rbrace$. Then, $f$ is convex (in $x$) if and only if $g$ is convex (in $t$) for every $x_0 \in \text{dom}(f)$ and $v \in \mathbb{R}^n$.

We will make use of the aforementioned condition to derive two additional conditions.

Thm. First-Order Characterization of Convexity
If $f$ is differentiable, then it is convex iff \[f(y) \geq f(x) + \nabla f(x)^{\top}(y-x)\] for all $x, y \in \text{dom}(f)$.

See proof.

Let $g(t) = f(x_0 + tv)$. Then $f$ is convex if and only if $g$ is convex for all $v \in \mathbb{R}^n$ and $x_0 \in \text{dom}(f)$, where we restrict the domain of $t$ so that $x_0 + tv \in \text{dom}(f)$.

$\Rightarrow$: Since $g$ is a scalar function of $t$, if it is convex, then \[\underbrace{g(t) \geq g(0) + g'(0)t, \forall t}_{*}.\] We compute \[g'(0) = \left.\frac{d}{dt}g(t)\right\rvert_{t=0} = \left.\nabla f^{\top}(x_0 + tv)v\right\rvert_{t=0} = \nabla f^{\top}(x_0)v.\] Making the appropriate substitutions for $g$ and $g'(0)$ into $*$, we get \[f(x_0+tv) \geq f(x_0) + \nabla f^{\top}(x_0)tv.\] Finally, we let $x = x_0 + tv$ and conclude that \[f(x) \geq f(x_0) + \nabla f^{\top}(x_0)(x-x_0),\] as desired.

$\Leftarrow$: Now assume $f(x) \geq f(x_0) + \nabla f^{\top}(x_0)(x-x_0)$ for all $x, x_0 \in \text{dom}(f)$.

Assume $x$ lies on an arbitrary line, $x = tu + (1-t)v$, where $u, v \in \text{dom}(f)$. Let $t_0$ be chosen so that $x_0 = t_0u + (1-t_0)v$.

Making the appropriate substitutions, we have \[\begin{aligned} f\left(tu + (1-t)v\right) &\geq f\left(t_0u + (1-t_0)v\right) + \nabla f^{\top}\left(t_0u + (1-t_0)v\right)\left(tu + (1-t)v - t_0u - (1-t_0)v\right) \\ &= f\left(t_0u + (1-t_0)v\right) + \nabla f^{\top}\left(t_0u + (1-t_0)v\right)\left((t - t_0)u - (t-t_0)v\right) \\ &= f\left(t_0u + (1-t_0)v\right) + \nabla f^{\top}\left(t_0u + (1-t_0)v\right)(t - t_0)(u - v). \qquad (**) \end{aligned}\]

Let $g(t) = f\left(tu + (1-t)v\right)$. Then, we compute \[g'(t) = \nabla f^{\top}\left(tu + (1-t)v\right)(u-v).\] Substituting $g$ and $g'$ into $(**)$, we obtain \[g(t) \geq g(t_0) + g'(t_0)(t-t_0),\] which means that $g$ is convex on the line defined by $u$ and $v$. Since $u$ and $v$ are arbitrary, it follows that $f$ is convex on any line through $\text{dom}(f)$, and therefore $f$ is convex. $\blacksquare$
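
For example, for $f(x) = e^{x}$ on $\mathbb{R}$, the first-order condition reads $e^{y} \geq e^{x} + e^{x}(y - x)$, which, after dividing by $e^{x}$, is the familiar inequality $e^{z} \geq 1 + z$ with $z = y - x$; hence $e^{x}$ is convex.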

Thm. Second-Order Characterization of Convexity
If $f$ is twice-differentiable, then it is convex iff $\nabla^2f(x) \succeq 0$ for all $x \in \text{dom}(f)$, and strictly convex if $\nabla^2f(x) \succ 0$ for all $x \in \text{dom}(f)$.

See proof.

Let $g(t) = f(x_0 + tv)$. Then $f$ is convex if and only if $g$ is convex for all $x_0 \in \text{dom}(f)$ and $v \in \mathbb{R}^n$, where we restrict the domain of $t$ so that $x_0 + tv \in \text{dom}(f)$.

Since $g$ is a scalar function of $t$, it is convex if and only if \[g''(t) \geq 0.\] We compute \[g''(t) = v^{\top}\nabla^2f(x_0 +tv)v.\] Thus, $g$ is convex if and only if \[\underbrace{v^{\top}\nabla^2f(x_0 + tv)v \geq 0}_{*}\] for all $t$ with $x_0 + tv \in \text{dom}(f)$. Since $x_0 \in \text{dom}(f)$ and $v \in \mathbb{R}^n$ are arbitrary, $*$ holds along every such line if and only if \[v^{\top}\nabla^2f(x)v \geq 0\] for every $x \in \text{dom}(f)$ and $v \in \mathbb{R}^n$. In other words, $g$ is convex along every such line (and consequently $f$ is convex) if and only if $\nabla^2 f(x) \succeq 0$ for every $x \in \text{dom}(f)$. $\blacksquare$
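
For example, for the least-squares objective $f(x) = \|Ax - b\|_2^2$, we have $\nabla f(x) = 2A^{\top}(Ax - b)$ and $\nabla^2 f(x) = 2A^{\top}A$. Since $v^{\top}\left(2A^{\top}A\right)v = 2\|Av\|_2^2 \geq 0$ for every $v$, the Hessian is positive semi-definite everywhere, and $f$ is convex.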

Operations that Preserve Convexity

There are several operations that preserve convexity of functions. Such operations include, but are not limited to:

  • non-negative weighted sums: if $f_1, \dots, f_k$ are convex and $w_1, \dots, w_k \geq 0$, then $\sum_{i=1}^{k}w_if_i$ is convex;
  • composition with an affine map: if $f$ is convex, then $x \mapsto f(Ax+b)$ is convex;
  • point-wise maximum: if $f_1, \dots, f_k$ are convex, then $x \mapsto \max_i f_i(x)$ is convex.

Convex Optimization Problems

A convex optimization problem in standard form is one in which:

  • the objective, $f$, is a convex function;
  • the inequality-constraint functions, $g_1, \dots, g_m$, are convex; and
  • the equality-constraint functions, $h_1, \dots, h_n$, are affine.

The reason the $h_j$ must be affine (as opposed to merely convex) is that the constraint $h_j(x) = 0$ is equivalent to the constraints $h_j(x) \leq 0$ and $-h_j(x) \leq 0$. Thus, we need both $h_j$ and $-h_j$ to be convex. This is only true if $h_j$ is affine.

Written out explicitly, a convex program is of the following form: \[\begin{equation}\begin{aligned}\text{minimize } &f(x) \\ \text{subject to } &g_i(x) \leq 0, i = 1,\dots, m \\ &a^{\top}_jx = b_j, j = 1, \dots, n\end{aligned}\end{equation}\]

Often, we will simply replace the (scalar) equality constraints with the single vector constraint $Ax=b$, where \[A = \begin{bmatrix} a_1^{\top} \\ \vdots \\ a_n^{\top} \end{bmatrix} \text{ and } b = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}.\]

Thus, the program becomes \[\begin{equation}\begin{aligned}\text{minimize } &f(x) \\ \text{subject to } &g_i(x) \leq 0, i = 1,\dots, m \\ &Ax = b\end{aligned}\end{equation}\]
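
As a minimal computational sketch (assuming the cvxpy package is available; the data and constraints below are arbitrary and chosen only for illustration), a convex program of this form can be specified and solved as follows:

```python
import numpy as np
import cvxpy as cp

# Arbitrary problem data, for illustration only.
np.random.seed(0)
A0 = np.random.randn(5, 3)        # data defining the convex objective
b0 = np.random.randn(5)
A = np.ones((1, 3))               # single affine equality constraint Ax = b
b = np.array([1.0])

x = cp.Variable(3)
objective = cp.Minimize(cp.sum_squares(A0 @ x - b0))   # convex objective f(x)
constraints = [cp.norm(x, 2) - 2 <= 0,                 # convex inequality g(x) <= 0
               A @ x == b]                             # affine equality Ax = b
prob = cp.Problem(objective, constraints)
prob.solve()

print("optimal value:", prob.value)
print("optimal x:", x.value)
# Dual variables correspond to the multipliers lambda and mu discussed above.
print("duals:", constraints[0].dual_value, constraints[1].dual_value)
```

Note that cvxpy only accepts problems it can verify to be convex, which is one practical reason for insisting on the standard form above.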

There are a few important properties of convex optimization problems:

  • the feasible set, $\mathcal{X}$, is convex; and
  • any locally optimal point is also globally optimal.

First-Order Conditions

If $f$ is differentiable, then a feasible point, $x$, is optimal if and only if \[\nabla f(x)^{\top}(y-x) \geq 0, \forall y \in \mathcal{X}.\]

See proof.

We want to show that \[f(x) \leq f(y), \forall y \in \mathcal{X} \Leftrightarrow \nabla f^{\top}(x)(y-x) \geq 0, \forall y \in \mathcal{X}.\] $\Leftarrow$: Assume $\nabla f^{\top}(x)(y-x) \geq 0$ for all $y \in \mathcal{X}$. Since $f$ is convex and differentiable, we have \[f(y) \geq f(x) + \nabla f^{\top}(x)(y-x) \geq f(x).\] $\Rightarrow$: Suppose $\exists y \in \mathcal{X}$ such that $\nabla f^{\top}(x)(y-x) < 0$, and define $z(\theta) = \theta y + (1-\theta) x$, where $\theta \in [0, 1]$. Since $\mathcal{X}$ is convex, it follows that $z(\theta) \in \mathcal{X}$. Since $f$ is differentiable, we compute \[\nabla f^{\top}(x)(y-x) = \lim_{\theta\rightarrow 0^+}\frac{f\left(z(\theta)\right) - f(x)}{\theta} = \lim_{\theta\rightarrow 0^+}\frac{f\left(x + \theta (y-x)\right) - f(x)}{\theta}< 0.\] Thus, for all sufficiently small $\theta > 0$, we have $f\left(z(\theta)\right) < f(x)$. Since $z(\theta) \rightarrow x$ as $\theta \rightarrow 0$, every neighbourhood of $x$ contains a feasible point with a strictly smaller objective value, so $x$ is not optimal. $\blacksquare$

Second-Order Conditions

If $f$ is twice-differentiable, then a necessary condition for $x$ to be locally optimal is that, for all $y \in \mathcal{X}$, \[\nabla f(x)^{\top}(y-x) = 0 \Rightarrow (y-x)^{\top}\nabla^2f(x)(y-x) \geq 0.\]


Duality in Convex Optimization

It turns out that strong duality holds for most convex problems. In the case of a convex optimization problem, we can show that the resulting $\mathcal{S}$ is also convex.

Thus, there exists a supporting hyperplane to $\mathcal{S}$ at $(0,0,f^*)$ and for any $(v,u,t) \in \mathcal{S}$, we may write \[\begin{bmatrix} \chi^{\top} & \eta^{\top} & w \end{bmatrix}\begin{bmatrix} v \\ u \\ t \end{bmatrix} \geq \begin{bmatrix} \chi^{\top} & \eta^{\top} & w \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ f^* \end{bmatrix} = wf^*,\] or equivalently, \[\sum_{i=1}^{m}\chi_iv_i + \sum_{j=1}^{n}\eta_ju_j + wt \geq wf^*.\] The latter form shows that $\chi, w \geq 0$. To see this, note that $\mathcal{S}$ is closed under addition w.r.t. $\mathbb{R}^m_{+} \times \lbrace 0 \rbrace \times \mathbb{R}_{+}$. Thus, if $\chi_i < 0$ (resp. $w < 0$), we could choose $v_i$ (resp. $t$) to be arbitrarily large, making the left-hand side unbounded from below, which contradicts the inequality.

We say the hyperplane is non-vertical if $w \neq 0$. Assuming $w \neq 0$, it follows that \[\begin{bmatrix} \chi^{\top}/w & \eta^{\top}/w & 1 \end{bmatrix}\begin{bmatrix} v \\ u \\ t \end{bmatrix} \geq f^*,\] where $\chi/w \geq 0$. Since $(g(x),h(x),f(x)) \in \mathcal{S}$ for every $x \in \mathcal{D}$, taking the infimum over $x$ gives \[\zeta\left(\chi/w, \eta/w\right) = \inf_{x}\left\lbrace f(x) + (\chi/w)^{\top}g(x) + (\eta/w)^{\top}h(x)\right\rbrace \geq f^*.\] Since $\chi/w \geq 0$, we also know (from weak duality) that $\zeta\left(\chi/w, \eta/w\right) \leq f^*$. It follows that $\zeta\left(\chi/w, \eta/w\right) = f^*$, i.e., strong duality holds.

Thm. Slater’s Condition: If $\exists x \in \mathcal{X}$, such that $g(x) < 0$, then there exists a non-vertical supporting hyperplane to $\mathcal{S}$ at $(0,0,f^*)$.

See proof.

Assume $\tilde{x}$ satisfies Slater's condition, i.e., $g(\tilde{x}) < 0$ and $A\tilde{x} - b = 0$, and suppose, for contradiction, that $w = 0$, i.e., the supporting hyperplane is vertical. Evaluating the supporting-hyperplane inequality at the point $\left(g(\tilde{x}), A\tilde{x}-b, f(\tilde{x})\right) \in \mathcal{S}$ gives $\chi^{\top}g(\tilde{x}) \geq 0$. On the other hand, since $g(\tilde{x}) < 0$ and $\chi \geq 0$, we have $\chi^{\top}g(\tilde{x}) \leq 0$, with equality only if $\chi = 0$. It must therefore be that $\chi = 0$, and since $(\chi, \eta, w) \neq 0$, it follows that $\eta \neq 0$. But then the inequality reduces to $\eta^{\top}(Ax - b) \geq 0$ for all $x \in \mathcal{D}$, while $\eta^{\top}(A\tilde{x} - b) = 0$; assuming (as is standard) that $\tilde{x}$ lies in the interior of $\mathcal{D}$ and that $A$ has full row rank, we can perturb $\tilde{x}$ slightly to make $\eta^{\top}(Ax - b)$ strictly negative, a contradiction. Hence $w \neq 0$, i.e., the supporting hyperplane is non-vertical.
$\blacksquare$

Examples of Convex Optimization Problems

We now consider various types of convex optimization problem.

Everything we’ve discussed thus far can be generalized by replacing the scalar inequality constraints with generalized inequalities in multiple dimensions.

Def. Generalized Inequality:
Consider a proper cone, $K \subseteq \mathbb{R}^n$, and $x, y \in \mathbb{R}^n$. We say that $x \leq_K y$ iff $(y-x) \in K$, and $x <_K y$ iff $(y-x) \in \text{int}(K)$. Note that, unlike the usual inequality on $\mathbb{R}$, $\leq_K$ is only a partial order: it can happen that neither $x \leq_K y$ nor $y \leq_K x$ holds.

(Figure: an example of generalized inequalities using a proper cone in $\mathbb{R}^2$.)
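
For example, with $K = \mathbb{R}^2_{+}$, $x \leq_K y$ means $x_1 \leq y_1$ and $x_2 \leq y_2$; the points $x = (1, 0)$ and $y = (0, 1)$ are then incomparable, since neither $y - x$ nor $x - y$ lies in $K$. Another common choice is the cone of positive semi-definite matrices, for which $X \leq_K Y$ means that $Y - X$ is positive semi-definite.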

This leads to the general convex optimization problem: \[\begin{equation}\begin{aligned}\text{minimize } &f(x) \\ \text{subject to } &g_i(x) \leq_K 0, i = 1,\dots, m \\ &Ax=b\end{aligned}\end{equation}\] In the case where $K = \mathbb{R}_{+}^p$ (with each $g_i$ mapping into $\mathbb{R}^p$), we obtain the standard component-wise inequality.