Alright, before we start: I'm going to refer to the Conditional Gradient Algorithm as the Frank-Wolfe algorithm (FW), after Marguerite Frank and Philip Wolfe, who wrote the very first paper on these methods in 1956! The simplex method (one of the top 10 algorithms of the 20th century, and I'd say simplex gets first place) was introduced by Dantzig in 1947 for solving linear optimization (LO) problems (I strongly encourage you to read the history on the wiki page, it's amazing!). An LO algorithm solves the following question. Say we are given the following two pieces of data: 1. a linear function $f$ on, say, $\mathbb{R}^n$; 2. a legal set: a (sub)set of points $C\subseteq \mathbb{R}^n$ that can be described using only linear equations or inequalities. Now, how can I pick the point in the legal set $C$ where $f$ achieves its minimum value (or maximum value; it doesn't matter in this case only, in general this is not true)?

We will refer to the legal set as the "feasible" set since that's the convention. It's very important that $C$ can be encoded using only linear equations or inequalities; such a set is called a polyhedron (or convex polyhedron). For example, in two dimensions, regular polygons are cool whereas non-convex shapes are not cool! To summarize the simplex algorithm in one line: Dantzig observed and proved that the point we are looking for is always one of the vertices of the polyhedron; to be precise, one of the vertices optimizes $f$, or in other words, an optimal solution can be assumed to be at one of the vertices. The last line is not completely true: consider the problem $\max x ~~\text{s.t. } x\geq 0$, or in other words, find the maximum nonnegative number. The problem with this example is that the feasible set, in this case the set of nonnegative numbers, is unbounded. The cool fact about the simplex method is that when it finishes (believe me that this happens in finite time), it will either give us the point we are looking for or inform us that the problem is unbounded, that is, it can keep going in some direction and get better and better values (pretty cool, no?). Anyway, in this post we will assume that the feasible set is bounded, so the answer to the LO can be found at a vertex, also known as an extreme point of the feasible set. So in principle, if you have a list of extreme points, we are done; the simplex algorithm basically computes those for us. Simplex is (and probably will remain) the method of choice for solving LOs even after 60 years!

To the theory folks: the simplex algorithm is not a polynomial-time algorithm in the worst case, see here for an example. In fact, many (read: all) attempts to fix this were in vain. But thankfully, many smart people were able to show that LOs can be solved in polynomial time with different algorithms, e.g., Karmarkar's algorithm or the ellipsoid algorithm. But these were inefficient on practical instances, so simplex was always preferred, for reasons that remained mysterious. Recently, Daniel Spielman (and coauthors) came up with a new concept called smoothed analysis which successfully explained the performance of the simplex algorithm. This post is not about the simplex algorithm, but I felt all this was needed to appreciate the FW method, and it sets up the stage very nicely. Practically speaking, say you have purchased a commercially available LO solver, e.g., CPLEX, Gurobi, etc.; can you use it to solve, say, convex optimization problems? The answer is a resounding yes, provided you are 1. smart, 2. patient, and 3. willing to give up a few things. This is similar to using a laptop (internet) to watch TV shows. Smart: you know how to use a web browser and a search engine; patient: TV cables can transmit data much, much faster than the internet as of now; give up: you can't watch all the shows (obviously, right?) but you can watch the popular/important ones.

The FW algorithm solves the following question. Here is the given data: 1. a differentiable convex function $f$; 2. a feasible set $C$ that is convex and compact. Now, how can I pick the point in $C$ where $f$ achieves its minimum value? A few comments are in order: 1. the class of optimizable functions $f$ is much bigger (a linear function is convex and differentiable); 2. $C$ is much bigger in some sense: bounded (hence compact) polyhedra (which are convex) are cool, but we do require $C$ to be bounded. For the uninitiated, think of compact as closed and bounded, so inequalities are always of the form $\geq$ or $\leq$ and never strict $<$ or $>$. You can only have equations that are linear (or affine) because $\{x: f(x)=0\}$ is convex if and only if $f$ is affine (exercise!). Alright, here is the algorithm:

• Pick a point $x_0\in C$.
• Repeat for $t=0,1,2,\dots$:
1. Compute the gradient $g:=\nabla f(x_t)$.
2. Compute $s:=\arg\min_{z\in C}z^Tg$.
3. Set $x_{t+1}:=(1-\gamma_t)x_t+\gamma_t s$.
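The loop above can be sketched in a few lines of Python (a minimal sketch; the names `grad` and `lmo` and the toy problem are mine, not from the original):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters=2000):
    """Minimal Frank-Wolfe loop. `grad(x)` returns the gradient of f at x;
    `lmo(g)` is a linear minimization oracle solving step 2, i.e.
    argmin_{z in C} z^T g."""
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        g = grad(x)                        # step 1: gradient at the current iterate
        s = lmo(g)                         # step 2: linear subproblem over C
        gamma = 2.0 / (t + 2.0)            # predetermined step size (see the proof below)
        x = (1.0 - gamma) * x + gamma * s  # step 3: convex combination stays in C
    return x

# Toy problem: minimize f(x) = 0.5 * ||x - b||^2 over the probability simplex.
# The linear subproblem there is solved by the vertex e_i with i = argmin_i g_i.
b = np.array([0.2, 0.5, 0.3])

def lmo_simplex(g):
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

x = frank_wolfe(grad=lambda x: x - b, lmo=lmo_simplex, x0=np.array([1.0, 0.0, 0.0]))
```

Since $b$ itself lies in the simplex, the iterates approach $b$; note also that every iterate is a convex combination of vertices, so it stays feasible for free.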

Picking $x_0\in C$ is not strictly required: since $\gamma_0=1$ with the step sizes we will use, step 3 makes the next iterate lie in $C$ anyway, so that's cool. That leads us to the following amazing observation: assuming we can compute $g$ reasonably well, we have essentially converted a nonlinear (but convex) optimization problem into a series of optimization problems with linear objectives that provably converges! If $C$ is a polyhedron, then we solve an LO at each iteration. Secondly, we will see how to choose the step sizes $0\leq \gamma_t\leq 1$ when we get to the proof. So, at a very high level, the algorithm is impressive whenever step 2 can be done fast, and it turns out that this is true for many problems in machine learning (I'll post some references later).
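As one concrete ML-flavored example (my example, not from the original): over an $\ell_1$ ball, a constraint that encourages sparsity, step 2 has a closed form touching a single coordinate:

```python
import numpy as np

def lmo_l1_ball(g, radius=1.0):
    """argmin_{||s||_1 <= radius} s^T g: put all the mass on the coordinate
    with the largest |g_i|, with sign opposite to g_i (a vertex of the ball)."""
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -radius * np.sign(g[i])
    return s
```

Each step costs $O(n)$ and adds at most one nonzero coordinate, so after $t$ iterations the iterate has at most $t+1$ nonzeros; this cheap, sparsity-preserving oracle is one reason FW is popular in machine learning.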

Convergence: We will discuss only the most basic convergence proof. I should mention that a lot of work has gone into making the algorithm faster and the proof tighter, both in general and in specific cases. Intuitively, if $f$ is linear, then we are done in one iteration by setting $\gamma_0=1$, surprise, yay! This suggests that we need a measure of how close to linear the function is; let's call it $K_f$, the curvature constant of $f$.

Definition: $K_f:=\sup\limits_{\substack{x,s\in C,\ \alpha\in[0,1],\\ y=x+\alpha(s-x)}} \frac{1}{\alpha^2}\left(f(y)-(y-x)^T\nabla f(x)-f(x) \right)$

Basically, it measures how good the linear approximation to $f$ is over the feasible set $C$ (note that $y$ is constrained to lie on the segment from $x$ towards some $s\in C$; otherwise the supremum would blow up as $\alpha\to 0$).
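As a quick sanity check (my example), take a quadratic $f(x)=\frac{1}{2}x^TAx+b^Tx$ with $A\succeq 0$. The term inside the supremum is exactly the second-order remainder,

$f(y)-(y-x)^T\nabla f(x)-f(x)=\frac{1}{2}(y-x)^TA(y-x)$

and substituting $y-x=\alpha(s-x)$ cancels the $\frac{1}{\alpha^2}$ factor, giving

$K_f=\sup_{x,s\in C}\frac{1}{2}(s-x)^TA(s-x)\leq \frac{1}{2}\lambda_{\max}(A)\,\mathrm{diam}(C)^2$

so $K_f$ is finite whenever $C$ is bounded and the Hessian is bounded, which matches the intuition that flatter (more linear) functions have smaller $K_f$.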

Definition: The Wolfe dual $\omega(x)$ is defined as $\omega(x):=\min_{s\in C} f(x)+(s-x)^Tg$ where $g:=\nabla f(x)$.

Weak Duality: $\omega(x)\leq f(y)$ for all $x,y\in C$.

Proof: Take any $x,y\in C$. By the minimization in the definition of $\omega(x)$, we have that,

$\omega(x)\leq f(x)+(y-x)^Tg\leq f(y)$

where the second inequality is due to the convexity of $f$. $\square$

Denote the primal-dual gap at $x$ by $l(x)$ defined as,

$l(x):=f(x)-\omega(x)=\max_{y\in C}(x-y)^Tg$

So, by weak duality, we have that $l(x)\geq f(x)-f(x^*)\geq 0$, where $x^*$ is a minimizer of $f$ over $C$. With this, we can prove a stepwise improvement in the value of the objective function.
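A nice practical consequence: $l(x)$ is computed from the same argmin as step 2, so the algorithm gets a certificate of near-optimality for free at every iteration. A sketch (the helper name and the example numbers are mine):

```python
import numpy as np

def fw_gap(x, s, g):
    """Primal-dual gap l(x) = (x - s)^T g, where s is the step-2 solution
    argmin_{z in C} z^T g. By weak duality, f(x) - f(x*) <= l(x), so a small
    gap certifies near-optimality without knowing x*."""
    return float((x - s) @ g)

# Example: f(x) = 0.5 * ||x - b||^2 over the probability simplex, at x = e_0.
b = np.array([0.2, 0.5, 0.3])
x = np.array([1.0, 0.0, 0.0])
g = x - b                        # gradient of f at x
s = np.array([0.0, 1.0, 0.0])    # argmin of z^T g over the simplex (here z = e_1)
gap = fw_gap(x, s, g)            # upper bounds f(x) - f(x*)
```

Stopping once the gap drops below a tolerance is a common alternative to running a fixed number of iterations.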

Stepwise improvement: For any $x_{t+1}:=x_t+\gamma(s-x_t)$ with arbitrary step size $\gamma\in[0,1]$, we have that,

$f(x_{t+1})\leq f(x_t)-\gamma l(x_t)+\gamma^2K_f$

Proof: Let $x:=x_t, y:=x_{t+1}=x+\gamma(s-x)$. From the definition of $K_f$,

$\gamma^2K_f\geq f(y)-(y-x)^T\nabla f(x)-f(x)$

$\quad=f(y)-\gamma(s-x)^T\nabla f(x)-f(x)$

Rewriting the above inequality and using the definition of $l(x)$ we have that,

$f(y)\leq f(x)-\gamma l(x)+\gamma^2 K_f\quad \square$.

Main convergence: For $t\geq 1$, $f(x_{t})-f(x^*)\leq \frac{4K_f}{t+2}=\mathcal{O}\left(\frac{K_f}{t}\right)$

Proof: Define $h(x):=f(x)-f(x^*)$ and use the step sizes $\gamma_t=\frac{2}{t+2}$. Since $l(x)\geq h(x)$, the stepwise improvement lemma gives $h(x_{t+1})\leq (1-\gamma_t)h(x_t)+\gamma_t^2K_f$. We show $h(x_t)\leq \frac{4K_f}{t+2}$ by induction. Base case: $\gamma_0=1$, so $h(x_1)\leq K_f\leq \frac{4K_f}{3}$. Induction step: assuming the claim for $t$,

$h(x_{t+1})\leq \left(1-\frac{2}{t+2}\right)\frac{4K_f}{t+2}+\frac{4K_f}{(t+2)^2}=\frac{4K_f(t+1)}{(t+2)^2}\leq \frac{4K_f}{t+3}$

where the last inequality holds because $(t+1)(t+3)\leq (t+2)^2$. $\square$

Remarks: 1. There is no need for a step-size search as in many other algorithms, since the step sizes are predetermined; 2. convergence is pretty slow in general since it is sublinear; 3. the proof goes through if you choose to use line-search strategies (for example, the Wolfe conditions) as well; 4. we need $K_f$ to be small, or at least bounded, for the convergence guarantee to be useful. Next time we will see some examples.