This is a summary of the math behind the support vector machine: we start from the constrained optimization problem, solve it using the Lagrangian function, and apply the result to the support vector machine.

# Lagrangian function and primal-dual

A general (primal) optimization problem can be posed as follows:

$$\begin{aligned} \min_x \quad & f(x) \\ \text{s.t.} \quad & h_i(x) \le 0, \quad i = 1, \dots, m \\ & l_j(x) = 0, \quad j = 1, \dots, r \end{aligned}$$

This is a standard form, using min and $\le$.

The corresponding Lagrangian function is

$$L(x, u, v) = f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x)$$

where $u_i$ and $v_j$ are Lagrangian multipliers: $u_i \ge 0$ because they correspond to inequalities upper-bounded by 0, while $v_j \in \mathbb{R}$ can be positive or negative because they correspond to equalities.

Lemma 1: Any feasible $x$ satisfies $f(x) = \sup_{u\ge 0, v} L(x, u, v)$, and the supremum is attained when $u_i h_i(x) = 0$ for all $i$.

Proof of Lemma 1: $x$ is feasible iff $h_i(x) \le 0$ for all $i$ and $l_j(x) = 0$ for all $j$. Thus, for any $u \ge 0$ and $v$, each term $u_i h_i(x) \le 0$ and each term $v_j l_j(x) = 0$, so

$$L(x, u, v) = f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x) \le f(x),$$

and the equality holds iff $u_i h_i(x) = 0$ for all $i$.

The condition $u_i h_i(x) = 0$ is the complementary slackness part of the Karush-Kuhn-Tucker (KKT) optimality conditions. We will use it repeatedly when solving the Lagrangian function for optimality.

Lemma 2: The optimal value $f^{\ast} = \min_x f(x)$ (over feasible $x$) satisfies

$$f^{\ast} = \min_x \sup_{u \ge 0,\, v} L(x, u, v)$$

Proof of Lemma 2: By Lemma 1, any feasible $x$ satisfies $f(x) = \sup_{u\ge 0, v} L(x, u, v)$, while for infeasible $x$ the supremum is $+\infty$ and so never attains the minimum. Substituting into $f^{\ast} = \min_x f(x)$ gives the result.

## Duality

In some cases, we can swap the order of the two operators to obtain

$$\max_{u \ge 0,\, v} \inf_x L(x, u, v)$$

where we define the Lagrangian dual function as

$$g(u, v) = \inf_x L(x, u, v)$$

The infimum does not require $x$ to be in the feasible region. So the corresponding dual problem of the constrained optimization is:

$$\begin{aligned} \max_{u, v} \quad & g(u, v) \\ \text{s.t.} \quad & u \ge 0 \end{aligned}$$

and the optimal value is

$$g^{\ast} = \max_{u \ge 0,\, v} g(u, v)$$

Lemma 3: The dual problem is always convex, even if the primal problem is not.

Proof of Lemma 3: By definition of $L(x, u, v)$, we have

$$g(u, v) = \inf_x \left[ f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x) \right]$$

For each fixed $x$, the bracketed expression is affine in $(u, v)$; thus $g(u,v)$ is a pointwise infimum of affine functions of $u$ and $v$, hence concave. Therefore the dual problem is a maximization of a concave function, which is a convex problem.

We have optimal values $f^{\ast}$ from the primal problem and $g^{\ast}$ from the dual problem. For a minimization primal we always have $f^{\ast} \ge g^{\ast}$; this is called weak duality. When $f^{\ast} = g^{\ast}$, we have strong duality.

# Examples of deriving Lagrangian dual

## Least square solution of linear equation

For vectors $x$ and $b$ and matrix $A$, consider the least-norm solution of a linear system:

$$\min_x \; x^T x \quad \text{s.t.} \quad Ax = b$$

The Lagrangian function is:

$$L(x, v) = x^T x + v^T (Ax - b)$$

To look for $\inf_x L(x,v)$ we solve for

$$\nabla_x L(x, v) = 2x + A^T v = 0 \quad \Rightarrow \quad x = -\frac{1}{2} A^T v$$

Therefore the dual function is

$$g(v) = -\frac{1}{4} v^T A A^T v - b^T v$$

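A quick numerical check of this derivation (a sketch with NumPy; the problem sizes are arbitrary): maximizing $g(v)$ by setting its gradient $-\frac{1}{2}AA^Tv - b$ to zero gives $v^{\ast} = -2(AA^T)^{-1}b$, from which the primal solution is recovered and strong duality can be confirmed.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # underdetermined system: 3 equations, 5 unknowns
b = rng.standard_normal(3)

# Maximize g(v) = -(1/4) v^T A A^T v - b^T v by setting its gradient to zero:
#   -(1/2) A A^T v - b = 0   =>   v* = -2 (A A^T)^{-1} b
v_star = -2 * np.linalg.solve(A @ A.T, b)
x_star = -A.T @ v_star / 2        # primal solution recovered from the dual

f_star = x_star @ x_star                                   # primal optimal value
g_star = -0.25 * v_star @ A @ A.T @ v_star - b @ v_star    # dual optimal value

assert np.allclose(A @ x_star, b)                  # primal feasibility
assert np.allclose(f_star, g_star)                 # strong duality: f* = g*
assert np.allclose(x_star, np.linalg.pinv(A) @ b)  # matches the least-norm solution
```

The last assertion confirms that the dual route reproduces the classical least-norm solution $x^{\ast} = A^T(AA^T)^{-1}b$.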
## Quadratic programming

Assume we have a symmetric positive semi-definite matrix $Q \succeq 0$ and consider the quadratic program

$$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax = b, \; x \ge 0$$

The Lagrangian function is

$$L(x, u, v) = \frac{1}{2} x^T Q x + c^T x - u^T x + v^T (Ax - b)$$

with $u \ge 0$.

Deriving the dual function, we set the gradient to zero:

$$\nabla_x L = Qx + c - u + A^T v = 0 \quad \Rightarrow \quad x = -Q^{-1} (c - u + A^T v)$$

(assuming here that $Q \succ 0$ so that $Q^{-1}$ exists).

Substituting back into $L(x, u, v)$,

$$g(u, v) = -\frac{1}{2} (c - u + A^T v)^T Q^{-1} (c - u + A^T v) - b^T v$$

If we relax the constraint $x \ge 0$, we have a quadratic programming problem with only equality constraints:

$$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax = b$$

The optimality conditions are then

$$Qx + c + A^T v = 0, \qquad Ax = b,$$

which we can rewrite into matrix form and solve:

$$\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ v \end{bmatrix} = \begin{bmatrix} -c \\ b \end{bmatrix}$$

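Solving this KKT system is one linear solve. A minimal NumPy sketch (the problem data below are random, and $Q$ is made positive definite so the system is nonsingular):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)       # symmetric positive definite
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Assemble and solve the KKT system  [Q A^T; A 0] [x; v] = [-c; b]
kkt = np.block([[Q, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(kkt, rhs)
x, v = sol[:n], sol[n:]

assert np.allclose(A @ x, b)               # primal feasibility
assert np.allclose(Q @ x + c + A.T @ v, 0) # stationarity of the Lagrangian
```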
## QCQP

A quadratically constrained quadratic program (QCQP) has the form

$$\min_x \; \frac{1}{2} x^T P_0 x + q_0^T x + r_0 \quad \text{s.t.} \quad \frac{1}{2} x^T P_i x + q_i^T x + r_i \le 0, \quad i = 1, \dots, m$$

If we define $P(\lambda) = P_0 + \sum_i\lambda_iP_i$, $q(\lambda) = q_0 + \sum_i\lambda_iq_i$, and $r(\lambda) = r_0 + \sum_i\lambda_ir_i$, then the Lagrangian function is

$$L(x, \lambda) = \frac{1}{2} x^T P(\lambda) x + q(\lambda)^T x + r(\lambda)$$

then, assuming $P(\lambda) \succ 0$,

$$\nabla_x L = P(\lambda) x + q(\lambda) = 0 \quad \Rightarrow \quad x = -P(\lambda)^{-1} q(\lambda)$$

Substituting back into $L(x, \lambda)$,

$$g(\lambda) = -\frac{1}{2} q(\lambda)^T P(\lambda)^{-1} q(\lambda) + r(\lambda)$$

So the corresponding dual problem is

$$\max_\lambda \; -\frac{1}{2} q(\lambda)^T P(\lambda)^{-1} q(\lambda) + r(\lambda) \quad \text{s.t.} \quad \lambda \ge 0$$

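We can sanity-check this dual numerically. The sketch below (assumptions: a single constraint, and $P_0, P_1 \succ 0$ so that $P(\lambda)$ is invertible for all $\lambda \ge 0$) verifies weak duality, $g(\lambda) \le f(x)$, over random feasible points and multipliers:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3

def random_pd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + np.eye(n)    # strictly positive definite

P0, P1 = random_pd(n), random_pd(n)
q0, q1 = rng.standard_normal(n), rng.standard_normal(n)
r0, r1 = 0.0, -5.0                # r1 < 0 so x = 0 is strictly feasible

def f(x):                          # primal objective
    return 0.5 * x @ P0 @ x + q0 @ x + r0

def feasible(x):                   # single quadratic constraint
    return 0.5 * x @ P1 @ x + q1 @ x + r1 <= 0

def g(lam):                        # dual function; P(lam) > 0 for lam >= 0
    P = P0 + lam * P1
    q = q0 + lam * q1
    r = r0 + lam * r1
    return -0.5 * q @ np.linalg.solve(P, q) + r

# Weak duality: g(lam) <= f(x) for every lam >= 0 and every feasible x.
for _ in range(100):
    x = 0.1 * rng.standard_normal(n)
    lam = rng.uniform(0, 10)
    if feasible(x):
        assert g(lam) <= f(x) + 1e-9
```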
# Binary classification with support vector machine

Traditionally a binary classifier has output $\pm 1$. Assume the inputs $x_i$ and outputs $y_i = \pm 1$ are given for $N$ data points. If the data are linearly separable, we look for the hyperplane $w^T x + b = 0$ that maximizes the margin, which (after normalizing $w$ and $b$) satisfies

$$y_i (w^T x_i + b) \ge 1 \quad \text{for all } i,$$

and we use $y = \mathrm{sgn}(w^T x + b)$ as the classifier.

The optimization problem is therefore

$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, N$$

The above is called the hard margin model. If we allow errors in classification, we have the soft margin model:

$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + c \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N$$

where $\xi_i \ge 0$ is the error margin and $c > 0$ is the trade-off between margin and loss. A point is misclassified when $\xi_i > 1$, but since we minimize $\sum_i \xi_i$, misclassification is avoided as much as possible.

The Lagrangian function in this case is therefore

$$L(w, b, \xi, \alpha, \nu) = \frac{1}{2} \lVert w \rVert^2 + c \sum_i \xi_i + \sum_i \alpha_i \left( 1 - \xi_i - y_i (w^T x_i + b) \right) - \sum_i \nu_i \xi_i$$

with $\alpha_i\ge 0$ and $\nu_i\ge 0$. Setting the partial derivatives to zero,

$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = c - \alpha_i - \nu_i = 0.$$

Substitute $w=\sum_i\alpha_iy_ix_i$ back into $L(w,b,\xi,\alpha,\nu)$ to find the dual function:

$$g(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

and the corresponding dual problem (where $0 \le \alpha_i \le c$ follows from $c - \alpha_i - \nu_i = 0$ and $\nu_i \ge 0$):

$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le c, \quad \sum_i \alpha_i y_i = 0$$

Solving this dual problem gives $\alpha_i$, from which we can find the optimal $w^{\ast} = \sum_i\alpha_iy_ix_i$. Note that $w^{\ast}$ depends only on the data points $(x_i,y_i)$ for which $\alpha_i\ne 0$. Those data points are the support vectors of the model, which gives it its name. To find the offset $b$ of the hyperplane, we have for each support vector $(x_i, y_i)$ with $0 < \alpha_i < c$,

$$y_i (w^{\ast T} x_i + b) = 1 \quad \Rightarrow \quad b = y_i - w^{\ast T} x_i,$$

and numerically we get a different $b$ from each such $(x_i, y_i)$. In practice we usually use the mean:

$$b^{\ast} = \frac{1}{|S|} \sum_{i \in S} \left( y_i - w^{\ast T} x_i \right), \qquad S = \{ i : 0 < \alpha_i < c \}.$$

We prefer to solve an SVM via the dual problem, as it gives the hyperplane parameters in closed form using only dot products of the data points.
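As a sketch of the recovery of $w^{\ast}$ from the dual solution (assuming scikit-learn is available; the two-blob dataset is synthetic), we can fit a linear `SVC`, whose `dual_coef_` stores $\alpha_i y_i$ for the support vectors only, and confirm that $w^{\ast} = \sum_i \alpha_i y_i x_i$ matches the fitted hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for support vectors only, so the sum
# w* = sum_i alpha_i y_i x_i reduces to a sum over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
assert np.allclose(w, clf.coef_)          # matches sklearn's own hyperplane

# Only a few points have alpha_i != 0: the support vectors.
assert 0 < len(clf.support_) < len(X)
```

Note how sparse the solution is: all points except the support vectors have $\alpha_i = 0$ and contribute nothing to $w^{\ast}$.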

The complementary slackness conditions, $\alpha_i \left( 1 - \xi_i - y_i (w^T x_i + b) \right) = 0$ and $\nu_i \xi_i = 0$, tell us some additional information about the error and the Lagrangian multipliers:

• In case of correct classification away from the margin, $\alpha_i=0$, we have $\nu_i = c > 0$ and therefore $\xi_i = 0$. In this case, $y_i (w^T x_i + b) \ge 1$.

• In case the data point lies exactly on the margin boundary, $0 < \alpha_i < c$, we have $\nu_i > 0$ and $\xi_i = 0$. In this case, $y_i (w^T x_i + b) = 1$.

• In case of some margin error, $\alpha_i = c > 0$, we have $\nu_i = 0$ and $\xi_i \ge 0$. In this case, $y_i (w^T x_i + b) \le 1$.

## Kernel tricks

SVM works best if the data are linearly separable; otherwise the classification may give errors. However, we may convert non-linearly separable data into linearly separable data by projecting it into a higher dimension through a function $\phi(x)$:

$$x \mapsto \phi(x)$$

But in practice, we use a kernel function in the model instead of projecting all data into the higher dimension before feeding it into the model:

$$K(x, y) = \phi(x)^T \phi(y)$$

The kernel function is by definition symmetric. The SVM dual with a kernel function becomes

$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le c, \quad \sum_i \alpha_i y_i = 0$$

and the classifier is

$$y = \mathrm{sgn}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right)$$

The following are some common kernels:

Squared kernel $K(x,y) = (x^T y)^2 = (\sum_i x_iy_i)(\sum_j x_jy_j)$. In case of a 3-vector $x = (x_1, x_2, x_3)$, the kernel $K(x,y)=\phi(x)^T\phi(y)$ is equivalent to using the feature map

$$\phi(x) = (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3)^T,$$

but computing $\phi(x)$ is $O(N^2)$ while computing $K(x,y)$ is only $O(N)$.
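This equivalence is easy to verify numerically; the sketch below compares the explicit $O(N^2)$ feature map against the $O(N)$ kernel evaluation:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the squared kernel: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()          # O(N^2) features

def K(x, y):
    """Squared kernel computed directly in O(N)."""
    return (x @ y) ** 2

rng = np.random.default_rng(4)
x, y = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(phi(x) @ phi(y), K(x, y))   # same value, very different cost
```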

Quadratic kernel $K(x,y) = (x^Ty+c)^2 = \sum_i\sum_j(x_ix_j)(y_iy_j) + \sum_i(\sqrt{2c}x_i)(\sqrt{2c}y_i) + c^2$. In case of a 3-vector $x = (x_1, x_2, x_3)$, the kernel $K(x,y)=\phi(x)^T\phi(y)$ is equivalent to using the feature map

$$\phi(x) = (x_1 x_1, x_1 x_2, \dots, x_3 x_3, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, \sqrt{2c}\, x_3, c)^T.$$

Polynomial kernel $K(x,y) = (x^Ty+c)^d$ for some positive integer $d$. This kernel maps vector $x$ into a feature space of monomials of the form $x_{i_1}x_{i_2}\cdots x_{i_d}$ up to order $d$. The feature space is of dimension $\binom{N+d}{d}$ but the kernel computation is still $O(N)$.

Gaussian kernel (a.k.a. radial basis function, RBF) $K(x,y) = \exp\left(-\dfrac{\lVert x-y\rVert^2}{2\sigma^2}\right)$. This can be viewed as an infinite-degree polynomial kernel.

Other common kernels:

• Linear: $K(x,y) = x^Ty$
• Two-layer neural network: $K(x,y) = \tanh(\kappa x^T y + \theta)$

A kernel function can be written as a matrix: for $m$ vectors $x_1, x_2, \cdots, x_m$, we define an $m\times m$ kernel matrix $K = (k_{ij})_{m\times m}$ as

$$k_{ij} = K(x_i, x_j).$$

The kernel matrix $K$ is symmetric and positive semi-definite (i.e., $z^T K z \ge 0$ for all $z$). This is a necessary and sufficient condition for a valid kernel, as shown by Mercer's theorem.
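As an illustration (a sketch; the RBF bandwidth and data are arbitrary), we can build a Gaussian kernel matrix and check symmetry and positive semi-definiteness numerically via its eigenvalues:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian (RBF) kernel."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(5)
xs = rng.standard_normal((10, 3))
K = np.array([[rbf(xi, xj) for xj in xs] for xi in xs])

assert np.allclose(K, K.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-10) # PSD: eigenvalues nonnegative
```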

## Application of kernel trick

Handwritten digit recognition: the input is a 16×16 grid of binary pixels flattened into a 256-vector; even on this raw representation, a polynomial kernel or a Gaussian kernel can give very good performance.

String matching: there are $26^k$ possible strings of length $k$ over a case-insensitive alphabet. For $k=4$, $26^4 \approx 460000$. If we build a model to classify text based on the count of occurrences of every $k$-gram in a long string, the explicit vector $\phi(x)$ becomes intractable. But with a kernel, we can compute $K(x,y)=\phi(x)^T\phi(y)$ efficiently, e.g. by dynamic programming.
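As an illustration of the idea (a sketch: this computes the simpler contiguous $k$-gram "spectrum" kernel using sparse dictionaries, rather than the dynamic-programming substring kernel), the dot product only ever touches $k$-grams that actually occur in both strings, never the full $26^k$-dimensional vector:

```python
from collections import Counter

def kgram_kernel(s, t, k=4):
    """k-spectrum kernel: dot product of k-gram count vectors,
    computed via sparse dictionaries instead of length-26^k vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Only k-grams common to both strings contribute to the dot product.
    return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

# "abcabc" has 3-grams {abc: 2, bca: 1, cab: 1}; "abcd" has {abc: 1, bcd: 1};
# the only shared 3-gram is "abc", contributing 2 * 1.
print(kgram_kernel("abcabc", "abcd", k=3))   # → 2
```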