This is a summary of the math behind the support vector machine: starting from the constrained optimization problem, solving it using the Lagrangian function, and then applying the result to the support vector machine.

Lagrangian function and primal-dual

A general (primal) optimization problem can be posed as follows:

$$\min_x\; f(x) \quad\text{subject to}\quad g_i(x) \le 0,\; i=1,\dots,m, \qquad h_j(x) = 0,\; j=1,\dots,p.$$

This is the standard form, using $\min$ and constraints written as $\le 0$ or $= 0$.

The corresponding Lagrangian function is

$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)$$

where $\lambda_i$ and $\nu_j$ are Lagrangian multipliers. We require $\lambda_i \ge 0$ as they correspond to inequalities upper-bounded by 0, while $\nu_j$ can be positive or negative.

Lemma 1: Any feasible $x$ satisfies $\sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu) = f(x)$, and the supremum is attained at $\lambda_i g_i(x) = 0$ for all $i$.

Proof of Lemma 1: $x$ is feasible if $g_i(x) \le 0$ for all $i$ and $h_j(x) = 0$ for all $j$. Thus we have

$$L(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x) = f(x) + \sum_i \lambda_i g_i(x) \le f(x),$$

and the equality holds iff $\lambda_i g_i(x) = 0$ for all $i$.

The above introduces the complementary slackness condition, one of the Karush-Kuhn-Tucker (KKT) optimality conditions. We will use this a lot in solving the Lagrangian function for optimality.

Lemma 2: The optimal value $p^* = \min_{x\,\text{feasible}} f(x)$ satisfies

$$p^* = \min_x \sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu).$$

Proof of Lemma 2: Any feasible $x$ satisfies $\sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu) = f(x)$ by Lemma 1, while any infeasible $x$ makes the supremum $+\infty$. Thus by substitution we have $\min_x \sup_{\lambda \ge 0,\, \nu} L(x, \lambda, \nu) = \min_{x\,\text{feasible}} f(x) = p^*$.

Duality

In some cases, we can swap the operators to have

$$\max_{\lambda \ge 0,\, \nu}\; \inf_x L(x, \lambda, \nu),$$

for which we define the Lagrangian dual function as

$$d(\lambda, \nu) = \inf_x L(x, \lambda, \nu).$$

The infimum does not require $x$ to be in the feasible region. So the corresponding dual problem of the constrained optimization is:

$$\max_{\lambda, \nu}\; d(\lambda, \nu) \quad\text{subject to}\quad \lambda \ge 0,$$

and the optimal value is $d^* = d(\lambda^*, \nu^*)$.

Lemma 3: The dual problem is always convex.

Proof of Lemma 3: By definition of $d(\lambda, \nu)$, we have

$$d(\lambda, \nu) = \inf_x \left( f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x) \right),$$

thus $d(\lambda, \nu)$ is a pointwise infimum of functions that are affine in $\lambda$ and $\nu$, hence concave. Therefore, the dual problem is a maximization of a concave function, i.e., a convex problem.

We have optimal values $p^*$ from the primal problem and $d^*$ from the dual problem. We always have $d^* \le p^*$ in minimization. This is called weak duality. In case of $d^* = p^*$, we have strong duality.
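
As a quick numerical illustration of weak duality, here is a minimal sketch on a toy problem of my own making (minimize $x^2$ subject to $x \ge 1$, so $p^* = 1$): the dual function has a closed form and its maximum never exceeds the primal optimum.

```python
import numpy as np

# Toy problem (illustration only): minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# The primal optimum is x* = 1 with p* = 1.
p_star = 1.0

# Dual function: d(lam) = inf_x [ x^2 + lam * (1 - x) ] = lam - lam^2 / 4,
# with the infimum attained at x = lam / 2.
lam = np.linspace(0.0, 5.0, 501)
d = lam - lam**2 / 4.0

d_star = d.max()                                    # best dual value on the grid
print(f"d* = {d_star:.4f} <= p* = {p_star:.4f}")    # weak duality
# Here d* = p* = 1 (attained at lam = 2), so strong duality also holds.
```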

Examples of deriving Lagrangian dual

Least square solution of linear equation

For vectors $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$ and matrix $A \in \mathbb{R}^{m \times n}$, the solution of the linear equation $Ax = b$ with the least squared norm is found by

$$\min_x\; x^\top x \quad\text{subject to}\quad Ax = b.$$

The Lagrangian function is:

$$L(x, \nu) = x^\top x + \nu^\top (Ax - b).$$

To look for $\inf_x L(x, \nu)$ we solve for the stationary point:

$$\nabla_x L = 2x + A^\top \nu = 0 \quad\Rightarrow\quad x = -\tfrac{1}{2} A^\top \nu.$$

Therefore the dual function is

$$d(\nu) = -\tfrac{1}{4} \nu^\top A A^\top \nu - b^\top \nu.$$
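
The derivation above can be checked numerically. The following is a small sketch (random instance, variable names of my own choosing) comparing the primal minimum-norm solution with the one recovered from the dual:

```python
import numpy as np

# Verify the dual of  min x^T x  s.t.  Ax = b  on a random underdetermined system.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))          # 3 equations, 5 unknowns
b = rng.normal(size=3)

# Primal optimum: minimum-norm solution via the pseudo-inverse.
x_primal = np.linalg.pinv(A) @ b
p_star = x_primal @ x_primal

# Dual: d(nu) = -1/4 nu^T A A^T nu - b^T nu, maximized where  A A^T nu = -2 b.
nu = np.linalg.solve(A @ A.T, -2.0 * b)
d_star = -0.25 * nu @ (A @ A.T) @ nu - b @ nu

# Recover x from the stationarity condition x = -A^T nu / 2.
x_from_dual = -0.5 * A.T @ nu

print(np.isclose(p_star, d_star))           # strong duality holds for this problem
print(np.allclose(x_primal, x_from_dual))   # same optimizer
```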

Quadratic programming

Assume we have a symmetric positive semi-definite matrix $P$ and consider the quadratic program

$$\min_x\; \tfrac{1}{2} x^\top P x + q^\top x \quad\text{subject to}\quad Ax \le b.$$

The Lagrangian function is

$$L(x, \lambda) = \tfrac{1}{2} x^\top P x + q^\top x + \lambda^\top (Ax - b).$$

Deriving the dual function (assuming $P$ is invertible),

$$\nabla_x L = Px + q + A^\top \lambda = 0 \quad\Rightarrow\quad x = -P^{-1}(q + A^\top \lambda).$$

Substitute back into $L$,

$$d(\lambda) = -\tfrac{1}{2} (q + A^\top \lambda)^\top P^{-1} (q + A^\top \lambda) - \lambda^\top b.$$

If we relax the constraint of $\lambda \ge 0$, i.e., replace the inequality $Ax \le b$ with the equality $Ax = b$, we have a quadratic programming problem with equality constraints

$$\min_x\; \tfrac{1}{2} x^\top P x + q^\top x \quad\text{subject to}\quad Ax = b.$$

The optimality conditions are then

$$Px + q + A^\top \nu = 0, \qquad Ax = b,$$

which we can rewrite into matrix form and solve:

$$\begin{bmatrix} P & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \nu \end{bmatrix} = \begin{bmatrix} -q \\ b \end{bmatrix}.$$
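
The KKT system above is just a linear solve. A minimal sketch (random instance, variable names are my own) in Python:

```python
import numpy as np

# Solve  min 1/2 x^T P x + q^T x  s.t.  A x = b  via the KKT system
#   [ P  A^T ] [ x  ]   [ -q ]
#   [ A   0  ] [ nu ] = [  b ]
rng = np.random.default_rng(1)
n, m = 4, 2
M = rng.normal(size=(n, n))
P = M @ M.T + np.eye(n)                  # symmetric positive definite
q = rng.normal(size=n)
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]

print(np.allclose(A @ x, b))                     # feasibility
print(np.allclose(P @ x + q + A.T @ nu, 0.0))    # stationarity
```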

QCQP

The general form of a quadratically constrained quadratic program is:

$$\min_x\; \tfrac{1}{2} x^\top P_0 x + q_0^\top x + r_0 \quad\text{subject to}\quad \tfrac{1}{2} x^\top P_i x + q_i^\top x + r_i \le 0,\; i = 1, \dots, m,$$

with each $P_i$ symmetric positive semi-definite.

If we define $P(\lambda) = P_0 + \sum_i \lambda_i P_i$, $q(\lambda) = q_0 + \sum_i \lambda_i q_i$, and $r(\lambda) = r_0 + \sum_i \lambda_i r_i$, then the Lagrangian function is

$$L(x, \lambda) = \tfrac{1}{2} x^\top P(\lambda) x + q(\lambda)^\top x + r(\lambda).$$

Then,

$$\nabla_x L = P(\lambda) x + q(\lambda) = 0 \quad\Rightarrow\quad x = -P(\lambda)^{-1} q(\lambda).$$

Substitute back into $L$,

$$d(\lambda) = -\tfrac{1}{2} q(\lambda)^\top P(\lambda)^{-1} q(\lambda) + r(\lambda).$$

So the corresponding dual problem is

$$\max_\lambda\; -\tfrac{1}{2} q(\lambda)^\top P(\lambda)^{-1} q(\lambda) + r(\lambda) \quad\text{subject to}\quad \lambda \ge 0.$$
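
To make the QCQP dual concrete, here is a sketch on a tiny instance of my own making (minimize $\tfrac{1}{2}\|x\|^2 - 2x_1$ subject to $\|x\|^2 \le 1$, whose primal optimum is $p^* = -1.5$ at $x = (1, 0)$); maximizing $d(\lambda)$ over a grid recovers the same value:

```python
import numpy as np

# Toy QCQP:  min 1/2 x^T P0 x + q0^T x + r0   s.t.   1/2 x^T P1 x + q1^T x + r1 <= 0
P0, q0, r0 = np.eye(2), np.array([-2.0, 0.0]), 0.0   # objective: 1/2 ||x||^2 - 2 x1
P1, q1, r1 = np.eye(2), np.zeros(2), -0.5            # constraint: ||x||^2 <= 1

def dual(lam):
    """d(lam) = -1/2 q(lam)^T P(lam)^{-1} q(lam) + r(lam), defined for lam >= 0."""
    P = P0 + lam * P1
    q = q0 + lam * q1
    r = r0 + lam * r1
    return -0.5 * q @ np.linalg.solve(P, q) + r

lams = np.linspace(0.0, 10.0, 2001)
d_vals = np.array([dual(l) for l in lams])
i = d_vals.argmax()
print(f"best dual value {d_vals[i]:.4f} at lambda = {lams[i]:.3f}")  # ~ -1.5 at lambda = 1
```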

Binary classification with support vector machine

Traditionally a binary classifier has output of $y \in \{-1, +1\}$. Assume the inputs $x^{(i)} \in \mathbb{R}^n$ and outputs $y^{(i)} \in \{-1, +1\}$ are given for $N$ data points. If the data are linearly separable, we look for the hyperplane $w^\top x + b = 0$ with maximized margin, for which

$$y^{(i)} (w^\top x^{(i)} + b) \ge 1 \quad\text{for all } i,$$

and we use $f(x) = \operatorname{sign}(w^\top x + b)$ as the classifier.

The optimization problem is therefore

$$\min_{w, b}\; \tfrac{1}{2} \|w\|^2 \quad\text{subject to}\quad y^{(i)} (w^\top x^{(i)} + b) \ge 1,\; i = 1, \dots, N.$$

The above is called the hard margin model. If we allow error in classification, we have the soft margin model:

$$\min_{w, b, \xi}\; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad\text{subject to}\quad y^{(i)} (w^\top x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$

where $\xi_i$ is the error margin (slack) and $C > 0$ is the trade-off between margin and loss. Here wrong classification is allowed when $\xi_i > 1$, but we aim to minimize $\sum_i \xi_i$ so as to avoid misclassification as much as possible.

The Lagrangian function in this case is therefore

$$L(w, b, \xi, \alpha, \mu) = \tfrac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 + \xi_i \right] - \sum_i \mu_i \xi_i$$

with $\alpha_i \ge 0$ and $\mu_i \ge 0$. Setting the derivatives to zero,

$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y^{(i)} x^{(i)} = 0, \qquad \frac{\partial L}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0, \qquad \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0.$$

Substitute back into $L$ to find the dual function:

$$d(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^\top x^{(j)},$$

and the corresponding dual problem:

$$\max_\alpha\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^\top x^{(j)} \quad\text{subject to}\quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y^{(i)} = 0.$$

Solving this dual problem gives $\alpha$, from which we can find the optimal $w = \sum_i \alpha_i y^{(i)} x^{(i)}$. Note that $w$ depends only on the data points for which $\alpha_i > 0$. Those data points are the support vectors of the model, which gives it its name. To find the offset $b$ of the hyperplane, we have for each support vector $x^{(i)}$ with $0 < \alpha_i < C$,

$$b = y^{(i)} - w^\top x^{(i)},$$

and we may get a slightly different $b$ for each such $i$. But we usually use the mean over all of them in the result:

$$b = \frac{1}{|S|} \sum_{i \in S} \left( y^{(i)} - w^\top x^{(i)} \right), \qquad S = \{ i : 0 < \alpha_i < C \}.$$

We prefer to solve an SVM via the dual problem as it gives the hyperplane parameters in closed form using only dot products between data points.
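
As a concrete illustration, the dual above is a box-constrained QP with one equality constraint, so a generic solver can handle small instances. The sketch below (toy data and constant $C$ of my own choosing, solved with SciPy's SLSQP rather than a dedicated SVM solver) recovers $\alpha$, $w$ and $b$ exactly as described:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set.
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],          # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
C, N = 1.0, len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)       # G_ij = y_i y_j x_i . x_j

def neg_dual(a):
    # Negative of d(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,                               # 0 <= alpha_i <= C
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
sv = (alpha > 1e-6) & (alpha < C - 1e-6)         # margin support vectors
b = np.mean(y[sv] - X[sv] @ w)                   # average b over them
print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", round(b, 3))
print("predictions:", np.sign(X @ w + b))
```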

The complementary slackness conditions $\alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 + \xi_i \right] = 0$ and $\mu_i \xi_i = 0$ tell us some additional information about the error $\xi_i$ and the Lagrangian multipliers (a numerical check of the three cases is sketched after this list):

  • In case of correct classification beyond the margin, $y^{(i)} (w^\top x^{(i)} + b) > 1$, we have $\xi_i = 0$ and therefore $\alpha_i = 0$. In this case, the data point is not a support vector and does not contribute to $w$.

  • In case the data point is exactly on the margin hyperplane, $y^{(i)} (w^\top x^{(i)} + b) = 1$, we have $\xi_i = 0$, $0 < \alpha_i < C$ and $\mu_i = C - \alpha_i > 0$. In this case, the data point is a margin support vector and can be used to compute $b$.

  • In case of some margin error, $\xi_i > 0$, we have $\mu_i = 0$ and therefore $\alpha_i = C$. In this case, the data point lies inside the margin or is misclassified (misclassified when $\xi_i > 1$).
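
A quick way to see these three cases numerically is to fit a linear soft-margin SVM with scikit-learn (an assumption here: the `SVC` estimator and toy Gaussian blobs of my own making) and bucket the training points by their $\alpha_i$:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two overlapping Gaussian blobs, so margin violations occur.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(20, 2)),
               rng.normal(-1.5, 1.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
margin = y * clf.decision_function(X)            # y_i (w^T x_i + b)

# alpha_i is nonzero only for support vectors; sklearn stores alpha_i * y_i in dual_coef_.
alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_[0])

print("alpha = 0 (beyond the margin):", int(np.sum(alpha < 1e-8)))
print("0 < alpha < C (on the margin):", int(np.sum((alpha > 1e-8) & (alpha < C - 1e-8))))
print("alpha = C (margin violations):", int(np.sum(np.isclose(alpha, C))))
print(np.all(margin[alpha < 1e-8] >= 1 - 1e-6))  # points with alpha = 0 satisfy y f(x) >= 1
```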

Kernel tricks

SVM works best if the data are linearly separable; otherwise the classification may give errors. However, we may convert data from non-linearly separable into linearly separable by projecting the data into a higher dimension through a feature map $\phi: \mathbb{R}^n \to \mathbb{R}^d$ with $d > n$:

$$x \mapsto \phi(x).$$

But in practice, we use a kernel function in the model instead of projecting all data into the higher dimension before feeding it into the model:

$$K(x, z) = \phi(x)^\top \phi(z).$$

The kernel function is by definition symmetric. The SVM with a kernel function becomes

$$\max_\alpha\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)}) \quad\text{subject to}\quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y^{(i)} = 0,$$

and the classifier is

$$f(x) = \operatorname{sign}\left( \sum_i \alpha_i y^{(i)} K(x^{(i)}, x) + b \right).$$

The following are some common kernels:

Squared kernel $K(x, z) = (x^\top z)^2$. In the case of a 3-vector $x = (x_1, x_2, x_3)^\top$, the kernel is equivalent to the feature map

$$\phi(x) = (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3)^\top,$$

but computing $\phi(x)^\top \phi(z)$ is $O(n^2)$ while computing $K(x, z)$ is only $O(n)$ (a numerical check of this equivalence is sketched after the kernel list below).

Quadratic kernel $K(x, z) = (x^\top z + c)^2$. In the case of a 3-vector $x$, the kernel is equivalent to the feature map

$$\phi(x) = (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, x_2 x_2, x_2 x_3, x_3 x_1, x_3 x_2, x_3 x_3, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, \sqrt{2c}\, x_3, c)^\top.$$

Polynomial kernel $K(x, z) = (x^\top z + c)^d$ for some positive integer $d$. This kernel maps a vector $x \in \mathbb{R}^n$ into a feature space of monomials of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$ up to order $d$. The feature space is of dimension $\binom{n+d}{d}$ but the kernel computation is still $O(n)$.

Gaussian kernel (a.k.a. radial basis function, RBF) $K(x, z) = \exp\left( -\dfrac{\|x - z\|^2}{2\sigma^2} \right)$. This can be viewed as a polynomial kernel of infinite degree, i.e., an infinite-dimensional feature space.

Other common kernels:

  • Linear: $K(x, z) = x^\top z$
  • Two-layer neural network (sigmoid): $K(x, z) = \tanh(\kappa\, x^\top z + \theta)$
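
The claim above that the squared kernel matches the explicit degree-2 feature map is easy to verify numerically; a minimal sketch:

```python
import numpy as np

# Check that (x.z)^2 equals phi(x).phi(z) for the explicit feature map
# phi(v) = vec(v v^T), here for 3-vectors.
rng = np.random.default_rng(2)
x, z = rng.normal(size=3), rng.normal(size=3)

phi = lambda v: np.outer(v, v).ravel()   # 9 features v_i v_j; O(n^2) to build
lhs = phi(x) @ phi(z)                    # explicit feature space, O(n^2)
rhs = (x @ z) ** 2                       # kernel trick, O(n)
print(np.isclose(lhs, rhs))              # True
```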

A kernel function can be written as a matrix: for vectors $x^{(1)}, \dots, x^{(N)}$, we define the kernel matrix $K \in \mathbb{R}^{N \times N}$ as

$$K_{ij} = K(x^{(i)}, x^{(j)}).$$

The kernel matrix is symmetric and positive semi-definite (i.e., $z^\top K z \ge 0$ for all $z \in \mathbb{R}^N$). This is the necessary and sufficient condition for a valid kernel, as established by Mercer's theorem.
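
A small sketch to illustrate this property: build a Gaussian kernel matrix from random points and confirm it is symmetric with non-negative eigenvalues (up to round-off).

```python
import numpy as np

# Gaussian (RBF) kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
sigma = 1.0

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # positive semi-definite (up to round-off)
```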

Application of kernel trick

Handwritten digit recognition: The input is 16x16 pixels of binary values flattened into a 256-vector, yet a polynomial kernel or Gaussian kernel can give very good performance.

String matching: There are $26^k$ possible strings of length $k$ over a case-insensitive alphabet. For $k = 4$, $26^4 = 456{,}976$. If we build a model to classify text based on the count of occurrences of every $k$-gram in a long string, the size of the feature vector $\phi(x) \in \mathbb{R}^{26^k}$ quickly becomes intractable. But for a kernel, we can compute $K(x, z) = \phi(x)^\top \phi(z)$ efficiently, e.g., by dynamic programming.
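
As a sketch of the idea (using simple $k$-gram counting with a hash map rather than the dynamic-programming formulation mentioned above; the function name and test strings are my own), the kernel value can be computed while only ever storing the $k$-grams that actually occur:

```python
from collections import Counter

def kgram_kernel(s: str, t: str, k: int = 4) -> int:
    """K(s, t) = sum over all k-grams g of count_s(g) * count_t(g)."""
    s, t = s.lower(), t.lower()                  # case-insensitive alphabet
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    if len(ct) < len(cs):                        # iterate over the smaller dictionary
        cs, ct = ct, cs
    # Cost is roughly O(len(s) + len(t)), never O(26^k).
    return sum(c * ct[g] for g, c in cs.items())

print(kgram_kernel("The quick brown fox", "The quick brown dog"))
```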