This is a summary of the math behind the support vector machine. Start from the constrained optimization problem, solving one using Lagrangian function, and application to support vector machine.
Lagrangian function and primal-dual
A general (primal) optimization problem can be prosed as follows:
This is a standard form, using min and .
The corresponding Lagrangian function is
where and are Lagrangian multipliers and as they are corresponding to inequalities upperbounded by 0, and can be positive or negative.
Lemma 1: Any feasible satisfies and the supremum attains at
Proof of Lemma 1: is feasible if and . Thus we have
and the equality holds iff for all .
The above introduced the Karush-Kuhn-Tucker optimality condition. We will use this a lot in solving the Lagrangian function for optimality.
Lemma 2: Optimal value satisfies
Proof of Lemma 2: Any feasible satisfies . Thus by substitution we have .
in some cases, we can swap the operator to have
which we defined the Lagrangian dual function as
The infimum does not require to be in the feasible region. So the corresponding dual problem of constrained optimization is:
and the optimal value is
Lemma 3: The dual problem is always convex
Proof of Lemma 3: By definition of , we have
thus is a pointwise infimum of affline functions of and . Therefore, the dual problem is a maximization of concave function.
We have optimal values from the primal problem and from the dual problem. We always have in minimization. This is called the weak duality. In case of , it has a strong duality.
Examples of deriving Lagrangian dual
Least square solution of linear equation
For vectors and and matrix ,
The Lagrangian function is:
to look for we solve for
Therefore the dual function is
Assume we have a symmetric positive semi-definite matrix
The Lagrangian function is
Deriving the dual function,
Substitute back into ,
If we relax the constraint of , we have a quadratic programming problem with equality constraints
which we can rewrite the above equations into matrix form and solve it:
The general form of quadratically constrained quadratic program:
If we define , , and , then the Lagrangian function is
Substitute back into ,
So the corresponding dual problem is
Binary classification with support vector machine
Traditionally a binary classifier has output of . Assume the input is and output are given for data points. If the data are linearly separable, we look for maximized margin to the hyperplane which
and we use as classifier.
The optimization problem is therefore
The above are called hard margin. If we allow error in classification, we have the soft margin model:
which is the error margin and is the trade off between margin and loss. Here wrong classification is allowed when but we aimed to minimize so to avoid misclassification as much as possible.
The Lagrangian function in this case is therefore
with and . Solving the differential,
Substitute back into to find the dual function:
and the corresponding dual problem:
Solving this dual problem gives , which we can find the optimal . Note that depends only on the data points for which . Those data points are the support vectors of the model which gives the name. To find the offset of the hyperplane, we have for each ,
and we have a different for each . But we usually use the mean of all data in the result:
We prefer to solve a SVM with dual problem as it gives hyperplane parameters in closed form with only dot products.
The complementary slackness condition tells some additional information on the error and the Lagrangian multiplier:
In case of correct classification, , we have and therefore . In this case,
In case of the data point is on the hyperplane, and and . In this case,
In case of some margin error, , we have and . In this case
SVM works best if data are linearly separable. Otherwise the classification may give error. However, we may convert data from non-linearly separable into linearly separable by projecting data into higher dimension, through function :
But in practice, we use kernel function in the model instead of projecting all data into higher dimension before feeding into the model:
The kernel function is by definition symmetric. The SVM with kernel function becomes
and the classifier is
The following are some common kernels:
Squared kernel . In case of a 3-vector , the kernel is equivalent to
but computing is while computing is only .
Quadratic kernel . In case of a 3-vector , the kernel is equivalent to
Polynomial kernel for some positive integer . This kernel maps vector into a feature space of monomials of the form up to order . The feature space is of dimension but the kernel computation is still .
Gaussian kernel (a.k.a. radial basis function, RBF) . This can be viewed as an infinite-degree polynomial kernel.
Other common kernels:
- Two-layer neural network:
Kernel function can be written as a matrix: For vectors , we define a kernel matrix as
The kernel matrix is symmetric and positive semi-definite (i.e., for all ). This is the necessary and sufficient condition for a valid kernel, as proved by the Mercer theorem.
Application of kernel trick
Handwritten digit recognition: Input is 16x16 pixels of binary value transformed into a 256-vector, but a polynomial kernel or Gaussian kernel can give very good performance.
String matching: There are possible strings of length on case-insensitive alphabets. For , . If we build a model to classify text based on count of occurrences of all -gram from a long string, the size of vector will be intractable. But for a kernel, we can compute by dynamic programming.