Support Vector Machines

Shubham Saket
2 min read · Aug 16, 2019

The mathematics behind SVMs

Prerequisites:
1. Some knowledge of vectors and linear algebra.
2. Lagrange Multipliers
3. Analytical and learning attitude

Analogy:

The main concept or intuition behind SVM is to find the widest possible separation (the margin) between two classes, shown here as red and green dots.

Let ‘m’ be the hyper-plane separating the two classes and ‘w’ be a vector perpendicular to that hyper-plane.
Let ‘xr’ be a vector representing one of the red samples in the picture above.
Let ‘xg’ be a vector representing one of the green samples in the picture above.
For an unknown sample u we take its projection onto w and assume that this projection must be greater than some value c for u to be a positive (red) sample.

Our decision condition becomes:
w.u ≥ c
w.u + b ≥ 0 where b = -c
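
As a quick illustration, here is a minimal sketch of this decision rule in Python; the values of w and b below are made-up numbers for illustration, not parameters learned from data.

```python
import numpy as np

# Hypothetical hyper-plane parameters (not learned from any data).
w = np.array([2.0, 1.0])   # vector perpendicular to the hyper-plane
b = -4.0                   # offset, b = -c

def is_positive(u):
    """Return True if the unknown sample u lies on the positive (red) side."""
    return np.dot(w, u) + b >= 0

print(is_positive(np.array([3.0, 2.0])))  # w.u + b =  4.0 >= 0 -> True
print(is_positive(np.array([1.0, 0.5])))  # w.u + b = -1.5 <  0 -> False
```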

Constraint:
For every red sample xr: w.xr + b ≥ 1;
For every green sample xg: w.xg + b ≤ -1.
The two can be combined into a single constraint y.(w.x + b) ≥ 1, where y = 1 for red samples and y = -1 for green samples.
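
The y = ±1 labelling is what lets the two inequalities collapse into one, which is easy to check numerically; the points, labels, w and b below are again hypothetical.

```python
import numpy as np

# Hypothetical hyper-plane and a few labelled training points.
w, b = np.array([2.0, 1.0]), -4.0
samples = [
    (np.array([3.0, 2.0]),  1),   # red sample,   w.x + b =  4.0
    (np.array([4.0, 1.0]),  1),   # red sample,   w.x + b =  5.0
    (np.array([1.0, 0.5]), -1),   # green sample, w.x + b = -1.5
]

# y * (w.x + b) >= 1 must hold for every training point of a valid margin.
for x, y in samples:
    print(y * (np.dot(w, x) + b) >= 1)   # True, True, True
```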

Equation for width

The width of the margin can be represented as (xr - xg).w/|w|, i.e. the dot product of the difference (xr - xg) with the unit vector perpendicular to the hyper-plane.
If we maximise this width, we are done.

Boundary conditions, i.e. for xr and xg lying exactly on the margin:

For xr (y = 1):
y(w.xr + b) = 1
w.xr + b = 1
w.xr = 1 - b

For xg (y = -1):
y(w.xg + b) = 1
-w.xg - b = 1
-w.xg = 1 + b

width = (xr - xg).w/|w|
width = (w.xr - w.xg)/|w| = ((1 - b) + (1 + b))/|w|
width = 2/|w|
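
This algebra can be sanity-checked numerically: take any w and b, pick one point on each margin, and the projected distance comes out to 2/|w|. The numbers below are arbitrary.

```python
import numpy as np

w, b = np.array([2.0, 1.0]), -4.0

# One point on each margin: w.xr + b = 1 and w.xg + b = -1.
xr = np.array([2.5, 0.0])   # 2*2.5 + 0 - 4 =  1
xg = np.array([1.5, 0.0])   # 2*1.5 + 0 - 4 = -1

width_projection = np.dot(xr - xg, w) / np.linalg.norm(w)
width_formula    = 2 / np.linalg.norm(w)
print(np.isclose(width_projection, width_formula))   # True
```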

Now, maximising the width means minimising |w|, which is equivalent to minimising 0.5 * |w|².
(The switch to 0.5 * |w|² is made because it is more convenient in the calculations that follow.)
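
To make the objective concrete, here is a sketch that solves the primal problem min 0.5 * |w|² subject to y(w.x + b) ≥ 1 on a tiny made-up dataset with scipy's general-purpose optimiser. This is only a numerical sanity check of the formulation; real SVM solvers work on the dual problem derived below.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up, linearly separable dataset: rows of X with labels +1 / -1.
X = np.array([[3.0, 2.0], [4.0, 1.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, -1, -1])

def objective(p):                 # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * np.dot(w, w)     # 0.5 * |w|^2

# One inequality constraint y*(w.x + b) - 1 >= 0 per training point.
constraints = [
    {"type": "ineq", "fun": lambda p, x=x, t=t: t * (np.dot(p[:2], x) + p[2]) - 1}
    for x, t in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w_opt, b_opt = res.x[:2], res.x[2]
print(w_opt, b_opt, 2 / np.linalg.norm(w_opt))   # learned w, b and margin width
```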

Lagrange Multipliers

Using Lagrange multipliers we find the maximum width, i.e. we minimise 0.5 * |w|²
subject to the constraint yi(w.xi + b) - 1 ≥ 0 for every sample xi.

L = 0.5 * |w|² - Σ ai.[yi(w.xi + b) - 1]

Differentiating L w.r.t. w and equating it to zero, we get:
w = Σ ai.yi.xi

Differentiating L w.r.t. b and equating it to zero, we get:
Σ ai.yi = 0

Substituting w back, the decision condition becomes:
Σ ai.yi.(xi.u) + b ≥ 0 ⇒ classify u as red.
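
These two results can be checked against a library implementation. The sketch below (assuming scikit-learn is available) fits a linear SVC with a large C to approximate the hard margin on a made-up dataset, then verifies that w = Σ ai.yi.xi and that the decision value for an unknown u is Σ ai.yi.(xi.u) + b; sklearn's dual_coef_ stores the products ai*yi for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable data; a large C approximates the hard margin.
X = np.array([[3.0, 2.0], [4.0, 1.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds ai*yi for the support vectors, so summing
# dual_coef_ * support_vectors_ reconstructs w = sum(ai*yi*xi).
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))            # True

# Decision value sum(ai*yi*(xi.u)) + b for an unknown point u.
u = np.array([[2.0, 2.0]])
manual = clf.dual_coef_ @ clf.support_vectors_ @ u.T + clf.intercept_
print(np.allclose(manual, clf.decision_function(u)))  # True
```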

Kernel Trick

A dataset that is linearly separable is rarely encountered in practice. To overcome this, the data is mapped into a higher-dimensional space in which it becomes linearly separable, and doing this implicitly, through dot products, is known as the kernel trick.

We have seen that the classifier depends only on dot products of data-point vectors.
A kernel is a function that replaces xi.xj with Φ(xi).Φ(xj), i.e. K(xi, xj) = Φ(xi).Φ(xj), without ever computing Φ explicitly.
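
As a concrete example, a degree-2 polynomial kernel K(x, z) = (x.z)² equals the dot product of an explicit quadratic feature map Φ, so the classifier can work in that higher-dimensional space without ever building Φ(x). The vectors below are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input: all degree-2 monomials."""
    return np.array([x[0] * x[0],
                     np.sqrt(2) * x[0] * x[1],
                     x[1] * x[1]])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: computes phi(x).phi(z) without building phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.isclose(np.dot(phi(x), phi(z)), poly_kernel(x, z)))   # True
```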

In case of queries please feel free to reach me here.

