
Programming A Driverless Car Chapter 8: Support Vector Machine

Support Vector Machine
In this chapter we learn how to find the best decision boundary dividing two classes of data. The boundary is decided by the outermost points of each class, which we call support vectors: the points nearest to the opposite class determine where the decision boundary should lie.
For an unknown sample x, the decision boundary is the hyperplane w.x + b = 0:
Positive side: w.x + b >= 0
Negative side: w.x + b < 0
For known (training) samples we demand a margin:
Positive class: w.x + b >= 1
Negative class: w.x + b <= -1
Introducing the label yi, with yi = +1 for the positive class and yi = -1 for the negative class, the two constraints combine into a single condition that holds for every sample in the data set:
yi(w.xi + b) >= 1, i.e. yi(w.xi + b) - 1 >= 0
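To make the decision rule concrete, here is a minimal sketch in Python (the weight vector w, the bias b, and the sample points are made up purely for illustration): a sample is classified by the sign of w.x + b.

import numpy as np

# hypothetical weights and bias for a 2-D toy problem
w = np.array([1.0, -1.0])
b = -0.5

def classify(x):
    # decision rule: the sign of w.x + b
    return +1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([3.0, 1.0])))  # w.x + b = 1.5, so +1
print(classify(np.array([1.0, 3.0])))  # w.x + b = -2.5, so -1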

Our main goal is to maximize the distance between the two margin hyperplanes w.x + b = 1 and w.x + b = -1, which lie on either side of the decision boundary w.x + b = 0. If x(-) is a support vector on the negative margin and x(+) is a support vector on the positive margin, the margin width is the projection of the difference x(+) - x(-) onto the unit normal w/||w||, which works out to 2/||w||. Maximizing the margin therefore means minimizing ||w||.
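As a quick numeric check (with a hypothetical weight vector chosen only for illustration), the margin width 2/||w|| can be computed directly:

import numpy as np

w = np.array([3.0, 4.0])        # hypothetical weight vector with ||w|| = 5
margin = 2 / np.linalg.norm(w)  # width of the margin between the two hyperplanes
print(margin)                   # 0.4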
Decision Function for Binary Classification:
The decision function is f(x) = sign(w.x + b): a positive value marks a positive sample and a negative value a negative sample.
Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

Feature spaces:

Binary Classification can be viewed as the task of separating classes in feature space:

We also have to decide which hyperplane is the best one we can get.
Here the blue one is the best hyperplane. We decide this from the outermost points of each class, which we term support vectors.
The red dots are the support vectors, and the margin is measured as the distance between the hyperplanes passing through them.
Large Margin Linear Classifier:

Classification Margin:

Maximizing Margin Classification:

A skinny margin allows a more flexible but more complex classifier; a fat margin gives a simpler one that generalizes more easily.
Linear SVM Mathematically:
Combining the margin width 2/||w|| with the constraints yi(w.xi + b) >= 1 gives the optimization problem:
minimize (1/2)||w||^2 subject to yi(w.xi + b) >= 1 for all i
So we need to optimize a quadratic function subject to linear constraints.
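In practice a library solves this quadratic program for us. Here is a minimal sketch (the six toy points are made up): a linear SVC recovers w and b, from which we can read off the margin 2/||w|| and the support vectors.

import numpy as np
from sklearn import svm

# made-up, linearly separable toy data
x = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

alg = svm.SVC(kernel='linear', C=1e3)  # a large C approximates a hard margin
alg.fit(x, y)

w = alg.coef_[0]       # learned weight vector
b = alg.intercept_[0]  # learned bias
print(w, b)
print(2 / np.linalg.norm(w))  # margin width
print(alg.support_vectors_)   # the points that define the hyperplane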

Soft Margin Classification:
What if the training set is not linearly separable? Then we introduce slack variables ξi, which allow misclassification of difficult or noisy examples.
The new formula will be:
minimize (1/2)||w||^2 + C Σ ξi subject to yi(w.xi + b) >= 1 - ξi and ξi >= 0 for all i
Parameter C can be viewed as a way to control overfitting.
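A small sketch of this trade-off (the overlapping clusters below are generated just for illustration): a small C tolerates misclassification and keeps a wide margin with many support vectors, while a large C penalizes every error and narrows the margin.

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
# two noisy, overlapping clusters
x = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    alg = svm.SVC(kernel='linear', C=C).fit(x, y)
    # a larger C usually means a narrower margin and fewer support vectors
    print(C, len(alg.support_vectors_))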


Soft vs Hard Margin:
A soft margin always has a solution.
A soft margin is more robust to outliers and gives smoother surfaces (in the non-linear case).
A hard margin does not require guessing the cost parameter (it requires no parameter at all).
Theoretical Justification for Maximum Margins:
Vapnik proved that the class of optimal linear separators has VC dimension h bounded from above as
h <= min(⌈D^2/δ^2⌉, m0) + 1
where δ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.

In linear SVMs, the solution w is a linear combination of the support vectors, so classifying a new point only requires dot products between that point and the support vectors.

SVM Applications:
SVMs are widely used for text categorization, image classification, handwriting and character recognition, and bioinformatics problems such as cancer classification; the labs below work through cancer recognition and character recognition.

Non-linear Support Vector Machine
When the classes are not linearly separable in the input space, we first map the data into a higher-dimensional feature space. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space.
The transformation may be non-linear and the transformed space high-dimensional; although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.
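As a minimal sketch of this idea (the ring-shaped toy data below is generated just for illustration), an RBF-kernel SVC separates classes that no straight line in the input space could:

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
# made-up data: an inner cluster (class 0) surrounded by an outer ring (class 1)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.randn(100, 2) * 0.3
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.randn(100, 2) * 0.3
x = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

alg = svm.SVC(kernel='rbf', gamma=0.5).fit(x, y)
print(alg.score(x, y))                        # near 1.0: the classes separate
print(alg.predict([[0.1, 0.0], [3.0, 0.0]]))  # expected: [0 1]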


Kernels:
A kernel function K(xi, xj) returns the dot product of two samples in the transformed feature space without ever computing the transformation explicitly.
Examples of kernels (the three used in the lab below):
Linear: K(xi, xj) = xi.xj
Polynomial: K(xi, xj) = (xi.xj + 1)^d
RBF (Gaussian): K(xi, xj) = exp(-γ||xi - xj||^2)
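To see why the kernel avoids the explicit transformation, here is a small check (the two vectors are made up) that the degree-2 polynomial kernel (a.b + 1)^2 equals an ordinary dot product after one standard explicit feature map:

import numpy as np

def phi(v):
    # one standard explicit degree-2 feature map for 2-D input
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def poly_kernel(a, b):
    return (np.dot(a, b) + 1) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(poly_kernel(a, b))       # 144.0
print(np.dot(phi(a), phi(b)))  # 144.0 -- same value, no explicit map needed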

LAB: Support Vector Regression
import numpy
from sklearn import svm
import matplotlib.pyplot as plt
# sample data: 40 points along a sine curve
x = numpy.sort(8 * numpy.random.rand(40, 1), axis=0)
y = numpy.sin(x).ravel()
# add some noise to every fifth value of y
y[::5] += 3 * (0.5 - numpy.random.rand(8))
# defining the algorithms
svr_rbf = svm.SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = svm.SVR(kernel='linear', C=1e3)
svr_poly = svm.SVR(kernel='poly', C=1e3, degree=2)
# fit each algorithm and use it to predict y
y_rbf = svr_rbf.fit(x, y).predict(x)
y_lin = svr_lin.fit(x, y).predict(x)
y_poly = svr_poly.fit(x, y).predict(x)
lw = 2
plt.scatter(x, y, color='orange', label='data')
plt.plot(x, y_rbf, color='navy', lw=lw, label='RBF Model')
plt.plot(x, y_lin, color='blue', lw=lw, label='Linear Model')
plt.plot(x, y_poly, color='green', lw=lw, label='Polynomial Model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()
OUTPUT:
The data set and the three fitted curves are plotted as a graph. The RBF model fits the noisy sine data much better than the linear and polynomial models.

LAB: Cancer Recognition
We will work on the practical problem of cancer recognition, using the Wisconsin breast cancer data set available online.

Data set values
import numpy
import pandas
from sklearn import svm
from sklearn.model_selection import train_test_split
data = pandas.read_csv(r'C:\Users\Robot\Documents\cancerdata\breast-cancer-wisconsin.data.txt')
# name the attributes as given on the website
data.columns = ['Id', 'Clump Thickness', 'Cell Size', 'Cell Shape', 'Marginal Adhesion',
                'Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']
data.drop(['Id'], axis=1, inplace=True)
# the data set marks missing values with '?'; replace them with an outlier value
data.replace('?', -9999, inplace=True)
x = numpy.array(data.drop(['Class'], axis=1))
y = numpy.array(data['Class'])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
alg = svm.SVC()
alg.fit(xtrain, ytrain)
accuracy = alg.score(xtest, ytest)
print(accuracy)
OUTPUT:
We get accuracy = 0.9785714 (the exact value varies with the random train/test split).
But we would like even more accuracy.
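One common way to push the accuracy higher is to tune C and gamma with cross-validated grid search. A minimal sketch (the grid values below are illustrative, not tuned for this data set; xtrain, xtest, ytrain, ytest come from the code above):

from sklearn.model_selection import GridSearchCV
from sklearn import svm

grid = GridSearchCV(svm.SVC(),
                    {'C': [0.1, 1, 10, 100], 'gamma': ['scale', 0.001, 0.01]},
                    cv=5)
grid.fit(xtrain, ytrain)
print(grid.best_params_)         # best parameter combination found
print(grid.score(xtest, ytest))  # accuracy of the tuned model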
# for prediction, take the data of one patient
patient = numpy.array([4, 2, 4, 2, 1, 1, 3, 2, 3])
patient = patient.reshape(1, -1)
prediction = alg.predict(patient)
print(prediction)
OUTPUT: [2] (in this data set, class 2 means benign and class 4 means malignant)
Character Recognition using SVM
>>> import matplotlib
>>> import sklearn
No error means both libraries are installed.
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import datasets
digitsdata = datasets.load_digits()
# it contains three fields: the matrix of pixel values, the target values, and the images
print(digitsdata.data)
print(digitsdata.target)
OUTPUT:
The matrix of pixel values and the target labels are printed.
——————————————————————————————————–
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import datasets
digitsdata = datasets.load_digits()
# it contains three fields: the matrix of pixel values, the target values, and the images
print(digitsdata.data)
print(digitsdata.target)
# display the image of the sixth sample
plt.imshow(digitsdata.images[5], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

—————————————————————————————————————
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import datasets
digitsdata = datasets.load_digits()
# it contains three fields: the matrix of pixel values, the target values, and the images
print(digitsdata.data)
print(digitsdata.target)
# train the algorithm on the first 100 samples
alg = svm.SVC(gamma=0.001, C=100)
X, Y = digitsdata.data[:100], digitsdata.target[:100]
alg.fit(X, Y)
print(len(digitsdata.target))
# predict an unseen sample and display its image
print(alg.predict(digitsdata.data[120].reshape(1, -1)))
plt.imshow(digitsdata.images[120], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
We have 1797 samples in the data set.
[5]
The prediction is correct. Keep checking with various index values.
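To measure accuracy on unseen digits rather than checking one index at a time, a small sketch (the 80/20 split is an arbitrary choice) evaluates the classifier on a held-out test set:

from sklearn.model_selection import train_test_split
from sklearn import svm, datasets

digitsdata = datasets.load_digits()
xtrain, xtest, ytrain, ytest = train_test_split(
    digitsdata.data, digitsdata.target, test_size=0.2)
alg = svm.SVC(gamma=0.001, C=100).fit(xtrain, ytrain)
print(alg.score(xtest, ytest))  # typically around 0.99 on this data set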
 
Recommended Reading: Programming A Driverless Car Chapter 7: Natural Language Processing
Next: Chapter 9:
To start reading from a topic of your choice, you can go back to the Table of Contents here
This course evolves with your active feedback. Do let us know your feedback in the comments section below.
