# Programming A Driverless Car Chapter 8: Support Vector Machine

Support vector Machine
We will learn here which is the best decision boundary dividing between data sets. The best decision is taken based on the outermost points which we call support vectors. The most outermost or the nearest one decide where should the decision boundary lie.
For unknown:
Positive vector- w.x+b >=0
Negative vector- w.x+b <=0
Decision Rule- w.x+b =0
For known:
Positive Class- w.x+b >=1
Negative Class- w.x+b <= -1
yi   yi= +1 — positive class
yi= -1 –negative class
Final equation for a particular data set will be: y(w.x+b)>= 1
y(w.x+b ) -1>=0

Our main motto is to maximize the distance between the two red lines. If it is done, which has equation: w.x+b =0, maximize the value of w. Then we need to find out margin value.
If the first line is x- and the second one is x+
Then the difference will be x(+)- (x-)
Decision Function for binary Classification:

First one is positive sample and second one is negative sample.
Support Vector Machine(SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data(supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.

• SVM picks the best separating hyperplane according to criterion. Ex. maximum margin
• Training process is an optimisation
• Training set is effectively reduced to a relatively small number of support vectors.

Feature spaces:

• We amy separate data by mapping to a higher dimensional feature space
• The feature space may even have an infinite number of dimensions
• We need not explicitly construct the new feature space

Binary Classification can be viewed as the task of separating classes in feature space:

We have to also decide which hyperplane is the best that we can get?
Here:

Here the blue one is the best hyperplane. We decide this by the outermost points lying which we term as support vectors.

These red dots are the support vectors and the difference between their distances is measured by the hyperplanes.
Large Margin Linear Classifier:

Classification Margin:

• Distance from example data to the separator is:
• Data closest to the hyperplane are support vectors.
• Margin of the separator is the width of separation between classes

Maximizing Margin Classification:

• Maximizing the margin is good according to intuition and theory
• Implies that only support vectors are important, other training examples are ignorable.

Skinny margins are flexible but complex. Fat margin is more easy.
Linear SVM Mathematically:

• Assuming that all the data is at distance larger than 1 from the hyperplane, the following two constraints follow for a training set {(x(i),y(i))}
• For support vectors, the inequality becomes an equality.
• Then we can formulate the quadratic optimization problem:

Need to optimize a quadratic function subject to linear constraints

• Quadratic optimization problems are well known class of mathematical programming problems,and many algorithms exist for solving them.
• The solution involves constructing a dual problem where a lagrange multiplier is associated with every constraint in the primary problem

• The solution has the form:

• Each non zero alpha(i) indicates that corresponding x(i) is a support vector.
• Then the classifying function will have the form:

• Notice that it relies on an inner product between the test points x and the support vectors x(i)

Margin Classification:
What if training set is not linearly separable?
Then we take Slack Variable which can allow misclassification of difficult or noisy examples.

The new formula will be:

Parameter C can be viewed as a way to control overfitting

• The dual problem for soft margin classification:

• Neither Slack variables nor their lagrange multipliers appear in the dual problem
• Again x(i) with non zero will be support vectors
• Solution to dual problems:

Soft vs Hard margin:
Soft margin always have a solution.
Soft margin is more robust to outliers-smoother surfaces(in the non-linear case)
Hard margin does not require to guess the cost parameter( requires no parameter at all)
Theoretical Justification for Maximum Margins:

• Vapnik has proved:
The class of optimal linear separators has VC dimension h bounded from above as:

where δ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples,and m0 is the dimensionality.

• Intuitively, this implies that regardless of dimensionality m0 we can minimize the VC dimension by maximizing the margin δ.
• Thus, complexity of the classifier is kept small.

In Linear SVMs

• The classifier is a separating hyperplane
• Most “important” training points are support vectors, they define the hyperplane.
• Both in the dual formula of the problem and in the solution training points appear only inside inner products:

SVM Applications:

• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increased popularity in late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVM techniques have been extended to a number  of tasks such  as regression principal component analysis

Non linear Support vector machine

• Support Vector Machine
• Linear support vector machine
• Mathematical model

This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.
The transformation may be non linear and the transformed space high dimensional though the classifier is a hyperplane in the high-dimensional  feature space , it may be non-linear in the original input space.

• Datasets that are linearly separable with some noise work out great:
• But what we are going to do if the dataset is just too hard?
• How about mapping data to a higher-dimensional space?

• The original feature space can always be mapped to some higher dimensional feature space where the training set is separable:

• So in nonlinear classification:

Kernels:

• We may use kernel functions to implicitly map to  a new feature space\
• Kernel fn:
• Kernel must be equivalent to an inner product in some feature space

Example of kernels:

LAB: Cancer Recognition
`Import numpy`
`from sklearn import svm`
`Import matplotlib.pyplot as plt`
#sample data
`x=numpy.sort(8*numpy.random.rand(40,1),axis=0)`
`y=numpy.sin(x).ravel()`
`y[::5]+=3*(0.5-numpy.random.rand(8))`
#defining the algorithms
`svr_rbf=svm.SVR(kernel= ‘rbf’, C=1e3, gamma=0.1)`
`svr_lin=svm.SVR(kernel= ‘linear’, C=1e3)`
`svr+poly=svm.SVR(kernel= ‘poly’, C=1e3, degree=2)`
#use algorithm to get y
`y_rbf=svr_rbf.fit(x,y).predict(x)`
`y_linear=svr_lin.fit(x,y).predict(x)`
`y_poly=svr_rbf.fit(x,y).predict(x)`
`lw=2`
`plt.scatter(x,y,color= ‘orange’, label= ‘data’)`
`plt.hold( ‘on’)`
`plt.plot(x,y_rbf,color= ‘navy’, lw=lw,label= ‘RBF Model’)`
`plt.plot(x,y_lin,color= ‘blue’, lw=lw,label= ‘Linear Model’)`
`plt.plot(x,y_poly,color= ‘green’, lw=lw,label= ‘Polynomial Model’)`
`plt.xlabel(‘data’)`
`plt.ylabel(‘target’)`
`plt.title(‘Support Vector Regression’)`
`plt.legend()`
`plt.show()`
OUTPUT:
We will get our data set in the form of a graph. RBF model is much better than other models.

Regression Problem Example:
We will work on practical problem of cancer recognition on data set available online.

Data set values
`Import numpy`
`Import pandas`
`Import sklearn import svm,cross_vaidation`
`data=pandas.read_csv(r‘C:\USers\Robot\Documents\cancerdata\breast-cancer-wisconsin.data.txt’)`
#name the attributes as given in the website
`data.drop([‘Id’],1,inplace=True)`
`data.replace(‘?’,-9999,inplace=True)`
`x=numpy.array(data.drop([‘Class’],1))`
`y=numpy.array(data[‘Class’])`
`xtrain,xtest,ytrain,ytest,ytest=cross_validation.train_test_split(x,y,test_size=0.2)`
`alg=svm.SVC()`
`alg.fit(xtrain,ytrain)`
`accuracy=alg.score(xtest,ytest)`
`print accuracy`
OUTPUT:
we are getting accuracy=0.9785714
But we need more accuracy.
#for prediction take data of a patient
`patent=numpy.array([4,2,4,2,1,1,3,2,3])`
`patient=patient.reshape(1,-1)`
`prediction=alg.predict(patient)`
`print prediction`
OUTPUT: [2]
Character Recognition using SVM
`>>>import matplotlib`
`>>>import sklearn`
No error- Both libraries are installed
i`mport matplotlib.pyplot as plt`
`from sklearn import svm`
`from sklearn import datasets`
`digitsdata=datasets.load_digits()`
#it contains 3 values:Matrix values, target values and images
`print digitsdata.data`
`print digitsdata.target`
OUTPUT:

Matrix Values
——————————————————————————————————–
`import matplotlib.pyplot as plt`
`from sklearn import svm`
`from sklearn import datasets`
`digitsdata=datasets.load_digits()`
#it contains 3 values:Matrix values, target values and images
`print digitsdata.data`
`print digitsdata.target`
`plt.imshow(digitsdata.images[5],cmap=plt.cm.gray_r,interpolation= ‘nearest’)`
`plt.show()`

—————————————————————————————————————
`import matplotlib.pyplot as plt`
`from sklearn import svm`
`from sklearn import datasets`
`digitsdata=datasets.load_digits()`
#it contains 3 values:Matrix values, target values and images
`print digitsdata.data`
`print digitsdata.target`
#to train algorithms
`alg=svm.SVC(gamma=0.001,C=100)`
`X,Y=digitsdata .data[:100],digitsdata.target[:100]`
`alg.fit(X,Y)`
`print len(digitsdata.target)`
`print(alg.predict(digitsdata.data[120]))`
`plt.imshow(digitsdata.images[120],cmap=plt.cm.gray_r,interpolation= ‘nearest’)`
`plt.show()`
we have 1797 data sets
[5]
Correct prediction
Keep checking with various values.

Recommended Reading: Programming A Driverless Car Chapter 7: Natural Language Processing
Next: Chapter 9: