Optimization: Loss Function Under the Hood (Part III)


Continuing this journey, I discussed the loss function and optimization process of linear regression in Part I and of logistic regression in Part II. This time, we are heading to the Support Vector Machine (SVM).

Let’s start with Linear SVM, also known as SVM without kernels. Looking at the scatter plot of the two features X1 and X2 below, we can separate the two classes in many different ways; the pink line and the green line are two of them. SVM ends up choosing the green line as the decision boundary, because the way SVM classifies samples is to find the decision boundary with the largest margin, that is, the largest distance to the samples closest to it. That is why Linear SVM is also called a Large Margin Classifier.

[Figure: two classes on features X1 and X2 with two candidate decision boundaries (pink and green); the green line has the largest margin]
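As a quick, minimal sketch (not from the original post; the toy data below is made up for illustration), this is how a linear SVM can be fit with Scikit-learn and how its support vectors can be inspected:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two features X1, X2 and two classes
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear SVM (SVM without kernels); it picks the max-margin boundary
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # coefficients of the decision boundary
print(clf.support_vectors_)        # only these points define the boundary
print(clf.predict([[4, 4]]))
```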

Who are the support vectors? A support vector is a sample that is incorrectly classified, or a sample close to the boundary. Looking at the plot below, the samples with red circles are exactly those that determine the decision boundary. In SVM, only the support vectors have an effective impact on model training; that is to say, removing the non-support vectors has no effect on the model at all. Why? We will figure it out from its cost function.

[Figure: the support vectors circled in red]

The loss function of SVM is very similar to that of Logistic Regression. Looking at it for y = 1 and y = 0 separately in the plot below, the black line is the cost function of Logistic Regression, and the red line is that of SVM. Please note that the x-axis here is the raw model output, θᵀx. Remember that putting the raw model output into the Sigmoid function gives us Logistic Regression’s hypothesis. What is the hypothesis for SVM? It is simple and straightforward: when θᵀx ≥ 0, predict 1; otherwise, predict 0.

[Figures: Logistic Regression cost (black) vs. SVM cost (red) plotted against θᵀx, for y = 1 and y = 0]

Then back to the loss function plot, a.k.a. Hinge Loss. When the actual label is 1 (left plot below), if θᵀx ≥ 1 there is no cost at all; if θᵀx < 1, the cost increases as the value of θᵀx decreases. Wait! When θᵀx ≥ 0, we already predict 1, which is the correct prediction. Why does the cost start to increase at 1 instead of 0? Yes, SVM punishes both incorrect predictions and correct predictions that are close to the decision boundary (0 < θᵀx < 1); that is what makes them support vectors. When a data point is right on the margin, θᵀx = 1; when it is between the decision boundary and the margin, 0 < θᵀx < 1. I will explain later why some data points appear inside the margin. As for why removing non-support vectors won’t affect model performance, we are able to answer it now. Remember that model fitting is the process of minimizing the cost function. Since non-support vectors contribute no cost at all, the total value of the cost function is not changed by adding or removing them.

[Figure: hinge loss for y = 1 (left) and y = 0 (right)]
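To make the shape of the two cost curves concrete, here is a minimal NumPy sketch of the hinge-style cost₁ and cost₀ described above (my own illustration, not code from the post):

```python
import numpy as np

def cost_1(z):
    """Cost when the actual label is 1: zero once the raw output z = theta^T x >= 1."""
    return np.maximum(0, 1 - z)

def cost_0(z):
    """Cost when the actual label is 0: zero once the raw output z = theta^T x <= -1."""
    return np.maximum(0, 1 + z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print(cost_1(z))   # [3.  1.5 1.  0.5 0.  0. ] -> cost only vanishes for z >= 1
print(cost_0(z))   # [0.  0.5 1.  1.5 2.  3. ] -> cost only vanishes for z <= -1
```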

Let’s write the formula for SVM’s cost function:

[Figure: SVM cost function]
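Written out in the standard hinge-loss notation that matches the plots above (a reconstruction, with m the number of training samples), the cost is:

$$J(\theta) = \sum_{i=1}^{m} \Big[\, y^{(i)}\,\mathrm{cost}_1\big(\theta^{T}x^{(i)}\big) + \big(1-y^{(i)}\big)\,\mathrm{cost}_0\big(\theta^{T}x^{(i)}\big) \Big]$$

where cost₁(z) = max(0, 1 − z) and cost₀(z) = max(0, 1 + z) are the two red curves shown earlier.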

We can also add regularization to SVM. For example, adding an L2 regularization term to SVM, the cost function changes to:

[Figure: SVM cost function with L2 regularization]
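In the same reconstructed notation, the regularized version puts C in front of the fit term and adds an L2 penalty over the n coefficients:

$$J(\theta) = C\sum_{i=1}^{m} \Big[\, y^{(i)}\,\mathrm{cost}_1\big(\theta^{T}x^{(i)}\big) + \big(1-y^{(i)}\big)\,\mathrm{cost}_0\big(\theta^{T}x^{(i)}\big) \Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^{2}$$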

Different from Logistic Regression, which uses λ in front of the regularization term to control the weight of regularization, SVM uses C in front of the fit term. Intuitively, the fit term emphasizes fitting the model well by finding optimal coefficients, while the regularization term controls the complexity of the model by penalizing large coefficients. There is a trade-off between fitting the training dataset well and keeping the model simple enough to avoid overfitting, which can be adjusted by tweaking the value of λ or C. Both λ and C set how much we care about the fit term relative to the regularization term. Because it sits in a different place in the cost function, C actually plays a role similar to 1/λ.

With a very large value of C (similar to no regularization), this large margin classifier will be very sensitive to outliers. For example, in the plot below on the left, the ideal decision boundary should be the green line; after adding the orange triangle (an outlier), with a very big C the decision boundary shifts to the orange line to satisfy the large-margin rule. On the other hand, C also adjusts the width of the margin, which allows margin violations. See the plot below on the right: when C is small, the margin is wider, as shown by the green line, and the pink data points have violated the margin. This is especially useful when dealing with a non-separable dataset. So this is how regularization affects the choice of decision boundary and makes the algorithm work for non-linearly separable datasets, with tolerance for data points that are misclassified or violate the margin.

[Figures: with a very large C the boundary shifts from the green line to the orange line to accommodate the outlier (left); with a small C the margin is wider and some pink points violate it (right)]
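A minimal sketch of the effect of C with Scikit-learn (the toy data and values are made up for illustration): with a big C the fit term dominates, so the boundary tries hard to classify even the outlier correctly; with a small C, margin violations are tolerated and the margin stays wide.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical nearly-separable toy data; the last point acts as an outlier
X = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [6, 5], [7, 7], [8, 6], [3, 5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

for C in (100.0, 0.1):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Compare how the fitted boundary and the set of support vectors change with C
    print(f"C={C}: n_support_vectors={len(clf.support_vectors_)}, "
          f"coef={clf.coef_.round(2)}, intercept={clf.intercept_.round(2)}")
```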

When the decision boundary is not linear, the structure of the hypothesis and the cost function stays the same. First, let’s take a look.

[Figure: non-linear SVM hypothesis and cost function, with x replaced by f]

You may have noticed that the non-linear SVM’s hypothesis and cost function are almost the same as the linear SVM’s, except that ‘x’ is replaced by ‘f’ here. f is a function of x, and I will discuss how to find f next. Let’s start from the very beginning. Assume that we have one sample (see the plot below) with two features x1, x2. I randomly put a few points (l⁽¹⁾, l⁽²⁾, l⁽³⁾) around x and call them landmarks. I would like to see how close x is to each of these landmarks, which is denoted as f1 = Similarity(x, l⁽¹⁾) or k(x, l⁽¹⁾), f2 = Similarity(x, l⁽²⁾) or k(x, l⁽²⁾), f3 = Similarity(x, l⁽³⁾) or k(x, l⁽³⁾).

[Figure: a sample x with features x1, x2 and three landmarks l⁽¹⁾, l⁽²⁾, l⁽³⁾]

This is called a Kernel Function, and it is exactly the ‘f’ that you have seen in the formula above. What is inside the Kernel Function? In other words, how should we describe x’s proximity to the landmarks? There are different types; the Gaussian Kernel is one of the most popular. It is calculated from the Euclidean distance between the two vectors and a parameter σ that describes the smoothness of the function. The Gaussian kernel provides a good intuition: if x ≈ l⁽¹⁾, then f1 ≈ 1; if x is far from l⁽¹⁾, then f1 ≈ 0. In the Scikit-learn SVM package, the Gaussian Kernel is mapped to ‘rbf’, the Radial Basis Function kernel; the only difference is that ‘rbf’ uses γ to represent the Gaussian’s 1/(2σ²).

[Figure: Gaussian kernel formula]
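To make the γ = 1/(2σ²) mapping concrete, here is a small sketch (my own, with made-up numbers) comparing a hand-written Gaussian kernel with Scikit-learn’s rbf_kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_similarity(x, l, sigma):
    """f = exp(-||x - l||^2 / (2 * sigma^2)): ~1 when x is close to landmark l, ~0 when far."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])    # a sample with two features
l1 = np.array([1.1, 2.1])   # a landmark close to x
l2 = np.array([8.0, 9.0])   # a landmark far from x
sigma = 1.0

print(gaussian_similarity(x, l1, sigma))   # close to 1
print(gaussian_similarity(x, l2, sigma))   # close to 0

# Scikit-learn's 'rbf' kernel is the same function with gamma = 1 / (2 * sigma^2)
gamma = 1.0 / (2 * sigma ** 2)
print(rbf_kernel(x.reshape(1, -1), l1.reshape(1, -1), gamma=gamma))  # matches the value above
```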

We can say that the position of sample x has been re-defined by those three kernels. That is to say, non-linear SVM computes new features f1, f2, f3 based on the proximity to the landmarks, instead of using x1, x2 as features anymore, and those features depend on the chosen landmarks. This is where the raw model output θᵀf comes from. Let’s try a simple example: θᵀf = θ0 + θ1f1 + θ2f2 + θ3f3. Assign θ0 = -0.5, θ1 = θ2 = 1, θ3 = 0, so θᵀf turns out to be -0.5 + f1 + f2. Looking at the first sample (S1), which is very close to l⁽¹⁾ and far from l⁽²⁾ and l⁽³⁾, with the Gaussian kernel we get f1 = 1, f2 = 0, f3 = 0, so θᵀf = 0.5. According to the hypothesis mentioned before, we predict 1. Sample 2 (S2) is far from all the landmarks, so f1 = f2 = f3 = 0 and θᵀf = -0.5 < 0, so we predict 0. Based on the current θs, it is easy to notice that any point near l⁽¹⁾ or l⁽²⁾ will be predicted as 1, and 0 otherwise. The green line demonstrates an approximate decision boundary, as shown below.

[Figure: approximate decision boundary (green) enclosing the regions near l⁽¹⁾ and l⁽²⁾]
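A quick numeric check of the worked example above (a sketch; the θ values and kernel outputs are the ones chosen by hand in the text):

```python
import numpy as np

theta = np.array([-0.5, 1.0, 1.0, 0.0])   # theta0, theta1, theta2, theta3 from the example

def predict(f1, f2, f3):
    f = np.array([1.0, f1, f2, f3])       # prepend 1 for the intercept term theta0
    raw = theta @ f                        # raw model output theta^T f
    return (1 if raw >= 0 else 0), raw

print(predict(1, 0, 0))   # S1: close to l(1)          -> (1, 0.5)
print(predict(0, 0, 0))   # S2: far from all landmarks -> (0, -0.5)
```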

We have just gone through the prediction part with features and coefficients that I chose manually. So, where do these landmarks come from? How many landmarks do we need? It might surprise you that, given m training samples, the locations of the landmarks are exactly the locations of your m training samples.

[Figure: landmarks placed at the locations of the training samples]

That is to say, non-linear SVM recreates the features by comparing each training sample with all the other training samples. Thus, the number of features created by the landmarks for prediction equals the number of training samples. For a given sample, the updated features are as below:

[Figure: the kernel features f1, …, fm computed for a given sample]
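In code, this amounts to building an m×m kernel matrix in which sample i gets features f1, …, fm: its similarity to every training sample. A minimal sketch with a made-up training set:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical training set with m = 5 samples and 2 original features
X_train = np.array([[1, 2], [2, 1], [3, 3], [6, 5], [7, 8]], dtype=float)

sigma = 1.0
F = rbf_kernel(X_train, X_train, gamma=1.0 / (2 * sigma ** 2))

# Each row i holds the new features (f1, ..., fm) of sample i:
# its Gaussian similarity to every training sample (the landmarks).
print(F.shape)        # (5, 5): the feature count equals the number of training samples
print(F[0].round(3))  # features of the first sample; F[0, 0] == 1 (similarity to itself)
```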

Regarding recreating features, this concept is similar to polynomial regression: to reach a non-linear effect, we can add new features by transforming existing features, for example by squaring them. Suppose you have two features x1 and x2. To build a polynomial regression you might use θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x1²x2, so your features become f1 = x1, f2 = x2, f3 = x1², f4 = x1²x2.
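For comparison, that polynomial-feature construction could look like the following sketch (toy numbers, for illustration only):

```python
import numpy as np

# Two original features x1, x2 for a few hypothetical samples
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x1, x2 = X[:, 0], X[:, 1]

# Hand-crafted transformed features from the example:
# f1 = x1, f2 = x2, f3 = x1^2, f4 = x1^2 * x2
F = np.column_stack([x1, x2, x1 ** 2, x1 ** 2 * x2])
print(F)
```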

Let’s rewrite the hypothesis, cost function, and cost function with regularization.

[Figure: hypothesis, cost function, and regularized cost function rewritten with f]
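Written out in the same reconstructed notation, with f⁽ⁱ⁾ the kernel-feature vector of sample i: the hypothesis predicts 1 when θᵀf ≥ 0 and 0 otherwise, and the regularized cost becomes

$$J(\theta) = C\sum_{i=1}^{m} \Big[\, y^{(i)}\,\mathrm{cost}_1\big(\theta^{T}f^{(i)}\big) + \big(1-y^{(i)}\big)\,\mathrm{cost}_0\big(\theta^{T}f^{(i)}\big) \Big] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^{2}$$

(note that the regularization sum now runs over the m kernel features rather than the original n features).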

To achieve good model performance and prevent overfitting, besides picking a proper value of the regularization parameter C, we can also adjust σ² of the Gaussian Kernel to find a balance between bias and variance. Take a certain sample x and a certain landmark l as an example: when σ² is very large, the output of the kernel function f is close to 1; as σ² gets smaller, f moves towards 0. In other words, with a fixed distance between x and l, a big σ² regards them as ‘closer’, which leads to higher bias and lower variance (underfitting), while a small σ² regards them as ‘further apart’, which leads to lower bias and higher variance (overfitting).
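In Scikit-learn terms, tuning σ² means tuning gamma = 1/(2σ²): a small gamma (large σ²) gives a smoother, higher-bias model, while a large gamma (small σ²) gives a wigglier, higher-variance one. A minimal sketch (toy data and values are made up):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [6, 5], [7, 7], [8, 6], [6, 8]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

for gamma in (0.01, 10.0):   # small gamma ~ large sigma^2, large gamma ~ small sigma^2
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)
    # With a large gamma, each sample only "sees" very nearby landmarks, so the
    # boundary can become very wiggly and more points tend to end up as support vectors.
    print(f"gamma={gamma}: n_support={clf.n_support_}, train_acc={clf.score(X, y):.2f}")
```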

Like Logistic Regression’s, SVM’s cost function is convex as well. The most popular optimization algorithm for SVM is Sequential Minimal Optimization (SMO), which is implemented by the ‘libsvm’ library (the backend of Scikit-learn’s SVC). SMO solves the large quadratic programming (QP) problem by breaking it into a series of small QP problems that can be solved analytically, avoiding a time-consuming numerical optimization to some degree. The detailed calculations are pretty complicated and involve many numerical tricks that make the computation efficient enough to handle very large training datasets.

In summary, if you have a large number of features, Linear SVM or Logistic Regression might be the choice. If you have a small number of features (under 1,000) and the training set is not too large, SVM with a Gaussian Kernel might work well for your data.
