What is the relation between the number of support vectors, training data, and classifier performance?


Question

I am using LibSVM to classify some documents. The documents seem to be a bit difficult to classify, as the final results show. However, I noticed something while training my models: if my training set contains, for example, 1000 examples, around 800 of them are selected as support vectors. I have looked everywhere to find out whether this is a good thing or a bad thing. Is there a relation between the number of support vectors and the classifier's performance? I have read this previous post. However, I am performing parameter selection, and I am also sure that the attributes in the feature vectors are all ordered. I just need to know the relation. Thanks. P.S.: I use a linear kernel.

Answer

Support Vector Machines are an optimization problem. They are attempting to find a hyperplane that divides the two classes with the largest margin. The support vectors are the points which fall within this margin. It's easiest to understand if you build it up from simple to more complex.

Hard Margin Linear SVM

In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points which lie along the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane, at the edges of the margin).

All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.
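
To make this concrete, here is a minimal sketch (not from the original post) using scikit-learn's `SVC`, which wraps LIBSVM: on cleanly separable data, a very large `C` approximates a hard margin, and only a handful of points end up as support vectors. The data and the `C` value are illustrative assumptions.

```python
# Minimal sketch: a very large C approximates a hard-margin linear SVM.
# scikit-learn's SVC wraps LIBSVM; the data and C value are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated clusters: linearly separable by a wide margin.
X = np.vstack([rng.normal(-3.0, 0.5, size=(50, 2)),
               rng.normal(+3.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ no slack allowed

# Only the points lying on the margin are kept as support vectors.
print(clf.n_support_.sum(), "of", len(X))  # a handful, e.g. 2 or 3
```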

Soft Margin Linear SVM

But what if our dataset isn't linearly separable? We introduce soft margin SVM. We no longer require that our datapoints lie outside the margin; we allow some of them to stray over the line into the margin. We use the slack parameter C to control this (nu in nu-SVM). This gives us a wider margin and greater error on the training dataset, but improves generalization and/or allows us to find a linear separation of data that is not linearly separable.

Now, the number of support vectors depends on how much slack we allow and the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to get a high level of accuracy; we must simply find the best fit we can.
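
A quick hedged sketch of this effect (again with scikit-learn's `SVC`; the data and `C` values are just illustrative): a small `C` means lots of slack and a wide margin, so many points fall inside it and become support vectors, while a large `C` keeps few.

```python
# Sketch: how the slack penalty C changes the number of support vectors
# on overlapping (not linearly separable) data. Values are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.5, size=(200, 2)),
               rng.normal(+1.0, 1.5, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} support vectors: {clf.n_support_.sum()} / {len(X)}")
# Small C (much slack, wide margin) -> many support vectors;
# large C (little slack, narrow margin) -> fewer.
```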

Non-Linear SVM

This brings us to non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is a non-linear decision boundary.

Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn in the final model in our input space requires one or more support vectors to define. Ultimately, the output of an SVM is the support vectors and an alpha for each, which in essence defines how much influence that specific support vector has on the final decision.
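
As a sketch of this (using an RBF kernel via scikit-learn's `SVC`; the dataset and `gamma` values are illustrative assumptions), a more complex, wigglier boundary recruits more support vectors:

```python
# Sketch: with an RBF kernel, boundary complexity (gamma) also drives
# the support-vector count. Dataset and gamma values are illustrative.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:<6} support vectors: {clf.n_support_.sum()} / {len(X)}")
# A smooth boundary needs few support vectors; a boundary with many
# twists and turns (large gamma) needs many more to define it.
```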

Here, accuracy depends on the trade-off between a high-complexity model which may over-fit the data and a large margin which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This trade-off is controlled via C and through the choice of kernel and kernel parameters.
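
In practice this trade-off is usually tuned with cross-validated parameter search, much like the parameter selection you describe. A minimal sketch, assuming scikit-learn's `GridSearchCV` and an illustrative parameter grid:

```python
# Sketch: tuning the C / kernel-parameter trade-off with cross-validated
# grid search. The dataset and grid values are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```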

I assume when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to test a data point using an SVM model, you need to compute the dot product of each support vector with the test point. Therefore the computational complexity of the model is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
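
A small sketch of that cost (the attribute names follow scikit-learn's `SVC` API, an assumption about your setup): the decision function can be reconstructed by hand as one kernel evaluation per support vector, each weighted by its alpha.

```python
# Sketch: classification cost is one kernel evaluation per support
# vector. Reconstructing the linear-kernel decision function by hand;
# attribute names follow scikit-learn's SVC (which wraps LIBSVM).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="linear").fit(X, y)
x_test = rng.normal(size=5)

# dual_coef_ holds y_i * alpha_i for each support vector: its weight
# in the final decision. One dot product per support vector, plus bias.
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_test) + clf.intercept_
print(np.allclose(manual, clf.decision_function([x_test])))  # True
```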

A good resource: A Tutorial on Support Vector Machines for Pattern Recognition
