One-class Support Vector Machine sensitivity drops when the number of training samples increases


Problem description

I am using a One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM detection result drops, while the classification rate and specificity both increase.
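For concreteness, here is a minimal sketch of how such a sensitivity number can be computed, assuming scikit-learn's OneClassSVM, synthetic data, and inliers treated as the positive class (all of these are my assumptions, not details from the question):

```python
# Minimal sketch (assumed setup, not the asker's actual data): fit a
# scikit-learn OneClassSVM on synthetic inliers, then measure sensitivity
# TP/(TP+FN) on held-out inliers. Swap the labels if outliers are your
# positive class.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)       # inliers used for training
X_inliers = 0.3 * rng.randn(50, 2)      # held-out inliers (positives)

clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X_train)

# OneClassSVM.predict returns +1 for inliers and -1 for outliers.
pred = clf.predict(X_inliers)
tp = np.sum(pred == 1)                  # inliers recognised as inliers
fn = np.sum(pred == -1)                 # inliers rejected as outliers
print("sensitivity TP/(TP+FN):", tp / (tp + fn))
```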

What is the best way to explain this relationship in terms of the hyperplane and support vectors?

Thanks

Recommended answer

The more training examples you have, the less your classifier is able to correctly detect true positives.

It means that the new data does not fit well with the model you are training.

Here is a simple example.

Below you have two classes, and we can easily separate them using a linear kernel. The sensitivity of the blue class is 1.

As I add more yellow training data near the decision boundary, the generated hyperplane can no longer fit the data as well as before.

As a consequence, we now see that there are two misclassified blue data points. The sensitivity of the blue class is now 0.92.
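The original figures are not reproduced here, but a rough sketch of the same kind of experiment, using made-up 2-D data and a linear-kernel SVC from scikit-learn (the points and numbers are illustrative, not the ones in the figures), looks like this:

```python
# Sketch of the effect described above, with synthetic 2-D data: train a
# linear-kernel SVC, then add extra "yellow" points near the boundary and
# observe that the blue class's training sensitivity typically drops.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
blue = rng.randn(50, 2) + [0, 2]     # blue class, centred above the boundary
yellow = rng.randn(50, 2) + [0, -2]  # yellow class, centred below it

def blue_sensitivity(X_blue, X_yellow):
    X = np.vstack([X_blue, X_yellow])
    y = np.array([1] * len(X_blue) + [0] * len(X_yellow))  # 1 = blue
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    pred = clf.predict(X_blue)
    return np.mean(pred == 1)        # TP / (TP + FN) for the blue class

print("well separated:", blue_sensitivity(blue, yellow))

# Add more yellow points right next to the decision boundary.
extra_yellow = rng.randn(30, 2) * 0.3 + [0, 1.5]
print("with extra yellow near the boundary:",
      blue_sensitivity(blue, np.vstack([yellow, extra_yellow])))
```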

As the number of training samples increases, the support vectors generate a somewhat less optimal hyperplane. Perhaps, because of the extra data, a linearly separable data set becomes non-linearly separable. In such a case, trying a different kernel, such as the RBF kernel, can help.
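As an illustrative sketch, assuming a synthetic make_circles data set of my own choosing (not from the original answer), a problem that a linear kernel cannot separate can still be handled well by an RBF kernel:

```python
# Sketch: when the classes are not linearly separable, an RBF kernel can
# recover what a linear kernel cannot (synthetic example).
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=0)

for kernel in ("linear", "rbf"):
    score = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5).mean()
    print(kernel, "accuracy:", round(score, 3))
```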

Adding more information about the RBF kernel:

In this video you can see what happens with an RBF kernel. The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.

You should try to select a better C using cross-validation.
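A minimal sketch of doing that with scikit-learn's GridSearchCV, on an assumed synthetic data set and illustrative grid values (note that for a OneClassSVM the analogous parameter to tune is nu rather than C):

```python
# Sketch: pick C (and gamma) by cross-validation instead of guessing.
# The data set and grid values are illustrative, not from the original answer.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="recall")
search.fit(X, y)

print("best params:", search.best_params_)
print("best cross-validated recall (sensitivity):", round(search.best_score_, 3))
```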

In this paper, Figure 3 illustrates that the results can be worse if C is not properly selected:

More training data could hurt if we did not pick a proper C. We need to cross-validate on the correct C to produce good results.
