How should I teach a machine learning algorithm using data with a big disproportion of classes? (SVM)
Problem description
I am trying to train an SVM on click and conversion data from people who see banner ads. The main problem is that clicks make up only around 0.2% of all the data, so the classes are heavily imbalanced. When I use a plain SVM, in the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% correct answers (because of the imbalance), but 0% correct predictions if you look only at the "click" or "conversion" examples. How can I tune the SVM algorithm (or select another one) to take the imbalance into account?
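To make the arithmetic in the question concrete, here is a minimal sketch of why overall accuracy hides the problem. The label names follow the question; the sample of 1000 impressions with 2 clicks is an illustrative assumption, and the helper functions `accuracy` and `recall_per_class` are hypothetical names, not part of any library.

```python
def accuracy(y_true, y_pred):
    # Fraction of all examples that were predicted correctly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    # For each class: fraction of its examples that were predicted as that class.
    recalls = {}
    for cls in sorted(set(y_true)):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls[cls] = sum(p == cls for p in preds_for_cls) / len(preds_for_cls)
    return recalls

# Assumed toy data: 0.2% clicks, and a classifier that always answers "view".
y_true = ["view"] * 998 + ["click"] * 2
y_pred = ["view"] * 1000

overall = accuracy(y_true, y_pred)
by_class = recall_per_class(y_true, y_pred)
print(overall, by_class)
```

Looking at per-class recall rather than overall accuracy is what exposes the classifier that never predicts the rare class.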
Recommended answer
The most basic approach here is to use a so-called "class weighting scheme". In the classical SVM formulation there is a C parameter that controls the misclassification penalty. It can be split into C1 and C2 parameters used for class 1 and class 2 respectively. The most common choice of C1 and C2 for a given C is to set
C1 = C / n1
C2 = C / n2
where n1 and n2 are the sizes of class 1 and class 2 respectively. This way you "punish" the SVM much harder for misclassifying the less frequent class than for misclassifying the more common one.
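As a rough illustration of the rule C1 = C / n1, C2 = C / n2, the sketch below computes the per-class penalties from a list of labels. The helper name `per_class_C` and the class sizes (1000 and 100) are assumptions chosen for illustration only.

```python
from collections import Counter

def per_class_C(labels, C=1.0):
    """Return {class: C / n_class}, i.e. C_k = C / n_k for every class k."""
    counts = Counter(labels)
    return {cls: C / n for cls, n in counts.items()}

# Assumed toy labels: 1000 examples of class 0 and 100 examples of class 1.
labels = [0] * 1000 + [1] * 100
class_C = per_class_C(labels, C=1.0)

# The rare class is penalised n1 / n2 times harder than the common one.
ratio = class_C[1] / class_C[0]
print(class_C, ratio)
```

Only the ratio between the two penalties decides how the classes are traded off against each other; the overall scale plays the role of the usual C and can still be tuned, for example by cross-validation.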
Many existing libraries (such as libSVM) support this mechanism through class_weight parameters.
Example using Python and sklearn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
# create two imbalanced clusters: 1000 points of class 0 and 100 points of class 1
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * n_samples_1 + [1] * n_samples_2
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]
# get the separating hyperplane using weighted classes
# (errors on the rare class 1 are penalised 10 times harder)
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
# plot separating hyperplanes and samples
plt.plot(xx, yy, 'k-', label='no weights')
plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()
plt.axis('tight')
plt.show()
In particular, in sklearn you can simply turn on automatic weighting by setting class_weight='balanced' (this option was called class_weight='auto' in older versions).
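A minimal sketch of the automatic weighting, assuming a recent scikit-learn where the option is spelled class_weight='balanced'. The synthetic data mirrors the example above; the helper `minority_recall` is a hypothetical name, and recall is measured on the training data purely for illustration, not as a proper evaluation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy data: 1000 points of class 0 and 100 points of class 1.
rng = np.random.RandomState(0)
X = np.r_[1.5 * rng.randn(1000, 2), 0.5 * rng.randn(100, 2) + [2, 2]]
y = np.array([0] * 1000 + [1] * 100)

# Weights that 'balanced' assigns: n_samples / (n_classes * n_samples_in_class).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

def minority_recall(model):
    # Fit on the toy data and report the fraction of class-1 points recovered.
    pred = model.fit(X, y).predict(X)
    return float(np.mean(pred[y == 1] == 1))

recall_plain = minority_recall(SVC(kernel="linear", C=1.0))
recall_balanced = minority_recall(
    SVC(kernel="linear", C=1.0, class_weight="balanced"))
print(weights, recall_plain, recall_balanced)
```

The weights are inversely proportional to class frequency, so the rare class here is weighted ten times more heavily than the common one. On real data the comparison should be made on a held-out set.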