How should I train a machine learning algorithm on data with a large class imbalance? (SVM)

Question

I am trying to train an SVM on click and conversion data from people who saw banners. The main problem is that clicks make up only about 0.2% of the data, so the classes are heavily imbalanced. When I use a plain SVM, in the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% correct answers (because of the imbalance), but it gets 0% of the "click" and "conversion" cases right. How can I tune the SVM (or choose another algorithm) to take the imbalance into account?

Recommended answer

The most basic approach here is to use a so-called "class weighting scheme". The classical SVM formulation has a parameter C that controls the penalty for misclassification. It can be split into two parameters, C1 and C2, used for class 1 and class 2 respectively. The most common choice of C1 and C2 for a given C is

C1 = C / n1
C2 = C / n2

where n1 and n2 are the sizes of class 1 and class 2 respectively. This way you "punish" the SVM much harder for misclassifying the less frequent class than for misclassifying the more common one.
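As a rough illustration of the formula above, the per-class penalties can be computed directly from the label counts. This is only a sketch: the helper name per_class_C and the 998/2 label split are made up for the example and are not part of the original answer.

```python
from collections import Counter


def per_class_C(labels, C=1.0):
    """Return {label: C / n_label}, i.e. C_i = C / n_i for every class.

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    return {label: C / n for label, n in counts.items()}


# Illustrative data: 0.2% of the samples are clicks.
labels = ["view"] * 998 + ["click"] * 2
weights = per_class_C(labels, C=1.0)

# The rare class gets a penalty n_view / n_click times larger than the common one.
ratio = weights["click"] / weights["view"]
print(weights, ratio)
```

In sklearn, a dictionary like this could be passed as class_weight, since the effective penalty for class i there is class_weight[i] * C.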

Many existing libraries (like libSVM) support this mechanism through per-class weight parameters (class_weight in sklearn).

Example using Python and sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# create two overlapping, imbalanced clusters:
# 1000 points of class 0 and 100 points of class 1
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * (n_samples_1) + [1] * (n_samples_2)

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# the boundary w[0]*x + w[1]*y + b = 0, rewritten as y = a*x - b/w[1]
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]


# get the separating hyperplane using weighted classes:
# errors on class 1 are penalized 10 times harder
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)

ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot separating hyperplanes and samples
plt.plot(xx, yy, 'k-', label='no weights')
plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()

plt.axis('tight')
plt.show()

In particular, in sklearn you can turn on automatic weighting by setting class_weight='balanced' (spelled class_weight='auto' in older releases).
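Since overall accuracy hides the problem described in the question, it can help to look at recall on the rare class when comparing weighted and unweighted models. The sketch below assumes a recent scikit-learn, where automatic weighting is spelled class_weight='balanced'. It reuses the synthetic data from the example above, and the 70/30 split is an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same synthetic imbalanced data as in the example above: 1000 vs. 100 points.
rng = np.random.RandomState(0)
X = np.r_[1.5 * rng.randn(1000, 2), 0.5 * rng.randn(100, 2) + [2, 2]]
y = np.array([0] * 1000 + [1] * 100)

# Stratify so the rare class keeps the same share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

plain = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
weighted = SVC(kernel="linear", C=1.0,
               class_weight="balanced").fit(X_train, y_train)

# Recall on class 1: the fraction of rare-class samples that are found.
recall_plain = recall_score(y_test, plain.predict(X_test), pos_label=1)
recall_weighted = recall_score(y_test, weighted.predict(X_test), pos_label=1)
print("recall without weights:", recall_plain)
print("recall with weights:   ", recall_weighted)
```

The 'balanced' setting weights each class by n_samples / (n_classes * n_i), which is the C / n_i scheme above up to a constant factor. Weighting usually trades some accuracy on the common class for better recall on the rare one, so both numbers are worth checking.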
