Confusing probabilities of the predict_proba of scikit-learn's svm

Problem description

My purpose is to draw a PR curve from the sorted per-sample probabilities for a specific class. However, I found that the probabilities returned by svm's predict_proba() behave in two different ways when I use two different standard datasets: iris and digits.

The first case is evaluated on the "iris" dataset with the Python code below, and it behaves reasonably: the predicted class gets the highest probability.

from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

D = datasets.load_iris()
clf = SVC(kernel=chi2_kernel, probability=True).fit(D.data, D.target)
output_predict = clf.predict(D.data)
output_proba = clf.predict_proba(D.data)
output_decision_function = clf.decision_function(D.data)
output_my = proba_to_class(output_proba, clf.classes_)  # proba_to_class: the asker's own helper, defined elsewhere

print(D.data.shape, D.target.shape)
print("target:", D.target[:2])
print("class:", clf.classes_)
print("output_predict:", output_predict[:2])
print("output_proba:", output_proba[:2])

Next, it produces the output below. As expected, the highest probability of each sample matches the output of predict(): 0.97181088 for sample #1 and 0.96961523 for sample #2.

(150, 4) (150,)
target: [0 0]
class: [0 1 2]
output_predict: [0 0]
output_proba: [[ 0.97181088  0.01558693  0.01260218]
[ 0.96961523  0.01702481  0.01335995]]

However, when I change the dataset to "digits" with the following code, the probabilities show the opposite behavior: the label returned by predict() corresponds to the lowest probability of each sample, 0.00190932 for sample #1 and 0.00220549 for sample #2.

D = datasets.load_digits()

Output:

(1797, 64) (1797,)
target: [0 1]
class: [0 1 2 3 4 5 6 7 8 9]
output_predict: [0 1]
output_proba: [[ 0.00190932  0.11212957  0.1092459   0.11262532  0.11150733  0.11208733
   0.11156622  0.11043403  0.10747514  0.11101985]
 [ 0.10991574  0.00220549  0.10944998  0.11288081  0.11178518  0.11234661
   0.11182221  0.11065663  0.10770783  0.11122952]]
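The disagreement described above can be quantified directly by comparing predict() with the argmax of predict_proba(). This is a small sketch (not from the original post) on the digits dataset; the variable names are my own:

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# Count how often predict() disagrees with the argmax of predict_proba()
# on the digits data, using the same chi-squared kernel as in the question.
D = datasets.load_digits()
clf = SVC(kernel=chi2_kernel, probability=True).fit(D.data, D.target)

pred = clf.predict(D.data)
# Map each argmax index back to the actual class label via clf.classes_
from_proba = clf.classes_[np.argmax(clf.predict_proba(D.data), axis=1)]
print("disagreements:", np.sum(pred != from_proba), "of", len(pred))
```

Because predict_proba() is fitted by an internal cross-validation (Platt scaling), the disagreement count can vary between runs, but on this dataset the two outputs are clearly inconsistent.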

I've read this post, which suggests a solution using a linear SVM with decision_function(). However, because of my task, I still have to use the chi-squared kernel for SVM.

Is there a solution?

Recommended answer

As the documentation states, there is no guarantee that predict_proba and predict will give consistent results on SVC. You can simply use decision_function. That is true for both linear and kernel SVM.
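Since decision_function works with any kernel, the PR curve can be ranked by decision scores instead of probabilities. A minimal sketch (my own, not from the answer), treating class 0 as the positive class in a one-vs-rest view; `precision_recall_curve` is from sklearn.metrics:

```python
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve
from sklearn.metrics.pairwise import chi2_kernel

D = datasets.load_digits()
# No probability=True needed; 'ovr' gives one score column per class,
# ordered to match clf.classes_
clf = SVC(kernel=chi2_kernel, decision_function_shape="ovr").fit(D.data, D.target)
scores = clf.decision_function(D.data)  # shape (n_samples, n_classes)

# PR curve for class 0: binarize the targets and rank by the class-0 score
y_true = (D.target == 0).astype(int)
precision, recall, thresholds = precision_recall_curve(y_true, scores[:, 0])
print(len(precision), len(recall))
```

The resulting precision/recall arrays can be plotted directly; the ranking by decision score is consistent with predict(), unlike the predict_proba() output in the question.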
