Scikit-learn predict_proba gives wrong answers

Problem description

This is a follow-up question to "How to know what classes are represented in return array from predict_proba in Scikit-learn".

In that question, I quoted the following code:

>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541,  0.60902459]])

I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:

>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]

So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:

>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]

Again, obviously incorrect, but in the other direction.

Finally, I tried it with points that were much further away.

>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]

Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!

>>> model.predict([1,1,1])[0]
'apple'

Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?

-------- UPDATE --------

Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!

>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
... 
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
... 
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]
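
For readers on Python 3 and a current scikit-learn, the experiment above can be sketched roughly as follows. This is an illustrative rewrite, not the code that produced the numbers shown: it assumes a release where SVC still accepts probability=True and where predict_proba expects a 2-D array. The query argument, the random_state=0 setting and the "agrees" field are additions for illustration. As background, the scikit-learn documentation describes SVC's probabilities as coming from Platt scaling fitted with an internal cross-validation, and warns that they may be inconsistent with predict, especially on very small datasets.

```python
import numpy as np
from sklearn import svm


def train_test(n, query=(1, 2, 3)):
    """Fit on n copies of the two training points and compare both outputs."""
    X = [[1, 2, 3], [2, 3, 4]] * n
    Y = ["apple", "orange"] * n
    # random_state is an added assumption, used to make repeated runs comparable.
    model = svm.SVC(probability=True, random_state=0)
    model.fit(X, Y)

    # Current scikit-learn expects a 2-D array: one row per sample.
    proba = model.predict_proba([list(query)])[0]
    predicted = model.predict([list(query)])[0]
    most_probable = model.classes_[np.argmax(proba)]

    return {
        "n": n,
        "proba": dict(zip(model.classes_, proba)),
        "predict": predicted,
        # True when the most probable class matches what predict returns.
        "agrees": bool(predicted == most_probable),
    }


for n in range(1, 6):
    print(train_test(n))
```

The exact probabilities depend on the scikit-learn version, so they need not match the 0.13.1 output above. The "agrees" field only reports whether the two methods happen to coincide for a given n.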

How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?

Solution

If you use svm.LinearSVC() as the estimator, you can use .decision_function() (which plays a role similar to svm.SVC's .predict_proba()) to sort the results from the most probable class to the least probable one. This ordering agrees with the .predict() function. In addition, this estimator is faster and gives almost the same results as svm.SVC().

The only drawback for you might be that .decision_function() returns a signed value (something like between -1 and 3) instead of a probability, but it agrees with the prediction.
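
A minimal sketch of that suggestion, assuming a current scikit-learn release. The helper rank_classes is a hypothetical name introduced here, not part of the library. In the two-class case decision_function returns a single signed score per sample, with positive values favouring classes_[1], so the helper mirrors it into two columns before sorting.

```python
import numpy as np
from sklearn import svm


def rank_classes(model, sample):
    """Return (label, score) pairs ordered from highest to lowest score."""
    scores = np.asarray(model.decision_function([sample]))
    if scores.ndim == 1:
        # Binary case: one signed score per sample; positive favours classes_[1].
        scores = np.column_stack([-scores, scores])
    order = np.argsort(scores[0])[::-1]
    return [(model.classes_[i], float(scores[0][i])) for i in order]


X = [[1, 1, 1], [20, 20, 20]]  # feature vectors from the question
Y = ["apple", "orange"]        # classes

model = svm.LinearSVC()
model.fit(X, Y)

for sample in X:
    # The top-ranked label should match what predict returns.
    print(sample, model.predict([sample])[0], rank_classes(model, sample))
```

The scores are distances from the decision boundary rather than probabilities, so they are useful for ordering classes but should not be read as confidence percentages.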
