How to rank the instances based on prediction probability in sklearn


Question

I am using sklearn's support vector machine (SVC) with 10-fold cross-validation to get the prediction probability of the instances in my dataset, as follows:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

iris = datasets.load_iris()

X = iris.data
y = iris.target

clf = SVC(class_weight="balanced")
proba = cross_val_predict(clf, X, y, cv=10, method='predict_proba')

print(clf.classes_)
print(proba[:,1])
print(np.argsort(proba[:,1]))

My expected output for print(proba[:,1]) and print(np.argsort(proba[:,1])) is as follows, where the first shows the predicted probability of class 1 for every instance and the second shows the index of the data instance corresponding to each probability:

[0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.1 0.  0.  0.
 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
 0.2 0.  0.  0.  0.  0.1 0.  0.  0.  0.  0.  0.  0.  0.  0.9 1.  0.7 1.
 1.  1.  1.  0.7 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.9 0.9 0.1 1.
 0.6 1.  1.  1.  0.9 0.  1.  1.  1.  1.  1.  0.4 0.9 0.9 1.  1.  1.  0.9
 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.9 0.
 0.1 0.  0.  0.  0.  0.  0.  0.  0.1 0.  0.  0.8 0.  0.1 0.  0.1 0.  0.1
 0.3 0.2 0.  0.6 0.  0.  0.  0.6 0.4 0.  0.  0.  0.8 0.  0.  0.  0.  0.
 0.  0.  0.  0.  0.  0. ]

[  0 113 112 111 110 109 107 105 104 114 103 101 100  77 148  49  48  47
  46 102 115 117 118 147 146 145 144 143 142 141 140 139 137 136 135 132
 131 130 128 124 122 120  45  44 149  42  15  26  16  17  18  19  20  21
  22  43  23  24  35  34  33  32  31  30  29  28  27  37  13  25   9  10
   7   6   5   4   3   8  11   2   1  38  39  40  12 108 116  41 121  70
  14 123 125  36 127 126 134  83  72 133 129  52  57 119 138  89  76  50
  84 106  85  69  68  97  98  66  65  64  63  62  61  67  60  58  56  55
  54  53  51  59  71  73  75  96  95  94  93  92  91  90  88  87  86  82
  81  80  79  78  99  74]

My first question: it seems that SVC does not support predict_proba. Is it therefore correct to use proba = cross_val_predict(clf, X, y, cv=10, method='decision_function') instead?

My second question: how do I print the classes that the predicted probabilities correspond to? I tried clf.classes_, but I get an error saying AttributeError: 'SVC' object has no attribute 'classes_'. Is there a way to resolve this issue?

Note: I want to get the prediction probability for all the instances using cross-validation.

EDIT:

The answer from @KRKirov is great. However, I do not need GridSearchCV and only want to use normal cross-validation, so I changed his code to use cross_val_score. Now I am getting the error NotFittedError: Call fit before prediction.

Is there a way to resolve this issue?

I am happy to provide more details if needed.

Recommended answer

cross_val_predict is a function that does not return the classifier (in your case the SVC) as part of its output. It fits clones of the estimator internally, so the clf object you passed in is never fitted, and you cannot access fitted attributes such as classes_ on it.
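If the fitted classifiers themselves are needed, one option (not part of the original answer) is cross_validate with return_estimator=True, which returns the estimator fitted on each training split. The following is a minimal sketch, assuming a scikit-learn version that supports return_estimator; the names cv_results and fold_estimators are illustrative only.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# probability=True enables predict_proba on SVC (at extra training cost).
clf = SVC(gamma="scale", probability=True, class_weight="balanced")

# return_estimator=True keeps the classifier fitted on each training split.
cv_results = cross_validate(clf, X, y, cv=10, return_estimator=True)

# Illustrative name: one fitted SVC per fold.
fold_estimators = cv_results["estimator"]
print(len(fold_estimators))
print(fold_estimators[0].classes_)  # class labels learned by the first fold's model

# The estimator passed in is only cloned, so it remains unfitted.
print(hasattr(clf, "classes_"))
```

Each returned estimator was fitted on only part of the data, so this is mainly useful for inspecting per-fold models rather than for producing one final classifier.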

To perform cross-validation and calculate probabilities, use scikit-learn's GridSearchCV or RandomizedSearchCV. If you just want a simple cross-validation, pass a parameter dictionary with only one parameter. Once you have the probabilities, you can use either pandas or numpy to sort them by a particular class (class 1 in the example below).

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
import pandas as pd
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

parameters = {'kernel':(['rbf'])}
svc = SVC(gamma="scale", probability=True)
clf = GridSearchCV(svc, parameters, cv=10)
clf.fit(iris.data, iris.target)

probabilities = pd.DataFrame(clf.predict_proba(X), columns=clf.classes_)
probabilities['Y'] = iris.target
probabilities.columns.name = 'Classes'
probabilities.head()

# Sorting in ascending order by the probability of class 1. 
# Showing only the first five rows.
# Note that all information (indices, values) is in one place
probabilities.sort_values(1).head()
Out[49]: 
Classes         0         1         2  Y
100      0.006197  0.000498  0.993305  2
109      0.009019  0.001023  0.989959  2
143      0.006664  0.001089  0.992248  2
105      0.010763  0.001120  0.988117  2
144      0.006964  0.001295  0.991741  2

# Alternatively using numpy
indices = np.argsort(probabilities.values[:,1])
proba = probabilities.values[indices, :]

print(indices)
[100 109 143 105 144 122 135 118 104 107 102 140 130 117 120 136 132 131
 128 124 125 108  22 148 112  13 115  14  32  37  33 114  35  40  16   4
  42 103   2   0   6  36 139  19 145  38  17  47  48  28  49  15  46 129
  10  21   7  27  12  39   8  11   1   3   9  45  34 116  29 137   5  31
  26  30 141  43  18 111  25  20  41  44  24  23 147 134 113 101 142 110
 146 121 149  83 123 127  77 119 133 126 138  70  72 106  52  76  56  86
  68  63  54  98  50  84  66  85  78  91  73  51  57  58  93  55  87  75
  65  79  90  64  61  60  97  74  94  59  96  81  88  53  95  99  89  80
  71  82  69  92  67  62]

# Showing only the first five values of the sorted probabilities for class 1
print(proba[:5, 1])
[0.00049785 0.00102258 0.00108851 0.00112034 0.00129501]
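Note that clf.predict_proba(X) above scores the same data the final model was refit on. If out-of-fold probabilities are wanted, as the question asks, a possible alternative is to keep cross_val_predict and set probability=True on the SVC. The sketch below is an assumption about what the asker needs rather than part of the original answer; oof_proba, class_1_col, and ranking are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# probability=True is required for SVC to expose predict_proba.
clf = SVC(gamma="scale", probability=True, class_weight="balanced")

# Each row is predicted by a model that did not see that instance during fitting.
oof_proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")

# Columns follow the sorted class labels, so look the column up by label
# instead of reading clf.classes_ (clf itself is never fitted here).
classes = np.unique(y)
class_1_col = int(np.where(classes == 1)[0][0])

# Rank instances from most to least likely to belong to class 1.
ranking = np.argsort(-oof_proba[:, class_1_col])
print(ranking[:5])
print(oof_proba[ranking[:5], class_1_col])
```

Negating the column before np.argsort gives a descending ranking; drop the minus sign for the ascending order used in the answer above.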
