Scikit - changing the threshold to create multiple confusion matrices

Views: 274  Published: 2020/10/2 3:09:22  Tags: scikit-learn, classification, random-forest, threshold, confusion-matrix

Question

I'm building a classifier that goes through Lending Club data and selects the best X loans. I've trained a random forest and created the usual ROC curves, confusion matrices, etc.

The confusion matrix takes as an argument the predictions of the classifier (the majority prediction of the trees in the forest). However, I want to print multiple confusion matrices at different thresholds, to see what happens if I choose the 10% best loans, the 20% best loans, etc.

I know from reading other questions that changing the threshold is often a bad idea, but is there any other way to see confusion matrices for these situations? (question A)

If I go ahead with changing the threshold, should I assume that the best way to do so is to call predict_proba and then threshold the result by hand, passing that to the confusion matrix? (question B)

Recommended answer

A. In your case, changing the threshold is admissible and maybe even necessary. The default threshold is 50%, but from a business point of view even a 15% probability of non-repayment might be enough to reject such an application.

In fact, in credit scoring it is common to set different cut-offs for different product terms or customer segments, after predicting the probability of default with a common model (see e.g. chapter 9 of "Credit Risk Scorecards" by Naeem Siddiqi).
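
As a rough illustration of segment-specific cut-offs, here is a minimal sketch. The probabilities, the segment names "A" and "B", and the cut-off values are all made up for the example; they do not come from the book or from any real portfolio.

```python
import numpy as np

# Hypothetical predicted probabilities of default from one common model.
p_default = np.array([0.05, 0.12, 0.18, 0.30, 0.08, 0.22])
# Hypothetical segment label for each applicant (illustrative names only).
segment = np.array(["A", "A", "A", "B", "B", "B"])

# Assumed cut-offs: reject when the predicted default probability exceeds them.
cutoffs = {"A": 0.15, "B": 0.25}

# Look up each applicant's cut-off, then compare element-wise.
applicant_cutoff = np.array([cutoffs[s] for s in segment])
reject = p_default > applicant_cutoff

print(reject)
```

The model is scored once; only the comparison changes per segment, so cut-offs can be revised without retraining.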

B. There are two convenient ways to threshold at an arbitrary alpha instead of 50%:

1. Call predict_proba and threshold the result at alpha manually, or with a wrapper class (see the code below). Use this if you want to try multiple thresholds without refitting the model.
2. Set class_weight to weight the two classes alpha and 1-alpha, e.g. {0: alpha, 1: 1 - alpha}, before fitting the model.
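
The second option might look like the sketch below. In scikit-learn the parameter is spelled `class_weight` and takes a dict keyed by class label; the value `alpha = 0.3` and the variable names are assumptions made for illustration. For a random forest the weights only approximately reproduce a probability threshold at alpha, so treat the result as a rough counterpart of manual thresholding, not an exact one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

alpha = 0.3  # assumed target threshold on the class-1 probability

# Weight class 0 by alpha and class 1 by 1 - alpha, so that class 1
# counts for more during fitting when alpha < 0.5.
weighted_rf = RandomForestClassifier(
    class_weight={0: alpha, 1: 1 - alpha}, random_state=1
).fit(X_train, y_train)

# predict() still returns the most probable class; any shift in the
# decision comes from the weights used while fitting.
cm_weighted = confusion_matrix(y_test, weighted_rf.predict(X_test))
print(cm_weighted)
```

Unlike the wrapper approach, every new alpha requires refitting the model.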

And now, sample code for the wrapper:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.base import BaseEstimator, ClassifierMixin

# Toy binary classification data, split into train and test sets
X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

class CustomThreshold(BaseEstimator, ClassifierMixin):
    """Custom threshold wrapper for binary classification."""
    def __init__(self, base, threshold=0.5):
        self.base = base            # any classifier exposing predict_proba
        self.threshold = threshold  # cut-off on the class-1 probability
    def fit(self, *args, **kwargs):
        self.base.fit(*args, **kwargs)
        return self
    def predict(self, X):
        # Predict 1 when the class-1 probability exceeds the threshold
        return (self.base.predict_proba(X)[:, 1] > self.threshold).astype(int)

# Fit once, then wrap the same fitted forest with three thresholds
rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
clf = [CustomThreshold(rf, threshold) for threshold in [0.3, 0.5, 0.7]]

for model in clf:
    print(confusion_matrix(y_test, model.predict(X_test)))

# A 0.5 threshold reproduces the forest's own predictions; a lower
# threshold yields more positives and a higher one fewer
assert (clf[1].predict(X_test) == clf[1].base.predict(X_test)).all()
assert sum(clf[0].predict(X_test)) > sum(clf[0].base.predict(X_test))
assert sum(clf[2].predict(X_test)) < sum(clf[2].base.predict(X_test))

It will output 3 confusion matrices, one per threshold (0.3, 0.5 and 0.7); the exact counts may vary with the scikit-learn version:

[[13  1]
 [ 2  9]]
[[14  0]
 [ 3  8]]
[[14  0]
 [ 4  7]]
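
The question asks about picking the best 10% or 20% of loans, not about a fixed probability. One way to get there, sketched below under assumptions, is to rank the test rows by predicted probability and mark the top fraction as selected. The fractions, the variable names, and the reading of class 1 as "good loan" are illustrative choices, and the toy data from `make_classification` stands in for real loan data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Score every test row once; column 1 is the probability of class 1,
# assumed here to mean "good loan".
scores = rf.predict_proba(X_test)[:, 1]
order = np.argsort(-scores)  # row indices, highest score first

matrices = {}
counts = {}
for fraction in [0.1, 0.2, 0.5]:
    # Number of rows to select, at least one
    counts[fraction] = max(1, int(fraction * len(scores)))
    selected = np.zeros(len(scores), dtype=int)
    selected[order[:counts[fraction]]] = 1  # mark exactly the top rows
    matrices[fraction] = confusion_matrix(y_test, selected, labels=[0, 1])
    print(f"top {fraction:.0%}: {counts[fraction]} selected")
    print(matrices[fraction])
```

Ranking by position, not by a quantile of the scores, keeps the number of selected rows exact even when many rows share the same predicted probability, which is common with a forest's averaged votes.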
