Change threshold value for Random Forest classifier

Problem description

I need to develop a model which will be free (or close to free) of false negatives. To do so, I plotted the precision-recall curve and determined that the threshold value should be set to 0.11.

My question is: how do I define the threshold value at model training time? There's no point in defining it later at evaluation time, because it won't be applied to new data.

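import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# X and y are the feature matrix and binary labels (not shown in the question)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

rfc_model = RandomForestClassifier(random_state=101)
rfc_model.fit(X_train, y_train)
rfc_preds = rfc_model.predict(X_test)

# Class probabilities do not depend on the threshold, so compute them once
predicted_proba = rfc_model.predict_proba(X_test)

recall_precision_vals = []

for val in np.linspace(0, 1, 101):
    # Label a sample positive when P(class 1) reaches the candidate threshold
    predicted = (predicted_proba[:, 1] >= val).astype('int')

    recall_sc = recall_score(y_test, predicted)
    precis_sc = precision_score(y_test, predicted)

    recall_precision_vals.append({
        'Threshold': val,
        'Recall val': recall_sc,
        'Precis val': precis_sc
    })

recall_prec_df = pd.DataFrame(recall_precision_vals)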

Any ideas?

Solution

how to define threshold value upon model training?

There is simply no threshold during model training. Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which do require a threshold, are neither produced nor used at any stage of model training; they appear only during prediction, and even then only when a hard classification is actually required (which is not always the case). See "Predict classes or class probabilities?" for more details.

In fact, the scikit-learn implementation of RF doesn't employ a threshold at all, even for hard class prediction. Reading the docs for the predict method closely:

the predicted class is the one with highest mean probability estimate across the trees

In simple words, this means that the actual RF output is [p0, p1] (assuming binary classification), from which the predict method simply returns the class with the highest value, i.e. 1 if p1 > p0 and 0 otherwise (an exact tie resolves to class 0).
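
The relationship between predict and predict_proba can be checked directly. The following is a minimal sketch on synthetic data generated with make_classification; the variable names (proba, argmax_preds, same_as_predict) are illustrative only, and the check simply recomputes the hard labels by taking the argmax of the averaged class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)

# Averaged class probabilities, one row [p0, p1] per sample
proba = clf.predict_proba(X)

# Pick the class with the highest probability in each row
argmax_preds = clf.classes_[np.argmax(proba, axis=1)]

# Compare against the labels returned by predict
same_as_predict = np.array_equal(argmax_preds, clf.predict(X))
print(same_as_predict)
```

If the two arrays match, no separate threshold parameter is involved in predict: for two classes, taking the argmax is equivalent to an implicit cutoff of 0.5 on p1.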

Assuming that what you actually want is to return 1 whenever p1 is greater than some threshold below 0.5, you have to ditch predict, use predict_proba instead, and then manipulate the returned probabilities to get what you want. Here is an example with dummy data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)

clf.fit(X, y)

Here, simply using predict for, say, the first element of X, will give 0:

clf.predict(X)[0] 
# 0

because

clf.predict_proba(X)[0]
# array([0.85266881, 0.14733119])

i.e. p0 > p1.

To get what you want (i.e. here returning class 1, since p1 > threshold for a threshold of 0.11), here is what you have to do:

prob_preds = clf.predict_proba(X)
threshold = 0.11  # define threshold here
preds = [1 if p1 > threshold else 0 for p1 in prob_preds[:, 1]]

After this, it is easy to see that for the first sample we now have:

preds[0]
# 1

since, as shown above, for this sample we have p1 = 0.14733119 > threshold.
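
To address the concern about new data: the threshold is just a number applied at prediction time, so the same cutoff can be reused on any later batch. Below is a minimal sketch under that assumption. The helper predict_with_threshold is hypothetical (it is not part of scikit-learn), the data is synthetic, and the held-out split X_new stands in for unseen data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def predict_with_threshold(model, X, threshold):
    """Hypothetical helper: return 1 where P(class 1) exceeds threshold, else 0."""
    p1 = model.predict_proba(X)[:, 1]  # column 1 corresponds to model.classes_[1]
    return (p1 > threshold).astype(int)


# Synthetic data; X_new / y_new play the role of data seen only after training
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)

default_preds = clf.predict(X_new)                          # implicit 0.5 cutoff
low_threshold_preds = predict_with_threshold(clf, X_new, threshold=0.11)

recall_default = recall_score(y_new, default_preds)
recall_low = recall_score(y_new, low_threshold_preds)
print(recall_default, recall_low)
```

Lowering the cutoff can only turn predicted 0s into predicted 1s, so recall on the positive class cannot decrease, usually at the cost of precision. Recent scikit-learn releases also ship wrappers for this purpose, but the plain helper above works with any version that provides predict_proba.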
