How to use the whole training set to estimate class probabilities in sklearn RandomForest

Question

I want to use scikit-learn's RandomForestClassifier to estimate the probability that a given example belongs to each of a set of classes, after prior training of course.

I know I can get the class probabilities using the predict_proba method, which calculates them as

[...] the mean predicted class probabilities of the trees in the forest.
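
The averaging described in that quote can be checked directly. The following is a minimal sketch on a synthetic dataset (the data and parameter values are illustrative assumptions, not from the question): it compares the forest's `predict_proba` output with the mean of the per-tree outputs taken from `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data, used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Probabilities as returned by the forest itself.
forest_proba = forest.predict_proba(X)

# The same quantity computed by hand: average the per-tree probabilities.
tree_mean = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

assert np.allclose(forest_proba, tree_mean)
```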

As mentioned in this question:

The probabilities returned by a single tree are the normalized class histograms of the leaf a sample lands in.
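
To make the "normalized class histogram of the leaf" concrete, here is a small sketch with a single DecisionTreeClassifier on synthetic data (dataset and depth are assumptions chosen for illustration). It looks up the leaf each sample lands in with `apply`, reads that leaf's stored class values from `tree_.value`, normalizes them, and compares the result with `predict_proba`. Depending on the scikit-learn version, `tree_.value` holds counts or already-normalized fractions; normalizing each row covers both cases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data, used only for illustration.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Index of the leaf node each sample ends up in.
leaf_ids = tree.apply(X)

# Per-class values stored at those leaves (single-output tree, so index 0).
leaf_values = tree.tree_.value[leaf_ids, 0, :]

# Normalize each row so it sums to 1: the leaf's class histogram.
leaf_hist = leaf_values / leaf_values.sum(axis=1, keepdims=True)

assert np.allclose(leaf_hist, tree.predict_proba(X))
```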

Now, I've been reading some papers on probability estimation and realized there isn't a trivial solution. According to Estimating Class Probabilities in Random Forests (Boström):

using the same examples to both grow the trees and estimate the probabilities, [...] by necessity will lead to pure (and therefore small) estimation sets

And this is bad. The solution appears to be to use all the examples in the training set, instead of only the ones in the bootstrap sample used to grow the tree.

Scikit-learn does use only the bootstrap sample for each tree to calculate the probability estimate of each class, right? Does somebody have any pointers about how to proceed to make the class probabilities come from the whole training set of the RandomForest instead?

I assume this would need some special Tree subclassing that doesn't assign class probabilities to the leaves of the trees and then some procedure to assign them from the RandomForest classifier using the whole training set.
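
One way to try this idea without subclassing anything is to leave the fitted trees untouched and rebuild the leaf histograms outside the forest. The sketch below is an assumption about how that could look, not scikit-learn's own mechanism: `full_train_leaf_tables` and `predict_proba_full` are hypothetical helper names, and the data is synthetic. It uses `forest.apply` to find the leaf each training row reaches in every tree, counts classes per leaf over the whole training set, and then averages the normalized histograms over trees at prediction time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def full_train_leaf_tables(forest, X_train, y_train):
    """Hypothetical helper: per-tree class counts per node, from ALL training rows."""
    class_idx = np.searchsorted(forest.classes_, y_train)  # labels -> column index
    leaves = forest.apply(X_train)  # shape (n_samples, n_trees)
    tables = []
    for t, est in enumerate(forest.estimators_):
        counts = np.zeros((est.tree_.node_count, len(forest.classes_)))
        # Add 1 to (leaf, class) for every training row, handling repeats.
        np.add.at(counts, (leaves[:, t], class_idx), 1.0)
        tables.append(counts)
    return tables


def predict_proba_full(forest, tables, X):
    """Hypothetical helper: average the re-estimated leaf histograms over trees."""
    leaves = forest.apply(X)
    proba = np.zeros((X.shape[0], len(forest.classes_)))
    for t, counts in enumerate(tables):
        hist = counts[leaves[:, t]]
        proba += hist / hist.sum(axis=1, keepdims=True)
    return proba / len(tables)


# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
tables = full_train_leaf_tables(forest, X_tr, y_tr)
proba_full = predict_proba_full(forest, tables, X_te)
```

Every leaf was created from at least one in-bag training row, so each leaf's count over the full training set is nonzero and the normalization is safe. Whether these estimates are actually better calibrated than the built-in ones would need to be checked on held-out data.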

Recommended answer

Scikit-learn does use only the bootstrap sample for each tree to calculate the probability estimate of each class, right?

No, it uses only the in-sample part, and therefore will not give well-calibrated probability outputs (which I guess is what the paper suggests).

You could get better probability estimates using the out-of-sample estimates, and maybe that would even be done easily with the current code base. Maybe it would be better to use a calibration method as post-processing (using the out-of-bag samples).
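
Both suggestions can be approximated with existing scikit-learn tools. The sketch below, on synthetic data chosen for illustration, first reads the out-of-bag probability estimates that a forest exposes as `oob_decision_function_` when fitted with `oob_score=True`. It then applies `CalibratedClassifierCV` as a post-processing step. Note the substitution: `CalibratedClassifierCV` calibrates on cross-validation folds, not on the out-of-bag samples the answer mentions, so this is a stand-in for that idea and not an implementation of it.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Out-of-bag estimates: each training row is scored only by trees
# whose bootstrap sample did not contain it.
forest = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
).fit(X, y)
oob_proba = forest.oob_decision_function_  # shape (n_samples, n_classes)

# Calibration as post-processing (cross-validated, NOT out-of-bag).
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="sigmoid",
    cv=5,
).fit(X, y)
cal_proba = calibrated.predict_proba(X)
```

With very few trees, some rows may never be out of bag and their `oob_decision_function_` entries can be NaN, which is why the sketch uses a larger forest.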

Anyhow, what you want to achieve is the default.
