How can the SciKit-Learn Random Forest sub-sample size be equal to the original training data size?

Question

In the documentation of the SciKit-Learn Random Forest classifier, it is stated that

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I don't understand is: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all of the (and naturally the same) samples at each training.

Am I missing something here?

Answer

I believe this part of the docs answers your question:

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
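To make that concrete, here is a minimal sketch (using scikit-learn's public API; the toy dataset below is invented for illustration) of where the two sources of randomness appear as parameters: bootstrap controls the resampling of rows for each tree, and max_features controls the random subset of features tried at each split:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy dataset, only for illustration.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    clf = RandomForestClassifier(
        n_estimators=100,
        bootstrap=True,       # default: each tree is fit on a bootstrap sample of len(X) rows
        max_features="sqrt",  # each split considers only a random subset of the features
        random_state=0,
    )
    clf.fit(X, y)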

The key to understanding is in "sample drawn with replacement". This means that each instance can be drawn more than once. This in turn means that some instances from the training set are present several times in a tree's bootstrap sample and some are not present in it at all (out-of-bag). These are different for different trees.
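A quick numerical sketch in plain NumPy (the sample size of 1000 is made up) shows the effect: a bootstrap sample of size n drawn with replacement from n instances contains only about 63% of the distinct instances, each possibly repeated, while the remaining ~37% are out-of-bag:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                  # size of the training set
    indices = np.arange(n)

    # A bootstrap sample: n draws WITH replacement from the n instances.
    bootstrap = rng.choice(indices, size=n, replace=True)

    n_unique = np.unique(bootstrap).size
    print("distinct instances in the bootstrap sample:", n_unique)  # roughly 632
    print("out-of-bag instances:", n - n_unique)                    # roughly 368

The roughly 63% figure is the expected fraction 1 - (1 - 1/n)^n ≈ 1 - 1/e of distinct instances that end up in a bootstrap sample of size n; each tree draws its own sample, so its repeated and out-of-bag instances differ from those of the other trees.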
