Subsample size in scikit-learn RandomForestClassifier
Question
How is it possible to control the size of the subsample used for the training of each tree in the forest? According to the documentation of scikit-learn:
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
So bootstrap allows randomness in the sampling, but I can't find a way to control the size of the subsample.
Answer
Scikit-learn's RandomForestClassifier doesn't provide this option, but you can easily get the same behavior (in a somewhat slower form) by combining a decision tree with the bagging meta-classifier. (Note that newer scikit-learn versions, 0.22 and later, also expose a max_samples parameter directly on RandomForestClassifier.)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# max_samples=0.5: each tree is trained on a random half of the samples.
# (The estimator is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2.)
clf = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5)
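A minimal sketch of the combination in use, assuming a synthetic dataset from `make_classification` (the dataset sizes and `n_estimators` value are illustrative choices, not from the original answer). The fitted ensemble's `estimators_samples_` attribute lets you verify that each tree really saw only `max_samples * n_samples` rows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset: 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 50 trees, each trained on a bootstrap sample of 0.5 * 200 = 100 rows.
clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.5,
    random_state=0,
)
clf.fit(X, y)

# estimators_samples_ gives, per tree, the indices it was trained on
# (drawn with replacement, so duplicates are counted).
sizes = [len(idx) for idx in clf.estimators_samples_]
print(set(sizes))  # every tree was fit on 100 sampled rows
```

Setting `bootstrap=False` on the BaggingClassifier instead draws the subsample without replacement, which is the other common flavor of subsampling.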
As a side note, Breiman's random forest indeed doesn't treat the subsample size as a parameter; it relies entirely on the bootstrap, so each tree's training set contains approximately a fraction 1 - 1/e (about 63.2%) of the distinct original samples.
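The 1 - 1/e figure follows from the fact that each of the n original points is missed by a size-n bootstrap draw with probability (1 - 1/n)^n → 1/e. A quick empirical check (sample size chosen arbitrarily for illustration):

```python
import numpy as np

# Draw a bootstrap sample of size n (with replacement) and measure
# what fraction of the original points appears at least once.
rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)        # indices drawn with replacement
frac_unique = np.unique(sample).size / n   # fraction of distinct points
print(round(frac_unique, 3))               # close to 1 - 1/e ≈ 0.632
```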