Random Forest with bootstrap = False in scikit-learn python

Question

What does RandomForestClassifier() do if we choose bootstrap = False?

According to the definition at this link:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

bootstrap : boolean, optional (default=True) Whether bootstrap samples are used when building trees.

I'm asking because I want to apply a Random Forest to a time series: train on a rolling window covering dates (t-n) to t, then predict date (t+k). I want to know whether the following is what happens when we choose True or False:

1) If bootstrap = True, the training samples can come from any day and have any number of features. For example, a tree could get samples from day (t-15), day (t-19) and day (t-35), each with randomly chosen features, and then predict the output of date (t+1).

2) If bootstrap = False, it's going to train on all the samples and all the features from date (t-n) to t, so it's actually going to respect the date order (meaning it's going to use t-35, t-34, t-33, ... until t-1), and then predict the output of date (t+1).

If this is how bootstrap works, I would be inclined to use bootstrap = False. Otherwise it would be a bit strange (think of financial series) to ignore the returns of consecutive days and jump from day t-39 to day t-19 and then to day t-15 to predict day t+1. We would be missing all the information between those days.

So... is this how Bootstrap works?

Recommended answer

It seems like you're conflating the bootstrap of your observations with the sampling of your features. An Introduction to Statistical Learning provides a really good introduction to Random Forests.

The benefit of random forests comes from building a large variety of trees by randomizing both the observations and the features each tree sees. The bootstrap parameter controls the observation side: with bootstrap = True, each tree is trained on a sample of the observations drawn with replacement. With bootstrap = False there is no sampling without replacement; per the scikit-learn documentation, the whole dataset is used to build each tree, and the only remaining randomness comes from the features considered at each split.
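A quick way to check what bootstrap does to the observations is to count how many distinct training rows reach the root of each tree. The sketch below uses synthetic data as a stand-in for a window of daily observations. It also assumes that scikit-learn implements the bootstrap through per-row sample weights and skips zero-weight rows when building a tree, which is an implementation detail and not a documented guarantee.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a window of daily observations (not real data).
rng = np.random.default_rng(0)
n_days, n_features = 200, 5
X = rng.normal(size=(n_days, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=n_days) > 0).astype(int)


def rows_seen_per_tree(bootstrap):
    """Return, for each tree, the number of distinct training rows at its root."""
    forest = RandomForestClassifier(
        n_estimators=20, bootstrap=bootstrap, random_state=0
    )
    forest.fit(X, y)
    # n_node_samples[0] is the sample count at the root node of each tree.
    return [int(tree.tree_.n_node_samples[0]) for tree in forest.estimators_]


rows_without_bootstrap = rows_seen_per_tree(bootstrap=False)
rows_with_bootstrap = rows_seen_per_tree(bootstrap=True)

print("bootstrap=False:", sorted(set(rows_without_bootstrap)))
print("bootstrap=True: ", min(rows_with_bootstrap), "to", max(rows_with_bootstrap))
```

With bootstrap = False every tree should report all 200 rows. With bootstrap = True each tree should report fewer, since drawing with replacement leaves some days out of each sample. In neither case does the tree use the order of the rows.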

You tell it how many features to sample by setting max_features, either to a fraction of the features or to an integer count (and this is something you would typically tune). In scikit-learn the feature subset is drawn at each split, not once per tree.
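As a small illustration of the ways max_features can be specified, the sketch below fits tiny forests on synthetic data and reads back the number of features each tree considers per split from the tree's `max_features_` attribute. The data and the particular settings are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 10 features (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] > 0).astype(int)

# An integer count, a fraction of the features, a named rule, and "use all".
resolved = {}
for setting in (3, 0.5, "sqrt", None):
    forest = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=0
    )
    forest.fit(X, y)
    # max_features_ is the inferred number of features considered per split.
    resolved[setting] = forest.estimators_[0].max_features_

print(resolved)
```

In practice the value would be chosen by validation, for example by comparing a few settings on a held-out later period of the series.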

With bootstrap = True, it's fine that each tree doesn't see every day when it is built - that's where the value of RF comes from. Each individual tree will be a pretty bad predictor, but when you average together the predictions from hundreds or thousands of trees you'll (probably) end up with a good model.
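The averaging step can be checked directly: for RandomForestClassifier, the forest's predicted class probabilities are the mean of the individual trees' predicted probabilities. The sketch below uses synthetic data with a made-up train/test split and compares the two, and also prints the mean accuracy of the single trees next to that of the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, split into an earlier "train" part and a later "test" part.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] + rng.normal(size=300) > 0).astype(int)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Average the per-tree class probabilities by hand.
tree_probas = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)
matches_forest = np.allclose(averaged, forest.predict_proba(X_test))


def accuracy(proba):
    """Accuracy of the most probable class against the test labels."""
    return float((forest.classes_[proba.argmax(axis=1)] == y_test).mean())


mean_tree_accuracy = float(np.mean([accuracy(p) for p in tree_probas]))
forest_accuracy = accuracy(averaged)

print("hand-averaged probabilities match the forest:", matches_forest)
print("mean single-tree accuracy:", round(mean_tree_accuracy, 3))
print("forest accuracy:", round(forest_accuracy, 3))
```

On most datasets the forest's accuracy is higher than the typical single tree's, though the size of the gap depends on the data.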
