What is out of bag error in Random Forests?
Question
What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
Answer
I will try to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is an input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
RF in a nutshell:
The Random Forests algorithm is a classifier based primarily on two methods:
- Bagging
- Random subspace method.
Suppose we decide to have S trees in our forest. We first create S datasets, each of the "same size as original", by randomly resampling the data in T with replacement (n draws for each dataset). This results in datasets {T1, T2, ... TS}. Each of these is called a bootstrap dataset. Because sampling is "with replacement", every dataset Ti can contain duplicate data records, and Ti can be missing several data records from the original dataset. This is called Bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
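The resampling step above can be sketched in a few lines of Python; the dataset contents here are placeholder (Xi, yi) pairs, not real data:

```python
import random

def bootstrap(T):
    """Draw one bootstrap dataset: n records sampled from T with replacement."""
    n = len(T)
    return [random.choice(T) for _ in range(n)]

# Toy training set of (Xi, yi) pairs; the values are placeholders.
T = [("X1", "y1"), ("X2", "y2"), ("X3", "y3"), ("X4", "y4")]
S = 3  # number of trees, and hence of bootstrap datasets
bootstraps = [bootstrap(T) for _ in range(S)]  # {T1, T2, T3}
```

Each Ti in `bootstraps` has the same size as T, but because of the with-replacement draws it typically repeats some records and omits others.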
Bagging is the process of taking bootstrap samples and then aggregating the models learned on each bootstrap.
Now, RF creates S trees and uses m (= sqrt(M) or = floor(ln M + 1)) random subfeatures out of the M possible features to create each tree. This is called the random subspace method.
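Drawing the random feature subset for one tree, as described above, can be sketched like this (the value of M is an arbitrary example):

```python
import math
import random

M = 16                         # total number of features (example value)
m = int(math.sqrt(M))          # one common choice; floor(ln(M) + 1) is another
all_features = list(range(M))  # feature indices 0..M-1
subspace = random.sample(all_features, m)  # m distinct features for one tree
```

Note that some random forest variants re-draw such a subset at every split rather than once per tree; the per-tree version shown here matches the random subspace method as described in this answer.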
So for each bootstrap dataset Ti you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM}, you let it pass through each tree, producing S outputs (one per tree), which can be denoted by Y = {y1, y2, ..., yS}. The final prediction is a majority vote over this set.
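The majority vote can be sketched as follows; the stand-in "trees" here are hypothetical constant classifiers, used only to show the aggregation step:

```python
from collections import Counter

def predict(trees, D):
    """Collect one vote per tree (Y = {y1, ..., yS}) and return the majority class."""
    votes = [tree(D) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in trees: each just returns a fixed class label.
trees = [lambda D: "A", lambda D: "B", lambda D: "A"]
result = predict(trees, {"x1": 0.5})  # majority of {"A", "B", "A"} is "A"
```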
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi, yi) in the original training set T, select all Tk which do not include (Xi, yi). Note that this subset is a set of bootstrap datasets which do not contain a particular record from the original dataset. This set is called the out-of-bag examples. There are n such subsets (one for each data record in the original dataset T). The OOB classifier is the aggregation of votes ONLY over the Tk that do not contain (Xi, yi).
The out-of-bag estimate of the generalization error is the error rate of the out-of-bag classifier on the training set (comparing its predictions with the known yi's).
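Putting the pieces together, the OOB bookkeeping can be sketched with a toy forest. To keep the example short, each "tree" is replaced by a trivial stub that predicts the majority label of its own bootstrap; that stub is an assumption for illustration only, and the part being demonstrated is the vote aggregation over trees that did not see a given record:

```python
import random
from collections import Counter

random.seed(42)
T = [((i,), "A" if i % 3 else "B") for i in range(12)]  # toy (Xi, yi) records
n, S = len(T), 25

# "Train": each stub tree remembers which record indices were in its bootstrap
# and predicts the majority label of that bootstrap for every input.
forest = []
for _ in range(S):
    idx = [random.randrange(n) for _ in range(n)]        # bootstrap indices
    majority = Counter(T[i][1] for i in idx).most_common(1)[0][0]
    forest.append((set(idx), majority))

# OOB estimate: for each record, aggregate votes ONLY from trees whose
# bootstrap did not include it, then compare with the known yi.
errors = 0
for i, (_, yi) in enumerate(T):
    votes = [pred for in_bag, pred in forest if i not in in_bag]
    if votes and Counter(votes).most_common(1)[0][0] != yi:
        errors += 1
oob_error = errors / n
```

On average a record is absent from about 37% of the bootstraps, so with 25 trees each record usually has several OOB votes available.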
Why is it important?
The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence to show that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
(Thanks @Rudolf for corrections.)