How to implement decision trees in boosting

Question

I'm implementing AdaBoost (boosting) that will use CART and C4.5 as the weak learners. I have read about AdaBoost, but I can't find a good explanation of how to combine AdaBoost with decision trees. Say I have a data set D with n examples, and I split D into TR training examples and TE testing examples. Say TR.count = m, so I set each weight to 1/m, then I use TR to build a tree, test it on TR to find the wrongly classified examples, and test on TE to calculate the error. Then I change the weights. Now, how do I get the next training set? What kind of sampling should I use (with or without replacement)? I know the new training set should focus more on the misclassified samples, but how can I achieve this? That is, how will CART or C4.5 know that they should focus on the examples with greater weight?

Answer

As far as I know, the TE data set is not meant to be used to estimate the error rate. The raw data can be split into two parts (one for training, the other for cross-validation). There are mainly two methods for applying the weights to the training data distribution, and which one to use is determined by the weak learner you choose.

  • Re-sample the training data set without replacement. This method can be viewed as a weighted boosting method: the resampled data set contains the misclassified instances with higher probability than the correctly classified ones, which forces the weak learning algorithm to concentrate on the misclassified data (see the resampling sketch after this list).

  • Directly use the weights when learning. Such models include Bayesian classification, decision trees (C4.5 and CART), and so on. With respect to C4.5, we calculate the information gain (mutual information) to determine which predictor is selected as the next node, so we can combine the weights with the entropy to estimate the measure. For example, view the weights as the probabilities of the samples in the distribution. Given X = [1,2,3,3] with weights [3/8, 1/16, 3/16, 6/16], the plain entropy of X is -0.25 log(0.25) - 0.25 log(0.25) - 0.5 log(0.5); with the weights taken into account (the two weights on the value 3 pool to 3/16 + 6/16 = 9/16), the weighted entropy is -(3/8) log(3/8) - (1/16) log(1/16) - (9/16) log(9/16). In general, plain C4.5 is the special case of this weighted entropy with weights [1, 1, ..., 1]/N. If you want to implement AdaBoost.M1 with C4.5, you should read page 339 of The Elements of Statistical Learning (worked sketches of the entropy computation and the boosting loop follow this list).
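A minimal sketch of the first method, assuming numpy arrays; the helper name resample_training_set is mine, not from the answer. Note that the common AdaBoost-by-resampling variant draws with replacement; to sample without replacement as described above, pass replace=False together with a size smaller than n:

```python
import numpy as np

def resample_training_set(X, y, weights, size=None, replace=True, seed=None):
    """Draw a new training set in which each example's chance of being
    picked is proportional to its current boosting weight."""
    rng = np.random.default_rng(seed)
    n = len(y)
    size = n if size is None else size
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                        # normalize to a distribution
    idx = rng.choice(n, size=size, replace=replace, p=p)
    return X[idx], y[idx]
```

The weak learner is then trained on the resampled set with an ordinary (unweighted) CART or C4.5; since the misclassified examples appear with higher probability, the split criterion automatically pays more attention to them.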
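The weighted entropy in the second method can be checked numerically. This sketch (the function name weighted_entropy is mine) reproduces the X = [1,2,3,3] example above, using base-2 logarithms:

```python
import numpy as np

def weighted_entropy(x, weights):
    """Entropy of x where each observation carries a weight; the
    weights of identical values are pooled into one probability."""
    x = np.asarray(x)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    probs = [w[x == v].sum() for v in np.unique(x)]
    return -sum(p * np.log2(p) for p in probs if p > 0)

x = [1, 2, 3, 3]

# Uniform weights give the plain entropy:
# -0.25*log(0.25) - 0.25*log(0.25) - 0.5*log(0.5) = 1.5 bits
print(weighted_entropy(x, [1, 1, 1, 1]))

# Boosting weights pool to P(1)=3/8, P(2)=1/16, P(3)=9/16:
# -(3/8)log(3/8) - (1/16)log(1/16) - (9/16)log(9/16) ≈ 1.2476 bits
print(weighted_entropy(x, [3/8, 1/16, 3/16, 6/16]))
```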
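Putting the pieces together, here is a minimal sketch of the "directly use the weights" route for the binary case with labels in {-1, +1}. It uses scikit-learn's DecisionTreeClassifier (a CART implementation) only as an illustration, because its fit method accepts a sample_weight argument, which makes the impurity computation weighted exactly as described above; the loop follows the standard AdaBoost.M1 update, not code from the answer:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=50, max_depth=1):
    """AdaBoost.M1 with weighted CART stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)               # initial weights 1/m, as in the question
    learners, alphas = [], []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)   # the tree sees the weights directly
        miss = tree.predict(X) != y
        err = w[miss].sum() / w.sum()     # weighted training error
        if err >= 0.5:                    # weak learner no better than chance
            break
        alpha = np.log((1.0 - err) / max(err, 1e-10))
        w = w * np.exp(alpha * miss)      # up-weight the misclassified examples
        w = w / w.sum()
        learners.append(tree)
        alphas.append(alpha)
    return learners, alphas

def boosted_predict(learners, alphas, X):
    """Weighted majority vote of the weak learners."""
    votes = sum(a * t.predict(X) for a, t in zip(learners, alphas))
    return np.sign(votes)
```

With sample_weight, the tree's impurity (Gini or entropy) is computed from weighted counts, so the splits automatically favor the heavily weighted examples; no explicit resampling is needed in this variant.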
