Help Understanding Cross Validation and Decision Trees

Question

I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially, Cross Validation allows you to alternate between training and testing when your dataset is relatively small, so that you make the best use of the data when estimating your error. A very simple algorithm goes something like this:

  1. Decide how many folds you want (k)
  2. Subdivide your dataset into k folds
  3. Use k-1 folds as a training set to build a tree
  4. Use the held-out fold as a test set to estimate statistics about the error in your tree
  5. Save your results for later
  6. Repeat steps 3-5 k times, leaving out a different fold for your test set
  7. Average the errors across your iterations to predict the overall error (sketched in code below)
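
For concreteness, here is a minimal sketch of steps 1-7 in Python, assuming scikit-learn's DecisionTreeClassifier as the tree learner and its bundled iris data as a stand-in dataset (both are illustrative choices, not part of the question):

    # Sketch of the k-fold procedure above; scikit-learn and the iris
    # dataset are assumptions made for the sake of a runnable example.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    k = 5                                                 # step 1: pick k
    kf = KFold(n_splits=k, shuffle=True, random_state=0)  # step 2: make folds
    fold_errors = []

    for train_idx, test_idx in kf.split(X):               # repeat for each fold
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[train_idx], y[train_idx])              # step 3: train on k-1 folds
        error = 1.0 - tree.score(X[test_idx], y[test_idx])  # step 4: test-fold error
        fold_errors.append(error)                         # step 5: save for later

    print("estimated error: %.3f" % np.mean(fold_errors)) # step 7: average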

The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal errors (although that doesn't make it optimal, just that it performed best on the fold it was given - maybe using stratification will help, but everything I've read says it only helps a little bit).

As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across the k trees when each tree could vary in what it chooses to split on, etc.?

What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.

Any help with this little wrinkle would be much appreciated.

Answer

"The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick?"

The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error ratio and the deviation relative to this average, which can be useful in asserting the level of precision one can expect from the application. One of the things cross validation can help assert is whether the training data is big enough.
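
As an illustration, and assuming scikit-learn again (the answer itself names no library), the qualification step boils down to reporting a mean error and its spread across folds rather than keeping any single fitted tree:

    # Hypothetical illustration of qualifying the model: the outputs are
    # an average error ratio and its deviation, not any one fitted tree.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    errors = 1.0 - scores  # per-fold error ratios

    print("average error ratio: %.3f" % errors.mean())
    print("deviation from the average: %.3f" % errors.std())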

With regard to selecting a particular tree, you should instead run yet another training pass on 100% of the training data available, as this will typically produce a better tree. (The downside of the cross-validation approach is that we need to divide the [typically small] amount of training data into "folds", and as you hint in the question this can lead to trees that are either overfit or underfit for particular data instances.)
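
A sketch of that recommendation, under the same scikit-learn assumption: once cross validation has qualified the model, the k per-fold trees are discarded and the tree actually deployed is trained on all of the data:

    # The k trees built during cross validation are thrown away; the final
    # tree is fit on 100% of the available training data.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    final_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(final_tree.predict(X[:3]))  # e.g. predictions from the final tree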

In the case of decision trees, I'm not sure what your reference to statistics gathered in the node and used to prune the tree pertains to. Maybe a particular use of cross-validation-related techniques?...
