What is the difference between rel error and x error in an rpart decision tree?


Question


I have a purely categorical dataframe from the UCI machine learning database https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

I am using rpart to build a decision tree based on a new category indicating whether patients return within 30 days (a new Failed category).

I am using the following parameters for my decision tree

    tree_model <- rpart(
        Failed ~ race + gender + age + time_in_hospital + medical_specialty +
            num_lab_procedures + num_procedures + num_medications +
            number_outpatient + number_emergency + number_inpatient +
            number_diagnoses + max_glu_serum + A1Cresult + metformin +
            glimepiride + glipizide + glyburide + pioglitazone +
            rosiglitazone + insulin + change,
        method = "class",
        data = training_data,
        control = rpart.control(minsplit = 2, cp = 0.0001, maxdepth = 20, xval = 10),
        parms = list(split = "gini")
    )

Printing the results yields:

       CP     nsplit rel error  xerror     xstd
1 0.00065883      0   1.00000  1.0000   0.018518
2 0.00057648      8   0.99424  1.0038   0.018549
3 0.00025621     10   0.99308  1.0031   0.018543
4 0.00020000     13   0.99231  1.0031   0.018543

I see that the relative error goes down as the decision tree branches off, but the xerror goes up. I don't understand this, as I would have expected the error to decrease the more branches there are and the more complex the tree becomes.

I take it that the xerror is the most important, since most tree-pruning methods cut the tree back from the root.

Can someone explain to me why the xerror is what is focused on when pruning the tree? And when we summarise what the error of the decision tree classifier is, is the error 0.99231 or 1.0031?

Solution

The x-error is the cross-validation error (rpart has built-in cross-validation). You use the three columns rel error, xerror and xstd together to help you choose where to prune the tree.
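For concreteness, here is a minimal sketch (assuming the tree_model fitted in the question) of how to get at those columns; printcp() and the cptable component are standard parts of rpart:

    library(rpart)

    # Print the complexity-parameter table shown in the question
    printcp(tree_model)

    # The same numbers live on the fitted object as a matrix with
    # columns "CP", "nsplit", "rel error", "xerror" and "xstd"
    cp_table <- as.data.frame(tree_model$cptable)

    # Row with the smallest cross-validation error
    cp_table[which.min(cp_table$xerror), ]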

Each row represents a different height of the tree. In general, more levels in the tree mean lower classification error on the training data. However, you run the risk of overfitting. Often, the cross-validation error will actually grow as the tree gets more levels (at least after the 'optimal' level).

A rule of thumb is to choose the lowest level where rel error + xstd < xerror; a sketch of applying this rule follows below.
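As an illustration, here is one way that rule could be applied programmatically. This is a sketch of my reading of the rule, not an rpart built-in, though prune() and the cptable columns are real rpart interfaces:

    cp_table <- as.data.frame(tree_model$cptable)

    # Levels where training error plus one cross-validation standard
    # error is still below the cross-validation error
    candidates <- which(cp_table$`rel error` + cp_table$xstd < cp_table$xerror)

    # Prune at the shallowest such level, if any row qualifies
    # (in the question's table no row does, so nothing would be pruned)
    if (length(candidates) > 0) {
      pruned_model <- prune(tree_model, cp = cp_table$CP[min(candidates)])
    }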

If you run plotcp on your output it will also show you the optimal place to prune the tree.
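For example (plotcp() is part of rpart; it plots xerror against cp and draws a dotted horizontal line one xstd above the minimum, the usual one-standard-error pruning threshold):

    # Visualise cross-validation error against tree size; sizes whose
    # error dips below the dotted line are good candidates for pruning
    plotcp(tree_model)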

Also, see here.
