R decision tree using all the variables


Question

I would like to perform a decision tree analysis. I want the decision tree to use all the variables in the model.

I also need to plot the decision tree. How can I do that in R?

Here is a sample of my dataset:

> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0
> 

I would like to use the formula

myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score

Note that all the variables are categorical.
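Since the columns in the example above are stored as integers, it may help to coerce them to factors before fitting; otherwise the tree would treat them as numeric and split on thresholds rather than on category levels. A minimal sketch, assuming the data frame is named d as shown:

# Coerce every column, including the response wheeze3, to a factor.
d[] <- lapply(d, factor)
str(d)   # verify that all columns are now factors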

EDIT:
My problem is that some variables do not appear in the final decision tree. The depth of the tree should be defined by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.

In other words, I would like a model that minimizes the training error.
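For reference, in CART-style trees as implemented by the rpart package, this alpha corresponds to the complexity parameter cp: setting cp = 0 and relaxing the node-size limits grows the deepest possible tree and therefore minimizes the training error. A minimal sketch, assuming the data frame d from the example above, with the columns coerced to factors:

library(rpart)

# cp is rpart's cost-complexity penalty (the "alpha" of pruning).
# cp = 0 with minsplit = 2 and minbucket = 1 grows the full tree,
# which minimizes training error but will almost certainly overfit.
fit <- rpart(wheeze3 ~ TargetGroup2000 + TargetGroup2012 +
               SmokingGroup_Kai + PA_Score,
             data = d, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

plot(fit)
text(fit, use.n = TRUE)   # label the nodes with class counts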

Answer

As mentioned above, if you want to run the tree on all the variables you should write it as

ctree(wheeze3 ~ ., d)

The penalty you mentioned is located in ctree_control(). You can set the p-value threshold there, as well as the minimum split and bucket sizes. So, to maximize the chance that all the variables will be included, you should do something like this:

ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
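Putting it all together, a minimal end-to-end sketch, assuming the party package and the data frame d from the question (columns coerced to factors as above). Calling plot() on the fitted object also answers the plotting part of the question:

library(party)

# mincriterion = 0.85 lowers the bar for a split (it is 1 minus the
# required p-value); minsplit = 0 and minbucket = 0 remove the node
# size limits, so the tree can grow as deep as the data allows.
fit <- ctree(wheeze3 ~ ., data = d,
             controls = ctree_control(mincriterion = 0.85,
                                      minsplit = 0, minbucket = 0))

plot(fit)   # draws the fitted conditional inference tree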

The problem is that you'll run the risk of overfitting.

The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether they are significant, the decision tree does not return the insignificant variables, i.e., it doesn't split on them.
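To see that contrast concretely, here is a hedged sketch (assuming the same data frame d with a two-level factor response): logistic regression reports a coefficient and p-value for every predictor, even the ones the tree never splits on.

# Every predictor gets a p-value here, unlike in the tree output.
summary(glm(wheeze3 ~ TargetGroup2000 + TargetGroup2012 +
              SmokingGroup_Kai + PA_Score,
            data = d, family = binomial))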

For a better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
