R decision tree using all the variables


Question

I would like to perform a decision tree analysis. I want the decision tree to use all the variables in the model.

I also need to plot the decision tree. How can I do that in R?

Here is a sample of my dataset:

> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0
> 

I would like to use the formula

myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score

Note that all the variables are categorical.
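Since the columns in the example above are stored as integers, it may help to coerce them to factors before fitting; otherwise the tree would treat them as numeric and split on thresholds rather than on category levels. A minimal sketch, assuming the data frame is named d as shown:

# Coerce every column, including the response wheeze3, to a factor.
d[] <- lapply(d, factor)
str(d)   # verify that all columns are now factors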

EDIT:
My problem is that some variables do not appear in the final decision tree. The depth of the tree should be defined by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.

In other words, I would like a model that minimizes the training error.
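For reference, in CART-style trees as implemented by the rpart package, this alpha corresponds to the complexity parameter cp: setting cp = 0 and relaxing the node-size limits grows the deepest possible tree and therefore minimizes the training error. A minimal sketch, assuming the data frame d from the example above, with the columns coerced to factors:

library(rpart)

# cp is rpart's cost-complexity penalty (the "alpha" of pruning).
# cp = 0 with minsplit = 2 and minbucket = 1 grows the full tree,
# which minimizes training error but will almost certainly overfit.
fit <- rpart(wheeze3 ~ TargetGroup2000 + TargetGroup2012 +
               SmokingGroup_Kai + PA_Score,
             data = d, method = "class",
             control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

plot(fit)
text(fit, use.n = TRUE)   # label the nodes with class counts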

Answer

As mentioned above, if you want to run the tree on all the variables you should write it as

ctree(wheeze3 ~ ., d)

The penalty you mentioned is located in ctree_control(). You can set the p-value threshold there, as well as the minimum split and bucket sizes. So, to maximize the chance that all the variables will be included, you should do something like this:

ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
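Putting it all together, a minimal end-to-end sketch, assuming the party package and the data frame d from the question (columns coerced to factors as above). Calling plot() on the fitted object also answers the plotting part of the question:

library(party)

# mincriterion = 0.85 lowers the bar for a split (it is 1 minus the
# required p-value); minsplit = 0 and minbucket = 0 remove the node
# size limits, so the tree can grow as deep as the data allows.
fit <- ctree(wheeze3 ~ ., data = d,
             controls = ctree_control(mincriterion = 0.85,
                                      minsplit = 0, minbucket = 0))

plot(fit)   # draws the fitted conditional inference tree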

The problem is that you'll run the risk of overfitting.

The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether they are significant, the decision tree does not return the insignificant variables, i.e., it doesn't split on them.
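To see that contrast concretely, here is a hedged sketch (assuming the same data frame d with a two-level factor response): logistic regression reports a coefficient and p-value for every predictor, even the ones the tree never splits on.

# Every predictor gets a p-value here, unlike in the tree output.
summary(glm(wheeze3 ~ TargetGroup2000 + TargetGroup2012 +
              SmokingGroup_Kai + PA_Score,
            data = d, family = binomial))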

For a better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
