R decision tree using all the variables
Problem Description
I would like to perform a decision tree analysis, and I want the decision tree to use all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
这是我的数据集示例
> head(d)
TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1 2 2 4 2 0 0 0
2 2 2 4 3 1 0 0
3 2 2 5 1 0 0 0
4 2 2 4 2 1 0 0
5 2 3 3 1 0 0 0
6 2 3 3 2 0 0 0
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
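Since the sample above shows the categorical variables stored as integer codes, it can help to convert them to factors before fitting; otherwise ctree() would treat them as numeric. A minimal sketch, using the column names from the sample (the data frame here is rebuilt from the `head(d)` output for illustration):

```r
# Sample rows reconstructed from head(d) in the question
d <- data.frame(
  TargetGroup2000  = c(2, 2, 2, 2, 2, 2),
  TargetGroup2012  = c(2, 2, 2, 2, 3, 3),
  SmokingGroup_Kai = c(4, 4, 5, 4, 3, 3),
  PA_Score         = c(2, 3, 1, 2, 1, 2),
  wheeze3          = c(0, 1, 0, 1, 0, 0)
)

# Convert the integer-coded columns to factors so that ctree()
# treats them as categorical rather than numeric
cat_vars <- c("TargetGroup2000", "TargetGroup2012",
              "SmokingGroup_Kai", "PA_Score", "wheeze3")
d[cat_vars] <- lapply(d[cat_vars], factor)
```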
Edit:
My problem is that some variables do not appear in the final decision tree.
The depth of the tree should be controlled by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
Answer
As mentioned above, if you want to run the tree on all the variables you should write it as
ctree(wheeze3 ~ ., d)
The penalty you mentioned is set in ctree_control(). There you can set the p-value threshold as well as the minimum split and bucket sizes. So, to maximize the chance that all the variables will be included, you should do something like this:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
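To also cover the plotting part of the question, here is a minimal end-to-end sketch with the party package (the data frame is rebuilt from the `head(d)` sample for illustration; in practice you would use your full dataset):

```r
library(party)  # provides ctree() and ctree_control()

# Sample rows reconstructed from head(d), with variables as factors
d <- data.frame(
  TargetGroup2000  = factor(c(2, 2, 2, 2, 2, 2)),
  TargetGroup2012  = factor(c(2, 2, 2, 2, 3, 3)),
  SmokingGroup_Kai = factor(c(4, 4, 5, 4, 3, 3)),
  PA_Score         = factor(c(2, 3, 1, 2, 1, 2)),
  wheeze3          = factor(c(0, 1, 0, 1, 0, 0))
)

# Relaxed stopping rules make it more likely that every variable is used,
# at the cost of a higher overfitting risk
fit <- ctree(wheeze3 ~ .,
             data = d,
             controls = ctree_control(mincriterion = 0.85,
                                      minsplit = 0,
                                      minbucket = 0))

print(fit)  # text summary of the splits
plot(fit)   # draws the tree
```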
The problem is that you will run the risk of overfitting.
The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they do not have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether they are significant, the decision tree does not return insignificant variables, i.e. it does not split on them.
For a better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees