R Random Forests Variable Importance

Question

I am trying to use the random forests package for classification in R.

The Variable Importance Measures listed are:

  • mean raw importance score of variable x for class 0
  • mean raw importance score of variable x for class 1
  • MeanDecreaseAccuracy
  • MeanDecreaseGini

Now I know what these "mean", as in I know their definitions. What I want to know is how to use them.

What I really want to know is what these values mean only in the context of how accurate they are: what is a good value, what is a bad value, what are the maximums and minimums, and so on.

If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini, does that mean it is important or unimportant? Also, any information on raw scores could be useful. I want to know everything there is to know about these numbers that is relevant to applying them.

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.

Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.

Recommended answer

An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.

Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.

How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly, the parameters and related performance issues of Random Forests are difficult to get your head around even if you understand some of the technical terms.

Here's my shot at some answers:

-mean raw importance score of variable x for class 0

-mean raw importance score of variable x for class 1

Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.

-MeanDecreaseAccuracy

I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
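
To make "reduces classification error" a bit more concrete, here is a minimal Python sketch of the general idea; it is not the R package's code. It assumes a toy stand-in model and made-up data (`toy_model`, `rows`, and `labels` are hypothetical names): scramble one column, re-score the model, and see how much accuracy is lost. As far as I understand, the R package does something along these lines for each tree on the samples that tree did not train on and then averages, so treat this as an illustration only.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_drop(model, rows, labels, column, rng):
    """Accuracy lost when the values in one column are shuffled across rows."""
    baseline = accuracy(model, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    scrambled = [row[:column] + [value] + row[column + 1:]
                 for row, value in zip(rows, shuffled)]
    return baseline - accuracy(model, scrambled, labels)

rng = random.Random(0)
# Hypothetical data: column 0 determines the class, column 1 is pure noise.
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

def toy_model(row):
    # Stand-in for a fitted forest (an assumption); it only looks at column 0.
    return int(row[0] > 0.5)

drop_signal = permutation_drop(toy_model, rows, labels, 0, rng)
drop_noise = permutation_drop(toy_model, rows, labels, 1, rng)
print(f"column 0: {drop_signal:.2f}, column 1: {drop_noise:.2f}")
```

A larger drop means the model leans on that variable more; a drop near zero means the variable could be scrambled without hurting the predictions.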

-MeanDecreaseGini

Gini is defined as "inequality" when used in describing a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A low Gini (i.e. a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
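
For what it's worth, the "decrease in Gini" at a single split can be worked out by hand. The Python sketch below is an illustration under my own assumptions (the function names and the eight-label example are made up, and this is not the R package's code): it compares the impurity of a node with the size-weighted impurity of the two groups a split produces.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 two-class node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node holding four cases of class 0 and four of class 1.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
good = gini_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])  # split separates the classes
poor = gini_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])  # split separates nothing
print(good, poor)
```

As I understand it, MeanDecreaseGini adds up decreases like these over every split that uses a variable and averages them over the trees, so a larger total points to a variable that separates the classes more cleanly.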
