Random Forest with classes that are very unbalanced

Question

I am using random forests on a big data problem with a very unbalanced response class, so I read the documentation and found the following parameters:

strata 

sampsize

The documentation for these parameters is sparse (or I didn't have the luck to find it), and I really don't understand how to use them. I am using the following code:

randomForest(x=predictors, 
             y=response, 
             data=train.data, 
             mtry=lista.params[1], 
             ntree=lista.params[2], 
             na.action=na.omit, 
             nodesize=lista.params[3], 
             maxnodes=lista.params[4],
             sampsize=c(250000,2000), 
             do.trace=100, 
             importance=TRUE)

The response is a class with two possible values; the first appears far more frequently than the second (10000:1 or more).

lista.params is a list holding the different parameters (duh! I know...)

Well, the question (again) is: how can I use the 'strata' parameter? Am I using sampsize correctly?

And finally, sometimes I get the following error:

Error in randomForest.default(x = predictors, y = response, data = train.data,  :
  Still have fewer than two classes in the in-bag sample after 10 attempts.

Sorry if I am asking so many (and maybe stupid) questions...

Recommended answer

You should try sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment I am repeating these recommendations only from memory, but I will see if I can track down more authority than my spongy cortex.)

One way of reducing the size of the trees is to set "nodesize" larger. With that degree of imbalance you might need to make the node size really large, say 5-10,000. Here's a thread on r-help: https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html
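As a rough illustration of the nodesize point, the sketch below fits two small forests on simulated data and compares the average number of terminal nodes per tree. The data, the variable names (x, y, small_nodes, large_nodes) and the value nodesize = 500 are assumptions chosen only to keep the toy example fast; they are not taken from the question.

```r
library(randomForest)

set.seed(1)                       # reproducible toy data
n <- 5000
x <- data.frame(a = rnorm(n), b = rnorm(n))
# Simulated unbalanced response: "rare" occurs in only a few percent of rows
y <- factor(ifelse(x$a + rnorm(n) > 2.5, "rare", "common"))

# nodesize = 1 is the package default for classification (fully grown trees)
small_nodes <- randomForest(x, y, ntree = 50, nodesize = 1)
# A much larger nodesize stops splitting early and gives coarser trees
large_nodes <- randomForest(x, y, ntree = 50, nodesize = 500)

# treesize() returns the number of terminal nodes of each tree in the forest
mean(treesize(small_nodes))
mean(treesize(large_nodes))
```

On real data the right nodesize depends on the number of rows and the class ratio, so it is worth trying several values and comparing the out-of-bag error for the rare class.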

In the current state of the question you have sampsize=c(250000,2000), whereas I would have thought that something like sampsize=c(8000,2000) was more in line with my suggestions. I think you are creating samples that contain none of the group that was sampled with only 2000.
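A minimal sketch of how strata and sampsize can be combined is shown below, assuming simulated data with a 100:1 imbalance and a 4:1 per-tree sample (the same ratio as 8000:2000, scaled down). The names predictors, response and per_class, the class labels and all sizes are illustrative assumptions, not the asker's data.

```r
library(randomForest)

set.seed(42)                       # reproducible toy data
n_major <- 20000                   # size of the frequent class (assumed)
n_minor <- 200                     # size of the rare class (assumed)
predictors <- data.frame(
  x1 = c(rnorm(n_major, 0), rnorm(n_minor, 1.5)),
  x2 = c(rnorm(n_major, 0), rnorm(n_minor, 1.5))
)
response <- factor(c(rep("common", n_major), rep("rare", n_minor)))

# One in-bag size per class, listed in the order of levels(response)
per_class <- c(common = 400, rare = 100)

# Guard: do not request more rows from a class than it contains
stopifnot(all(per_class <= table(response)[names(per_class)]))

fit <- randomForest(x = predictors, y = response,
                    strata = response,     # stratify the sampling by class
                    sampsize = per_class,  # rows drawn per class, per tree
                    ntree = 200)

print(fit$confusion)               # out-of-bag confusion matrix
```

Each tree then sees both classes in a fixed proportion, which is the down-sampling effect the answer recommends. The per-class sizes are a tuning choice to compare on out-of-bag error.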
