How to use classwt in randomForest of R?

Question

I have a highly imbalanced data set whose target class instances are in the ratio 60000:1000:1000:50 (i.e. 4 classes in total). I want to use randomForest to predict the target class.

To reduce the class imbalance, I experimented with the sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but it did not help much. In fact, the accuracy on the 1st class decreased as I changed sampsize, while the improvement in the predictions for the other classes was very small.

While digging through the archives, I came across two more arguments of randomForest(), strata and classwt, which are used to offset the class imbalance issue.

All the documents I found on classwt were old (generally from 2007 and 2008), and all of them advised against using the classwt feature of the randomForest package in R, because it did not fully implement the functionality available in the Fortran version. So the first question is:
Is classwt fully implemented now in the randomForest package of R? If yes, what does passing c(1, 10, 10, 10) to the classwt argument represent? (Assuming the above case of 4 classes in the target variable.)

Another approach that is said to offset the class imbalance issue is stratified sampling, which is always used in conjunction with sampsize. I understand what sampsize is from the documentation, but I could not find enough documentation or examples that give a clear insight into using strata to overcome class imbalance. So the second question is:
What type of argument has to be passed to strata in randomForest, and what does it represent?

I guess the word "weight", which I have not explicitly mentioned in the question, should play a major role in the answer.

Recommended answer

classwt is correctly passed on to randomForest; check this example, where an extreme weight on the third class makes every observation get predicted as virginica:

library(randomForest)
rf = randomForest(Species~., data = iris, classwt = c(1E-5,1E-5,1E5))
rf

#Call:
# randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05)) 
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 2
#
#        OOB estimate of  error rate: 66.67%
#Confusion matrix:
#           setosa versicolor virginica class.error
#setosa          0          0        50           1
#versicolor      0          0        50           1
#virginica       0          0        50           0

Class weights are the priors on the outcomes. You need to balance them to achieve the results you want.
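As a rough illustration of what a vector such as c(1, 10, 10, 10) means in practice, the sketch below fits a forest with and without class weights on a deliberately imbalanced subset of iris. The subset, the seed, and the weights c(1, 10, 10) are all assumptions made for the example, not the asker's data or recommended values. The weights are given one per class, in the order of the factor levels of the response.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical imbalanced data: all 50 setosa rows,
# but only 10 versicolor rows and 5 virginica rows
keep <- c(which(iris$Species == "setosa"),
          which(iris$Species == "versicolor")[1:10],
          which(iris$Species == "virginica")[1:5])
imb <- droplevels(iris[keep, ])
table(imb$Species)

# One weight per class, in the order of levels(imb$Species)
wts <- c(1, 10, 10)

rf_plain    <- randomForest(Species ~ ., data = imb)
rf_weighted <- randomForest(Species ~ ., data = imb, classwt = wts)

# Compare the per-class out-of-bag error with and without the weights
rf_plain$confusion
rf_weighted$confusion
```

Whether a given set of weights helps depends on the data, so it is worth comparing the class.error column of the two confusion matrices for a few candidate weight vectors rather than relying on one choice.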

On strata and sampsize, this answer might be of help: https://stackoverflow.com/a/20151341/2874779

In general, using the same sampsize for every class seems reasonable. strata is a factor that is used for stratified resampling; in your case you don't need to supply anything for it.
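The equal-sampsize suggestion can be sketched as follows, again on a hypothetical imbalanced subset of iris rather than the asker's data. Here sampsize has one entry per class, so each tree draws the same number of rows from every class; strata is passed explicitly only to make the stratification factor visible, and the per-class size (the size of the smallest class) is an assumption chosen for the example.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical imbalanced data: 50 setosa, 10 versicolor, 5 virginica
keep <- c(which(iris$Species == "setosa"),
          which(iris$Species == "versicolor")[1:10],
          which(iris$Species == "virginica")[1:5])
imb <- droplevels(iris[keep, ])

# Draw as many rows from each class as the smallest class contains
n_min <- min(table(imb$Species))

rf_bal <- randomForest(x = imb[, 1:4], y = imb$Species,
                       strata = imb$Species,   # factor to stratify on
                       sampsize = rep(n_min, nlevels(imb$Species)))

# Per-class out-of-bag error under balanced per-tree samples
rf_bal$confusion
```

With the asker's 60000:1000:1000:50 ratio, the analogous call would use a four-element sampsize; the smallest class caps how many rows can be drawn per class if all entries are kept equal.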
