How to calculate class weights for Random forests


Question

I have datasets for 2 classes on which I have to perform binary classification. I chose Random Forest as the classifier because it gives me the best accuracy among the models I tried. Dataset-1 has 462 data points and dataset-2 has 735. I noticed that my data has a minor class imbalance, so I tried to optimise my training and retrained the model with class weights. I provided the following class-weight values.

cwt <- c(0.385,0.614) # Class weights
ss <- c(300,300) # Sample size

I trained the model using the following code:

tr_forest <- randomForest(output ~ ., data = train,
                          ntree = nt, mtry = mt, importance = TRUE, proximity = TRUE,
                          maxnodes = mn, sampsize = ss, classwt = cwt,
                          keep.forest = TRUE, oob.prox = TRUE, oob.times = oobt,
                          replace = TRUE, nodesize = ns, do.trace = 1)

Using the chosen class weights has increased the accuracy of my model, but I am still doubtful whether my approach is correct or whether it is just a coincidence. How can I make sure my class weight choice is perfect?

I calculated the class weights using the following formula:

Class weight for positive class = (No. of data points in dataset-1) / (Total data points)

Class weight for negative class = (No. of data points in dataset-2) / (Total data points)

For dataset-1: 462/1197 = 0.385
For dataset-2: 735/1197 = 0.614
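The arithmetic above can be reproduced in a couple of lines of plain Python, using only the sample counts given in the question (the question's 0.385 is a truncation; rounded to three decimals it is 0.386):

```python
n1, n2 = 462, 735   # data points in dataset-1 and dataset-2
total = n1 + n2     # 1197

w_pos = n1 / total  # class weight for the positive class (dataset-1)
w_neg = n2 / total  # class weight for the negative class (dataset-2)

print(round(w_pos, 3), round(w_neg, 3))  # 0.386 0.614
```

By construction the two weights sum to 1, i.e. they are the class frequencies.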

Is this an acceptable method, and if not, why is it improving the accuracy of my model? Please help me understand the nuances of class weights.

Answer

How can I make sure my class weight choice is perfect?

Well, you certainly cannot - perfect is absolutely the wrong word here; what we are looking for are useful heuristics, which both improve performance and make sense (i.e. they don't feel like magic).

Given that, we do have an independent way of cross-checking your choice (which does seem sound), albeit in Python rather than R: the scikit-learn method compute_class_weight. We don't even need the exact data - only the sample numbers for each class, which you have already provided:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_1 = np.ones(462)     # dataset-1
y_2 = np.ones(735) + 1 # dataset-2
y = np.concatenate([y_1, y_2])
len(y)
# 1197

classes = np.array([1, 2])
# keyword arguments are required in recent scikit-learn versions
cw = compute_class_weight(class_weight='balanced', classes=classes, y=y)
cw
# array([ 1.29545455,  0.81428571])

Actually, these are your numbers multiplied by ~2.11, i.e.:

cw/2.11
# array([ 0.6139595,  0.3859174])
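There is nothing mysterious about the factor of ~2.11: scikit-learn's 'balanced' heuristic is documented as n_samples / (n_classes * n_c) for each class count n_c, which is the inverse class frequency up to the constant n_classes. A quick sketch reproducing the numbers above:

```python
import numpy as np

n = np.array([462, 735])              # per-class sample counts from the question
n_samples, n_classes = n.sum(), len(n)

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * n_c)
cw = n_samples / (n_classes * n)
print(cw)  # [1.29545455 0.81428571]

# After weighting, each class contributes equally to the total:
print(cw * n / n_samples)  # [0.5 0.5]
```

So dividing by n_classes * (class frequency product) only rescales; the ratio between the two weights is what matters.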

Looks good (multiplication by a constant does not affect the outcome), save for one detail: it seems that scikit-learn advises us to use your numbers transposed, i.e. a 0.614 weight for class 1 and 0.386 for class 2, instead of vice versa as per your computation.

We have just entered the subtleties of the exact definition of what a class weight actually is, which is not necessarily the same across frameworks and libraries. scikit-learn uses these weights to weight the misclassification cost differently, so it makes sense to assign a greater weight to the minority class; this was the very idea in a draft paper by Breiman (inventor of RF) and Andy Liaw (maintainer of the randomForest R package):

We assign a weight to each class, with the minority class given larger weight (i.e., higher misclassification cost).
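In scikit-learn this cost-weighting idea is exposed via the class_weight argument of RandomForestClassifier. A minimal sketch on synthetic data with the same 462/735 imbalance (the data here is made up purely for illustration; you can also pass class_weight='balanced' instead of an explicit dict):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic 2-class data with the question's 462/735 imbalance
X = np.vstack([rng.normal(0.0, 1.0, (462, 5)),
               rng.normal(0.5, 1.0, (735, 5))])
y = np.concatenate([np.ones(462), np.full(735, 2)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Larger weight on the minority class 1 (the 'balanced' values computed above)
clf = RandomForestClassifier(n_estimators=200,
                             class_weight={1: 1.295, 2: 0.814},
                             random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)
```

Here the weights multiply each sample's contribution to the impurity criterion, which is how a higher misclassification cost for the minority class is realised.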

Nevertheless, this is not what the classwt argument in the randomForest R method seems to be; from the docs:

classwt — Priors of the classes. Need not add up to one. Ignored for regression.

"类的先决条件"实际上是类存在的类比,即您在此处计算的确切含义;这种用法似乎是相关的(并获得高度赞誉的)SO线程的共识,;此外,Andy Liaw本人声明(强调我的):

"Priors of the classes" is in fact the analogy of the class presence, i.e. exactly what you have computed here; this usage seems to be the consensus of a related (and highly voted) SO thread, What does the parameter 'classwt' in RandomForest function in RandomForest package in R stand for?; additionally, Andy Liaw himself has stated that (emphasis mine):

The current "classwt" option in the randomForest package is different from how the official Fortran code (version 4 and later) implements class weights.

where the official Fortran implementation, I guess, was as described in the previous quotation from the draft paper (i.e. scikit-learn-like).

I used RF for imbalanced data myself during my MSc thesis ~6 years ago and, as far as I can remember, I found the sampsize parameter much more useful than classwt, against which Andy Liaw (again...) has advised (emphasis mine):

Search in the R-help archive to see other options and why you probably shouldn't use classwt.

What's more, in an already rather "dark" context regarding detailed explanations, it is not at all clear what exactly the effect is of using both the sampsize and classwt arguments together, as you have done here...

To summarize:

  • What you have done seems indeed correct and logical
  • You should try using the classwt and sampsize arguments in isolation (and not together), in order to be sure where your improved accuracy should be attributed
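The same two-experiment comparison can be sketched in Python. R's sampsize draws a fixed number of samples per class for each tree; a rough stand-in (the helper name and the synthetic data below are made up for illustration) is to downsample each class to an equal size before fitting, and compare that against a weights-only model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def downsample_per_class(X, y, n_per_class, seed=0):
    """Rough analogue of randomForest's sampsize: draw a fixed
    number of samples from each class (without replacement)."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(1)
X = rng.normal(size=(1197, 5))                       # placeholder features
y = np.concatenate([np.ones(462), np.full(735, 2)])  # the question's class sizes

# Experiment A: class weights only
rf_w = RandomForestClassifier(class_weight='balanced', random_state=0).fit(X, y)

# Experiment B: balanced per-class sampling only (300 per class, as in the question)
Xb, yb = downsample_per_class(X, y, n_per_class=300)
rf_s = RandomForestClassifier(random_state=0).fit(Xb, yb)
# Evaluate both on the same held-out set to see which change drives the gain.
```

Running each intervention separately against a common validation set makes it clear whether the improvement comes from the weighting or from the balanced sampling.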

