Unbalanced classification using RandomForestClassifier in sklearn

Question

I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and rebalance the weights accordingly in sklearn with Random Forest, similar to what is described at the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

Answer

You can pass a sample_weight argument to the Random Forest fit method:

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

In older versions there was a preprocessing.balance_weights method that generated balancing weights for given samples, such that the classes become uniformly distributed. At the time of writing it is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
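For readers on a current scikit-learn release, a similar balancing effect can be sketched with `sklearn.utils.class_weight.compute_sample_weight`. Choosing this function as a stand-in for balance_weights is an assumption, not something the answer itself names, and the toy label array `y` below is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy labels with a 5:1 ratio of class 1 to class 0.
y = np.array([1] * 50 + [0] * 10)

# "balanced" weights each sample by n_samples / (n_classes * count(its class)),
# so the rarer class receives the larger per-sample weight.
sample_weight = compute_sample_weight(class_weight="balanced", y=y)

# After weighting, both classes carry the same total weight.
weight_class_0 = sample_weight[y == 0].sum()
weight_class_1 = sample_weight[y == 1].sum()
print(weight_class_0, weight_class_1)
```

The resulting array has one entry per sample and can be passed as `sample_weight` to `fit`.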

Update

Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the 1-d sample_weight array represents the weight of the corresponding (observation, label) pair. In your case, if class 1 is represented 5 times as often as class 0 and you want to balance the class distributions, you could use a simple

sample_weight = np.array([5 if i == 0 else 1 for i in y])

assigning a weight of 5 to all 0 instances and a weight of 1 to all 1 instances. See the link above for a somewhat more elaborate balance_weights weight evaluation function.
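The weighting described above can be sketched end to end as follows, together with one possible way of getting a per-class prediction error. The synthetic dataset, the train/test split, and the use of a confusion matrix on held-out data are all assumptions made for illustration; the answer itself only specifies the sample_weight line.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 5:1 ratio of class 1 to class 0 (illustrative only).
X, y = make_classification(
    n_samples=1200, n_features=10, weights=[1 / 6, 5 / 6], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Weight 5 for every class-0 sample, weight 1 for every class-1 sample.
sample_weight = np.array([5 if label == 0 else 1 for label in y_train])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# Rows of the confusion matrix are true classes, columns are predictions.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])

# Per-class error: fraction of each true class that was misclassified.
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)
print(dict(zip([0, 1], per_class_error)))
```

If the error for class 0 stays much higher than for class 1, the weight for class 0 can be raised and the model refit, which is the kind of adjustment the Breiman page linked in the question describes.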
