Unbalanced classification using RandomForestClassifier in sklearn

Question

I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class, and rebalance the weights accordingly, in sklearn with Random Forest, along the lines of the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

Recommended answer

You can pass a sample weights argument to the Random Forest fit method:

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

In older versions there was a preprocessing.balance_weights function that generated balancing weights for the given samples, such that the classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
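
For readers on a recent scikit-learn release, one possible replacement for the deprecated helper is compute_sample_weight from sklearn.utils.class_weight. This is not part of the original answer, and the toy label array below is an assumption made purely for illustration. A minimal sketch:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical labels with the 5:1 ratio from the question:
# 500 samples of class 1 and 100 samples of class 0.
y = np.array([1] * 500 + [0] * 100)

# "balanced" weights each sample by n_samples / (n_classes * class_count).
balanced = compute_sample_weight("balanced", y)

# The same formula written out by hand, for comparison.
counts = np.bincount(y)
manual = len(y) / (len(counts) * counts[y])
```

With these weights both classes carry the same total weight. The resulting array can be passed as sample_weight to fit; passing class_weight="balanced" to the RandomForestClassifier constructor is another option that should have a similar effect.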

Update

Some clarification, as you seem to be confused. Using sample_weight is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the 1-d sample_weight array is the weight for the corresponding (observation, label) pair. In your case, where class 1 is represented 5 times as often as class 0, you can balance the class distributions with a simple

sample_weight = np.array([5 if i == 0 else 1 for i in y])

This assigns a weight of 5 to all 0 instances and a weight of 1 to all 1 instances. See the link above for a slightly more elaborate balance_weights weight-evaluation function.
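
To tie this back to the question, here is a rough sketch of fitting a forest with those weights and then reading off a per-class error rate from a confusion matrix. The synthetic data, the train/test split and the choice of held-out error (rather than the out-of-bag error discussed on Breiman's page) are all assumptions made for illustration, not something the original answer prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical data: 500 samples of class 1, 100 of class 0 (5:1),
# with features shifted slightly by class so there is something to learn.
rng = np.random.RandomState(0)
y = np.array([1] * 500 + [0] * 100)
X = rng.normal(size=(len(y), 4)) + 0.75 * y[:, None]

# Stratify so both classes appear in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# One weight per training sample: 5 for the minority class 0, 1 for class 1.
sample_weight = np.array([5 if label == 0 else 1 for label in y_train])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# Rows of the confusion matrix are true classes, columns are predictions,
# so the error for a class is 1 - (correct in that row / row total).
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)

for label, err in zip([0, 1], per_class_error):
    print(f"class {label}: error rate {err:.3f}")
```

If the error for class 0 remains much higher than for class 1, the weight on class 0 can be raised and the model refit, which is the kind of iterative adjustment the linked page describes.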
