Unbalanced classification using RandomForestClassifier in sklearn
Question
I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do I calculate the prediction error for each class and rebalance the weights accordingly in sklearn with Random Forest, similar to what is described at the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
Recommended answer
You can pass a sample_weight argument to the Random Forest fit method:
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.
In older versions there was a preprocessing.balance_weights function that generated balancing weights for the given samples, so that the classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
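In more recent scikit-learn releases, balanced weights of this kind can be obtained from sklearn.utils.class_weight.compute_sample_weight. The following is a minimal sketch, assuming a reasonably recent scikit-learn version; the toy label vector y is invented for illustration and is not from the question:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels (an assumption for illustration): class 1 is 5x as frequent as class 0.
y = np.array([1] * 10 + [0] * 2)

# "balanced" gives each sample the weight n_samples / (n_classes * count_of_its_class),
# so every class ends up with the same total weight.
sample_weight = compute_sample_weight(class_weight="balanced", y=y)

total_weight_class_0 = sample_weight[y == 0].sum()
total_weight_class_1 = sample_weight[y == 1].sum()
print(total_weight_class_0, total_weight_class_1)
```

The resulting array can be passed as fit(X, y, sample_weight=sample_weight). Passing class_weight="balanced" to the RandomForestClassifier constructor is an alternative that avoids building the array by hand.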
Update
Some clarification, as you seem to be confused. Using sample_weight is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the 1-d sample_weight array represents the weight of the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0 and you want to balance the class distribution, you could simply use
sample_weight = np.array([5 if i == 0 else 1 for i in y])
assigning a weight of 5 to all class 0 instances and a weight of 1 to all class 1 instances. See the link above for the slightly more sophisticated balance_weights weight-evaluation function.
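The question also asks for the prediction error of each class. One way to get it (a sketch, not necessarily the procedure from the Breiman page) is to read per-class error rates off a confusion matrix computed on held-out data. The synthetic data, the train/test split and all variable names below are assumptions made for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (illustrative only): 500 samples of class 1, 100 of class 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 4)),
               rng.normal(1.0, 1.0, size=(100, 4))])
y = np.array([1] * 500 + [0] * 100)

# Stratify so that both classes appear in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Weight 5 for the minority class 0 and weight 1 for class 1, as in the answer.
sample_weight = np.array([5 if label == 0 else 1 for label in y_train])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])

# Per-class error = fraction of that class's test samples that were misclassified.
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)
print(dict(zip([0, 1], per_class_error)))
```

Comparing per_class_error with and without the sample_weight argument shows whether the reweighting actually reduces the error on the minority class for a given dataset.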