Use of scikit Random Forest sample_weights


Problem description

I've been trying to figure out scikit's Random Forest sample_weight use and I cannot explain some of the results I'm seeing. Fundamentally I need it to balance a classification problem with unbalanced classes.

In particular, I was expecting that if I used a sample_weights array of all 1's I would get the same result as with sample_weights=None. Additionally, I was expecting that any array of equal weights (i.e. all 1s, all 10s, or all 0.8s...) would provide the same result. Perhaps my intuition of weights is wrong in this case.

代码如下:

import numpy as np
from sklearn import ensemble, metrics, datasets

#create a synthetic dataset with unbalanced classes
X,y = datasets.make_classification(
n_samples=10000, 
n_features=20, 
n_informative=4, 
n_redundant=2, 
n_repeated=0, 
n_classes=2, 
n_clusters_per_class=2, 
weights=[0.9],
flip_y=0.01,
class_sep=1.0, 
hypercube=True, 
shift=0.0, 
scale=1.0, 
shuffle=True, 
random_state=0)

model = ensemble.RandomForestClassifier()  #no random_state set, so results vary between runs

w0=1 #weight associated to 0's
w1=1 #weight associated to 1's

#I should split train and validation but for the sake of understanding sample_weights I'll skip this step
model.fit(X, y,sample_weight=np.array([w0 if r==0 else w1 for r in y]))    
preds = model.predict(X)
probas = model.predict_proba(X)
ACC = metrics.accuracy_score(y,preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y,preds)
print "ACCURACY:", ACC
print "ROC:", ROC
print "F1 Score:", metrics.f1_score(y,preds)
print "TP:", cm[1,1], cm[1,1]/(cm.sum()+0.0)
print "FP:", cm[0,1], cm[0,1]/(cm.sum()+0.0)
print "Precision:", cm[1,1]/(cm[1,1]+cm[0,1]*1.1)
print "Recall:", cm[1,1]/(cm[1,1]+cm[1,0]*1.1)

    • With w0=w1=1 I get, for instance, F1=0.9456.
    • With w0=w1=10 I get, for instance, F1=0.9569.
    • With sample_weights=None I get F1=0.9474.
Answer

      With the Random Forest algorithm, there is, as the name implies, some "Random"ness to it.

      You are getting different F1 score because the Random Forest Algorithm (RFA) is using a subset of your data to generate the decision trees, and then averaging across all of your trees. I am not surprised, therefore, that you have similar (but non-identical) F1 scores for each of your runs.
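The run-to-run variation can be removed by fixing the forest's random_state, in which case uniform weights of 1 should reproduce sample_weight=None exactly (a sketch under default settings; the dataset parameters here are chosen just for speed):

```python
import numpy as np
from sklearn import ensemble, datasets

X, y = datasets.make_classification(n_samples=500, n_features=10,
                                    weights=[0.9], random_state=0)

# same seed, uniform weights of 1 vs. no weights at all
m1 = ensemble.RandomForestClassifier(n_estimators=20, random_state=0)
m1.fit(X, y, sample_weight=np.ones(len(y)))

m2 = ensemble.RandomForestClassifier(n_estimators=20, random_state=0)
m2.fit(X, y)

# with the randomness pinned, the two fits make identical predictions
same = (m1.predict(X) == m2.predict(X)).all()
print(same)
```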

      I have tried balancing the weights before. You may want to try balancing the weights by the size of each class in the population. For example, if you were to have two classes as such:

      Class A: 5 members
      Class B: 2 members
      

      You may wish to balance the weights by assigning 2/7 for each of Class A's members and 5/7 for each of Class B's members. That's just an idea as a starting place, though. How you weight your classes will depend on the problem you have.
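sklearn can compute such class-balanced per-sample weights directly via sklearn.utils.class_weight.compute_sample_weight. With 'balanced' it weights each sample by n_samples / (n_classes * class_count), which reproduces the 2:5 ratio suggested above (just on a different scale):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([0] * 5 + [1] * 2)  # Class A: 5 members, Class B: 2 members

# 'balanced' gives each class total weight n_samples / n_classes:
# class 0 -> 7 / (2 * 5) = 0.7 per sample, class 1 -> 7 / (2 * 2) = 1.75
w = compute_sample_weight('balanced', y)
print(w)
```

The resulting array can be passed straight to model.fit(X, y, sample_weight=w).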
