Unexpected exception when combining random forest trees


Question

Using the information described in this question, Combining random forest models in scikit learn, I have attempted to combine several random forest classifiers into a single classifier using Python 2.7.10 and sklearn 0.16.1, but in some cases I get this exception:

    Traceback (most recent call last):
      File "sktest.py", line 50, in <module>
        predict(rf)
      File "sktest.py", line 46, in predict
        Y = rf.predict(X)
      File "/python-2.7.10/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 462, in predict
        proba = self.predict_proba(X)
      File "/python-2.7.10/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 520, in predict_proba
        proba += all_proba[j]
    ValueError: non-broadcastable output operand with shape (39,1) doesn't match the broadcast shape (39,2)

The application is to create a number of random forest classifiers on many processors and combine these objects into a single classifier available to all processors.

The test code that produces this exception is shown below; it creates 5 classifiers, each trained on a randomly sized array of samples with 10 features. If yfrac is changed to 0.5, the code does not raise. Is this a valid method of combining classifier objects? The same exception also occurs when using warm_start to add trees to an existing RandomForestClassifier, by increasing n_estimators and adding data via fit.

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from numpy import zeros,random,logical_or,where,array

random.seed(1) 

def generate_rf(X_train, y_train, X_test, y_test, numTrees=50):
  rf = RandomForestClassifier(n_estimators=numTrees, n_jobs=-1)
  rf.fit(X_train, y_train)
  print "rf score ", rf.score(X_test, y_test)
  return rf

def combine_rfs(rf_a, rf_b):
  rf_a.estimators_ += rf_b.estimators_
  rf_a.n_estimators = len(rf_a.estimators_)
  return rf_a

def make_data(ndata, yfrac=0.5):
  nx = int(random.uniform(10,100))

  X = zeros((nx,ndata))
  Y = zeros(nx)

  for n in range(ndata):
    rnA = random.random()*10**(random.random()*5)
    X[:,n] = random.uniform(-rnA,rnA, nx)
    Y = logical_or(Y,where(X[:,n] > yfrac*rnA, 1.,0.))

  return X, Y

def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
  rfs = []
  for u in range(ntrain):
    X, Y = make_data(ndata, yfrac=yfrac)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_frac)

    #Train the random forest and add to list
    rfs.append(generate_rf(X_train, Y_train, X_test, Y_test))

  # Combine the block classifiers into a single classifier
  return reduce(combine_rfs, rfs)

def predict(rf, ndata=10):
  X, Y = make_data(ndata)
  Y = rf.predict(X)

if __name__ == "__main__":
  rf = train(yfrac = 0.42)
  predict(rf)
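For reference, the warm_start variant mentioned above can be sketched as follows (a minimal sketch against a current sklearn API; the toy data is hypothetical, not from the original test code). Raising n_estimators and calling fit again grows the existing forest instead of refitting it:

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.RandomState(0)
X1 = rng.uniform(-1, 1, size=(40, 10))
y1 = (X1[:, 0] > 0).astype(float)

# warm_start=True keeps the trees already fitted and only adds new ones
# when n_estimators is raised before the next fit().
rf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
rf.fit(X1, y1)

X2 = rng.uniform(-1, 1, size=(40, 10))
y2 = (X2[:, 0] > 0).astype(float)
rf.n_estimators = 20
rf.fit(X2, y2)  # adds 10 more trees, trained on the new batch only

assert len(rf.estimators_) == 20
```

If the second batch does not contain every class seen in the first (as can happen with yfrac = 0.42 above), the old and new trees disagree on the number of classes, and predict() fails with the same ValueError.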

Answer

Your first RandomForest only gets positive cases, while the other RandomForests get both classes. As a result, their DecisionTrees are incompatible with each other: a tree fitted on a single class produces probability output with one column, shape (n_samples, 1), while trees fitted on both classes produce shape (n_samples, 2), which is exactly the broadcast mismatch in the traceback. Run your code with this replaced train() function:

def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
  rfs = []
  for u in range(ntrain):
    X, Y = make_data(ndata, yfrac=yfrac)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_frac)

    assert Y_train.sum() != 0
    assert Y_train.sum() != len( Y_train )
    #Train the random forest and add to list
    rfs.append(generate_rf(X_train, Y_train, X_test, Y_test))

  # Combine the block classifiers into a single classifier
  return reduce(combine_rfs, rfs)
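The shape mismatch behind the traceback can be reproduced with plain NumPy (a minimal sketch; the shapes are taken from the traceback above, not from the original run): a tree fitted on one class contributes a (n, 1) probability array, and adding a (n, 2) array from a two-class tree fails in-place.

```python
import numpy as np

# predict_proba output from a tree that saw only one class: one column.
proba = np.ones((39, 1))
# Output from a tree that saw both classes: one column per class.
other = np.full((39, 2), 0.5)

try:
    # Mirrors `proba += all_proba[j]` inside forest.predict_proba():
    # the in-place add cannot broadcast a (39, 2) result into the
    # (39, 1) output operand.
    proba += other
except ValueError as e:
    print("ValueError:", e)
```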

Use a StratifiedShuffleSplit cross-validation generator rather than train_test_split, and check to make sure each RF gets both (all) classes in the training set.
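That check can be sketched like this (using the current sklearn.model_selection API; in sklearn 0.16.1 StratifiedShuffleSplit lives in sklearn.cross_validation with a different constructor signature, so treat the exact call here as an assumption, and the toy data as hypothetical):

```python
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

rng = np.random.RandomState(1)
X = rng.uniform(-1, 1, size=(50, 10))
Y = (X[:, 0] > 0).astype(float)

# Stratified splitting preserves class proportions, so every class
# present in Y also appears in the training fold.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, test_idx = next(sss.split(X, Y))
X_train, Y_train = X[train_idx], Y[train_idx]

assert set(np.unique(Y_train)) == set(np.unique(Y))
```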
