Using Python generators in scikit-learn

Question

I was wondering whether, and how, it is possible to use a Python generator as the data input to the .fit() method of a scikit-learn classifier. Because of the huge amount of data involved, this seems to make sense to me.

In particular, I am about to implement a random forest approach.

Regards, K

Recommended answer

The answer is "no". To do out-of-core learning with random forests, you should

  1. split your data into reasonably sized batches (restricted by the amount of RAM you have; bigger is better);
  2. train a separate random forest on each batch;
  3. append all the underlying trees together in the estimators_ member of one of the forests (untested):

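# Merge the trees of every other forest into the first one (untested).
for forest in forests[1:]:
    forests[0].estimators_.extend(forest.estimators_)
forests[0].n_estimators = len(forests[0].estimators_)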
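The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions, not a supported scikit-learn feature: the helper names `batches`, `fit_merged_forest` and `synthetic_rows` are hypothetical, the data is synthetic, and the merge relies on `estimators_` being a plain list. It also assumes every batch contains the same set of classes, because each forest encodes labels against its own `classes_`.

```python
import itertools

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def batches(rows, batch_size):
    """Yield (X, y) arrays of at most batch_size rows from an
    iterator of (features, label) pairs (hypothetical helper)."""
    rows = iter(rows)
    while True:
        chunk = list(itertools.islice(rows, batch_size))
        if not chunk:
            return
        X = np.array([features for features, _ in chunk])
        y = np.array([label for _, label in chunk])
        yield X, y


def fit_merged_forest(rows, batch_size, n_estimators=10, random_state=0):
    """Train one forest per batch, then pool all trees into the first forest."""
    forests = []
    for i, (X, y) in enumerate(batches(rows, batch_size)):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state + i
        )
        forests.append(forest.fit(X, y))
    if not forests:
        raise ValueError("the generator produced no data")

    merged = forests[0]
    for forest in forests[1:]:
        # Assumption: all batches saw identical classes; otherwise the
        # trees' encoded labels would not line up after merging.
        if not np.array_equal(forest.classes_, merged.classes_):
            raise ValueError("batches contain different sets of classes")
        merged.estimators_.extend(forest.estimators_)
    # Keep the n_estimators parameter consistent with the pooled tree list.
    merged.n_estimators = len(merged.estimators_)
    return merged


def synthetic_rows(n_rows, seed=0):
    """Generator of made-up (features, label) pairs, for demonstration only."""
    rng = np.random.RandomState(seed)
    for _ in range(n_rows):
        x = rng.rand(4)
        yield x, int(x[0] > 0.5)


merged = fit_merged_forest(synthetic_rows(300), batch_size=100, n_estimators=5)
predictions = merged.predict(np.array([[0.9, 0.5, 0.5, 0.5]]))
```

Only one batch is held in memory at a time while reading from the generator, although all fitted forests are kept until they are merged.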

(Yes, this is hacky, but no solution to this problem has been found yet. Note that with very large datasets, it might pay to just sample a number of training examples that fits in the RAM of a big machine instead of training on all of the data. Another option is to switch to linear models trained with SGD, which implement a partial_fit method, but they are obviously limited in the kind of functions they can learn.)
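For the SGD alternative, a generator fits naturally, because partial_fit can be called once per batch. The following is a minimal sketch assuming a hypothetical `batch_stream` generator that yields synthetic `(X, y)` batches; the feature count, batch sizes and labels are made up for illustration. The full set of classes has to be passed to partial_fit on the first call, since a single batch may not contain every label.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def batch_stream(n_batches, batch_size, seed=0):
    """Yield synthetic (X, y) batches (hypothetical stand-in for real data)."""
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.rand(batch_size, 4)
        y = (X[:, 0] > 0.5).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # every label the model will ever see

for X, y in batch_stream(n_batches=20, batch_size=100):
    # Each call updates the same linear model with one batch only,
    # so the full dataset never has to fit in memory at once.
    clf.partial_fit(X, y, classes=all_classes)

sample_prediction = clf.predict(np.array([[0.9, 0.5, 0.5, 0.5]]))
```

Unlike the merged forest, this produces a single model that has been updated with every batch, at the cost of being restricted to a linear decision function.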
