Using Python generators in scikit-learn
Question
I was wondering whether, and how, it is possible to use a Python generator as the data input to the .fit() function of a scikit-learn classifier. Given the huge amount of data I have, this seems to make sense to me.
In particular I am about to implement a random forest approach.
Regards, K
Recommended answer
The answer is "no". To do out-of-core learning with random forests, you should:
- split your data into reasonably sized batches (restricted by the amount of RAM you have; bigger is better);
- train a separate random forest on each batch;
- append all the underlying trees together in the estimators_ member of one of the forests (untested):
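    # Merge the trees of every other forest into the first forest.
    for forest in forests[1:]:
        forests[0].estimators_.extend(forest.estimators_)
    forests[0].n_estimators = len(forests[0].estimators_)  # keep the count consistent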
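The three steps above can be put together roughly as follows. This is only a minimal sketch, not a tested recipe: the helpers `batches`, `fit_merged_forest` and `toy_rows` are hypothetical names introduced here, the generator is assumed to yield `(features, label)` pairs, and the merge is only meaningful if every batch contains the same set of classes (otherwise the trees disagree on what each output column means, so the sketch refuses to merge in that case).

```python
from itertools import islice

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def batches(rows, batch_size):
    """Turn a generator of (features, label) pairs into (X, y) array batches."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        X = np.array([features for features, _ in chunk])
        y = np.array([label for _, label in chunk])
        yield X, y


def fit_merged_forest(rows, batch_size, n_estimators=10, random_state=0):
    """Train one forest per batch, then collect all trees in the first forest."""
    forests = []
    for i, (X, y) in enumerate(batches(rows, batch_size)):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state + i
        )
        forests.append(forest.fit(X, y))
    if not forests:
        raise ValueError("the generator produced no data")

    merged = forests[0]
    for forest in forests[1:]:
        # Assumption: every batch saw exactly the same classes.
        if not np.array_equal(forest.classes_, merged.classes_):
            raise ValueError("batches contain different sets of classes")
        merged.estimators_.extend(forest.estimators_)
    merged.n_estimators = len(merged.estimators_)
    return merged


def toy_rows(n, seed=0):
    """Hypothetical data source: n random 2-D points with a binary label."""
    rng = np.random.RandomState(seed)
    for _ in range(n):
        x = rng.rand(2)
        yield x, int(x[0] + x[1] > 1.0)


forest = fit_merged_forest(toy_rows(300), batch_size=100, n_estimators=5)
print(len(forest.estimators_))  # 3 batches x 5 trees each
```

Only one batch is held in memory at a time, but each tree has still only seen its own batch, which is why bigger batches are better.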
(Yes, this is hacky, but no solution to this problem has been found yet. Note that with very large datasets, it might pay to just sample a number of training examples that fits in the RAM of a big machine instead of training on all of the data. Another option is to switch to linear models trained with SGD; those implement a partial_fit method, but obviously they're limited in the kind of functions they can learn.)
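For the SGD alternative, a generator can feed `partial_fit` one batch at a time. The following is a rough sketch under the same assumptions as before: `fit_sgd_from_generator` and `toy_rows` are hypothetical helpers, the generator yields `(features, label)` pairs, and the full list of classes is assumed to be known up front, because `partial_fit` needs it on the first call.

```python
from itertools import islice

import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_sgd_from_generator(rows, classes, batch_size=100):
    """Feed a generator of (features, label) pairs to SGDClassifier in batches."""
    clf = SGDClassifier(random_state=0)
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            break
        X = np.array([features for features, _ in chunk])
        y = np.array([label for _, label in chunk])
        # classes must list every label that can ever occur in the stream.
        clf.partial_fit(X, y, classes=classes)
    return clf


def toy_rows(n, seed=0):
    """Hypothetical data source: n random 2-D points with a binary label."""
    rng = np.random.RandomState(seed)
    for _ in range(n):
        x = rng.rand(2)
        yield x, int(x[0] + x[1] > 1.0)


clf = fit_sgd_from_generator(toy_rows(500), classes=np.array([0, 1]))
print(clf.predict(np.array([[0.9, 0.9], [0.1, 0.1]])))
```

Unlike the merged forests, this is a single model updated incrementally, so it sees all of the data; the price is that it can only learn a linear decision function.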