Mini-batch training of a scikit-learn classifier where I provide the mini batches
Question
I have a very big dataset that cannot be loaded into memory.
I want to use this dataset as the training set of a scikit-learn classifier, for example a LogisticRegression.
Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini batches myself?
Recommended answer
Some of the classifiers in sklearn have a partial_fit method. This method lets you pass minibatches of data to the classifier, so that the model is updated incrementally on each minibatch instead of being refit from scratch. You would simply load a minibatch from disk, pass it to partial_fit, release the minibatch from memory, and repeat.
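The "load a minibatch from disk" step can be sketched as a generator that reads a file lazily, so only one batch is in memory at a time. This is a minimal sketch, not part of the original answer: the function name iter_minibatches and the file layout (a headerless CSV with the label in the last column) are assumptions made for illustration.

```python
import csv
import itertools
import os
import tempfile

import numpy as np


def iter_minibatches(csv_path, batch_size):
    """Yield (X, y) arrays of at most batch_size rows, reading the file lazily.

    Assumes a headerless CSV whose last column is an integer label
    (a hypothetical layout chosen for this sketch).
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        while True:
            # Pull the next batch_size rows without reading the whole file.
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break
            data = np.asarray(rows, dtype=float)
            yield data[:, :-1], data[:, -1].astype(int)


# Demo: write a small 10-row file so the sketch runs end to end.
rng = np.random.default_rng(0)
demo = np.column_stack([rng.normal(size=(10, 3)), rng.integers(0, 2, size=10)])
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
np.savetxt(demo_path, demo, delimiter=",")

batch_shapes = [X.shape for X, y in iter_minibatches(demo_path, batch_size=4)]
```

Each (X, y) pair the generator yields can be handed to partial_fit and then dropped, so memory use is bounded by batch_size rather than by the size of the file.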
If you are particularly interested in doing this for logistic regression, then you'll want to use SGDClassifier, which fits a logistic regression model when loss='log' (spelled loss='log_loss' in scikit-learn 1.1 and later).
You pass the features and labels for your minibatch to partial_fit in the same way that you would use fit (for a classifier, the first call also needs a classes argument listing every label that can occur):
clf.partial_fit(X_minibatch, y_minibatch)
Update:
I recently came across the dask-ml library, which would make this task very easy by combining dask arrays with partial_fit. There is an example on the linked webpage.