Is it possible to train a sklearn model (eg SVM) incrementally?


Question


I'm trying to perform sentiment analysis on the Twitter dataset "Sentiment140", which consists of 1.6 million labelled tweets. I'm constructing my feature vectors using a Bag of Words (unigram) model, so each tweet is represented by about 20,000 features. To train my sklearn model (SVM, Logistic Regression, Naive Bayes) on this dataset, I have to load the entire 1.6m x 20000 feature matrix into one variable and then feed it to the model. Even on my server machine, which has a total of 115 GB of memory, this causes the process to be killed.


So I wanted to know: can I train the model instance by instance, rather than loading the entire dataset into one variable?


If sklearn does not have this flexibility, are there any other libraries you could recommend that support sequential learning?

Answer


It is not really necessary (let alone efficient) to go to the other extreme and train instance by instance; what you are looking for is actually called incremental or online learning, and it is available in scikit-learn's SGDClassifier for linear SVM and logistic regression, which indeed provides a partial_fit method.


Here is a quick example with dummy data:

import numpy as np
from sklearn import linear_model

# Toy training data: two classes in a 2-D feature space
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)

# First call: the classes argument must list every class in the problem
clf.partial_fit(X, Y, classes=np.unique(Y))

# Subsequent batches can be fed without the classes argument
X_new = np.array([[-1, -1], [2, 0], [0, 1], [1, 1]])
Y_new = np.array([1, 1, 2, 1])
clf.partial_fit(X_new, Y_new)


The default values for the loss and penalty arguments ('hinge' and 'l2' respectively) are those of a LinearSVC, so the code above essentially fits a linear SVM classifier with L2 regularization incrementally; these settings can of course be changed - check the docs for more details.


It is necessary to include the classes argument in the first call; it should contain all the classes that exist in your problem (even though some of them might be absent from some of the partial fits). It can be omitted in subsequent calls to partial_fit - again, see the linked documentation for more details.
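Putting this together, an out-of-core training loop is just partial_fit called once per chunk of the data. A minimal sketch (the iter_batches helper is illustrative; in practice each chunk would be read from disk, e.g. with pandas.read_csv(..., chunksize=...), instead of slicing an in-memory array):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_batches(X, Y, batch_size):
    """Illustrative helper: yield the data in fixed-size chunks."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [2, 2], [-2, -2]])
Y = np.array([1, 1, 2, 2, 2, 1])
all_classes = np.unique(Y)  # every class in the problem, known up front

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
for i, (X_batch, Y_batch) in enumerate(iter_batches(X, Y, batch_size=2)):
    if i == 0:
        # First call: the classes argument is mandatory
        clf.partial_fit(X_batch, Y_batch, classes=all_classes)
    else:
        # Later calls: classes can be omitted
        clf.partial_fit(X_batch, Y_batch)

pred = clf.predict([[2, 2], [-2, -1]])
print(pred)
```

This way only one batch ever needs to be in memory at a time, which is exactly what the 1.6m x 20000 feature matrix in the question calls for.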
