Saving a feature vector for new data in scikit-learn

Question

To create a machine learning algorithm I made a list of dictionaries and used scikit's DictVectorizer to make a feature vector for each item. I then created an SVM model from a dataset using part of the data for training and then testing the model on the test set (you know, the typical approach). Everything worked great and now I want to deploy the model into the wild and see how it works on new, unlabeled, unseen data. How do I save the feature vector so that the new data will have the same size/features and work with the SVM model? For example, if I want to train on the presence of words:

[{
 'contains(the)': 'True',
 'contains(cat)': 'True',
 'contains(is)': 'True',
 'contains(hungry)': 'True'
 }...
]

I train with a list that has the same sentence with thousands of animal variations. When I vectorize the list, it takes into account all the different animals mentioned and creates an index in the vector for each animal ('the', 'is' and 'hungry' don't change). Now when I try to use the model on a new sentence, I want to predict one item:

[{
 'contains(the)': 'True',
 'contains(emu)': 'True',
 'contains(is)': 'True',
 'contains(hungry)': 'True'
 }]

Without the original training set, when I use DictVectorizer it generates: (1,1,1,1). This is a couple thousand indexes short of the original vectors used to train my model, so the SVM model will not work with it. Or even if the length of the vector is right because it was trained on a massive sentence, the features may not correspond to the original values. How do I get new data to conform to the dimensions of the training vectors? There will never be more features than the training set, but not all features are guaranteed to be present in new data.
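
For concreteness, here is a minimal sketch of that mismatch, with dicts made up to mirror the example above: fitting a fresh DictVectorizer on only the new item learns a tiny vocabulary, so the resulting (1, 1, 1, 1) vector lives in the wrong feature space.

from sklearn.feature_extraction import DictVectorizer

# Hypothetical new, unseen item, mirroring the example above.
new = [{'contains(the)': 'True', 'contains(emu)': 'True',
        'contains(is)': 'True', 'contains(hungry)': 'True'}]

# Fitting a *new* vectorizer on the new item alone learns a 4-feature
# vocabulary, not the thousands of features the SVM was trained on.
bad_vec = DictVectorizer(sparse=False)
print(bad_vec.fit_transform(new))    # [[1. 1. 1. 1.]]
print(len(bad_vec.feature_names_))   # 4 -- wrong dimensionality for the SVM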

Is there a way to use pickle to save the feature vector? Or one method I've considered would be to generate a dictionary that contains all the possible features with value 'False'. That forces new data into the proper vector size and only counts the items present in the new data.

I feel like I may not have described the problem adequately, so if something isn't clear I will attempt to explain it better. Thank you in advance!

Thanks to larsman's answer, the solution was pretty simple:

from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

vec = DictVectorizer(sparse=False)
svm_clf = svm.SVC(kernel='linear')

# Fit the vectorizer and the SVM together so the learned vocabulary
# travels with the model.
vec_clf = Pipeline([('vectorizer', vec), ('svm', svm_clf)])
vec_clf.fit(X_Train, Y_Train)

# Persist the whole pipeline (vectorizer + SVM) in one file.
joblib.dump(vec_clf, 'vectorizer_and_SVM.pkl')

The pipeline AND the support vector machine are trained on the data. Now any future script can unpickle the pipeline and get the feature vectorizer built in alongside the SVM.
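
A later script can then load the pickled pipeline and call predict directly on raw feature dicts; the vectorizer inside applies its already-learned vocabulary via transform. A sketch, assuming the file written by the dump call above:

import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

# Load the fitted DictVectorizer + SVM pipeline saved above.
vec_clf = joblib.load('vectorizer_and_SVM.pkl')

new_data = [{'contains(the)': 'True', 'contains(emu)': 'True',
             'contains(is)': 'True', 'contains(hungry)': 'True'}]

# Pipeline.predict runs transform (not fit_transform) on the vectorizer,
# so the new dict is mapped into the original training feature space.
print(vec_clf.predict(new_data))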

Answer

How do I get new data to conform to the dimensions of the training vectors?

By using the transform method instead of fit_transform. The latter learns a new vocabulary from the data set you feed it.
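
A rough sketch of the difference, with toy dicts assumed to mirror the question: the vectorizer is fitted once on the training dicts, and new items are only passed through transform, which keeps the training feature space and silently drops features it has never seen.

from sklearn.feature_extraction import DictVectorizer

# Toy training set: the same sentence with different animals (made up).
train = [{'contains(the)': 'True', 'contains(cat)': 'True',
          'contains(is)': 'True', 'contains(hungry)': 'True'},
         {'contains(the)': 'True', 'contains(dog)': 'True',
          'contains(is)': 'True', 'contains(hungry)': 'True'}]
new = [{'contains(the)': 'True', 'contains(emu)': 'True',
        'contains(is)': 'True', 'contains(hungry)': 'True'}]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train)  # fit_transform learns the vocabulary: 5 features here
X_new = vec.transform(new)          # transform reuses it: still 5 columns, unseen 'emu' is ignored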

Is there a way to use pickle to save the feature vector?

Pickle the trained vectorizer. Even better, make a Pipeline of the vectorizer and the SVM and pickle that. You can use sklearn.externals.joblib.dump for efficient pickling.

(Aside: the vectorizer is faster if you pass it the boolean True rather than the string "True".)
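
That is, the feature dicts could simply be built with the boolean True; a small sketch:

from sklearn.feature_extraction import DictVectorizer

# Boolean values are treated as numbers (True -> 1.0), so the vectorizer
# skips building separate string features like 'contains(the)=True'.
features = [{'contains(the)': True, 'contains(cat)': True,
             'contains(is)': True, 'contains(hungry)': True}]
X = DictVectorizer(sparse=False).fit_transform(features)  # [[1. 1. 1. 1.]]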
