How to use a saved model for prediction in Python
Question
I am doing text classification in Python and want to use it in a production environment to make predictions on new documents. I am using TfidfVectorizer to build the bag of words.
I am doing:
X_train = vectorizer.fit_transform(clean_documents_for_train, classLabel).toarray()
Then I do cross-validation and build the model using an SVM. After that, I save the model.
To make predictions on my test data, I load that model in another script where I have the same TfidfVectorizer. I know I can't do fit_transform on my test data; I have to do:
X_test = vectorizer.transform(clean_test_documents, classLabel).toarray()
But this is not possible because I have to fit first. I know there is one way: I can load my training data and perform fit_transform like I did when building the model, but my training data is very large and I can't do that every time I want to predict. So my questions are:
- Is there a way to use TfidfVectorizer on my test data and perform prediction?
- Is there any other way to perform prediction?
Recommended answer
The vectorizer is part of your model. When you save your trained SVM model, you need to also save the corresponding vectorizer.
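As a minimal sketch of that idea, the fitted vectorizer and the fitted classifier can be saved as two separate files and loaded together at prediction time. The toy documents, labels, file names, and the linear kernel below are assumptions made up for illustration; they are not from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data, made up for illustration only.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday project status meeting",
]
train_labels = ["spam", "ham", "spam", "ham"]

# Training script: fit the vectorizer and the classifier, then save both.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # sparse matrix; SVC accepts it as-is
clf = SVC(kernel="linear")
clf.fit(X_train, train_labels)

model_dir = tempfile.mkdtemp()  # hypothetical location for the saved files
vectorizer_path = os.path.join(model_dir, "vectorizer.pkl")
clf_path = os.path.join(model_dir, "svm.pkl")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(clf, clf_path)

# Prediction script: load both objects and call transform, not fit_transform,
# so the vocabulary and IDF weights learned during training are reused.
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_clf = joblib.load(clf_path)
X_test = loaded_vectorizer.transform(["cheap watches", "status meeting"])
predictions = loaded_clf.predict(X_test)
print(predictions)
```

Because the loaded vectorizer is already fitted, the training data never has to be reloaded in the prediction script.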
To make this more convenient, you can use Pipeline to construct a single "fittable" object that represents the steps needed to transform raw input into prediction output. In this case, the pipeline consists of a TF-IDF extractor and an SVM classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.pipeline import Pipeline
vectorizer = TfidfVectorizer()
clf = svm.SVC()
tfidf_svm = Pipeline([('tfidf', vectorizer), ('svc', clf)])
documents, y = load_training_data()
tfidf_svm.fit(documents, y)
This way, only a single object needs to be persisted:
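# sklearn.externals.joblib has been removed from recent scikit-learn releases;
# import the standalone joblib package instead.
import joblib
joblib.dump(tfidf_svm, 'model.pkl')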
To apply the model to your test documents, load the trained pipeline and simply call its predict function as usual, with raw document(s) as input.
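The full round trip might look like the sketch below. The helper names train_and_save and load_and_predict, the toy documents, and the temporary file path are hypothetical and exist only to make the example self-contained; in practice the two functions would live in separate training and prediction scripts.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def train_and_save(documents, labels, path):
    """Fit a TF-IDF + SVM pipeline on raw text and persist it as one file."""
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])
    pipeline.fit(documents, labels)
    joblib.dump(pipeline, path)


def load_and_predict(path, raw_documents):
    """Load the fitted pipeline and predict directly from raw text."""
    pipeline = joblib.load(path)
    # The pipeline calls the vectorizer's transform internally, then the SVM's predict.
    return pipeline.predict(raw_documents)


# Toy data, made up for illustration only.
docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday project status meeting",
]
labels = ["spam", "ham", "spam", "ham"]

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
train_and_save(docs, labels, model_path)
predictions = load_and_predict(model_path, ["cheap watches", "status meeting"])
print(predictions)
```

Note that a pickled model is generally only safe to load with the same scikit-learn version that was used to save it.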