How to pickle a customized vectorizer?


Question

I'm having trouble pickling a vectorizer after I customize it.

from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

tfidf_vectorizer = TfidfVectorizer(analyzer=str.split)
pickle.dump(tfidf_vectorizer, open('test.pkl', "wb"))  # raises the TypeError below

This results in "TypeError: can't pickle method_descriptor objects".

However, if I don't customize the analyzer, it pickles fine. Any ideas on how I can get around this problem? I need to persist the vectorizer if I'm going to use it more widely.

By the way, I've found that using a simple string split for the analyzer and pre-processing the corpus to remove non-vocabulary and stop words is essential for decent run speed. Otherwise, most of the vectorizer's run time is spent in "text.py:114(_word_ngrams)". The same goes for HashingVectorizer.
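
As a rough illustration of that pre-processing pattern (the stop-word list and corpus below are invented for the example, and the module-level split function anticipates the pickling fix in the answer), something like this keeps the analyzer trivial:

from sklearn.feature_extraction.text import TfidfVectorizer

def split(s):
    # plain whitespace tokenizer; as a module-level function it is also pickleable
    return s.split()

# hypothetical stop-word list and corpus, purely for illustration
stop_words = {"the", "a", "of"}
raw_corpus = ["the cat sat on the mat", "a dog barked at the cat"]

# pre-process once: lowercase, drop stop words, re-join so the analyzer only has to split
clean_corpus = [
    " ".join(w for w in doc.lower().split() if w not in stop_words)
    for doc in raw_corpus
]

tfidf_vectorizer = TfidfVectorizer(analyzer=split)
X = tfidf_vectorizer.fit_transform(clean_corpus)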

This is related to Persisting data in sklearn and http://scikit-learn.org/0.10/tutorial.html#model-persistence (by the way, sklearn.externals.joblib.dump doesn't help either).

Thanks!

Answer

This is not so much a scikit-learn problem as a general Python problem:

>>> pickle.dumps(str.split)
Traceback (most recent call last):
  File "<ipython-input-7-7d3648c78b22>", line 1, in <module>
    pickle.dumps(str.split)
  File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle method_descriptor objects

The solution is to use a pickleable analyzer:

>>> def split(s):
...     return s.split()
... 
>>> pickle.dumps(split)
'c__main__\nsplit\np0\n.'
>>> tfidf_vectorizer = TfidfVectorizer(analyzer=split)
>>> type(pickle.dumps(tfidf_vectorizer))
<type 'str'>
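
As a follow-up, here is a minimal sketch of persisting and reloading the fitted vectorizer (the example corpus is invented). Note that pickle stores the analyzer by reference, so split must be importable under the same module path in the process that loads the pickle back:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

def split(s):
    # module-level function, so pickle can find it by name when loading
    return s.split()

tfidf_vectorizer = TfidfVectorizer(analyzer=split)
tfidf_vectorizer.fit(["a small example corpus", "another example document"])

# persist to disk
with open("test.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer, f)

# reload later; this works as long as 'split' is importable in the loading process
with open("test.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.transform(["another small example"]).shape)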
