Save and reuse TfidfVectorizer in scikit-learn


Problem description


I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object so I can reuse it later. I tried to use pickle, but it gave the following error.

loc=open('vectorizer.obj','w')
pickle.dump(self.vectorizer,loc)
*** TypeError: can't pickle instancemethod objects


I tried using joblib from sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later?

Here is my full object:

import pickle
import nltk
import pandas as pd

class changeToMatrix(object):
    def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word', lowercase=True,
                                          token_pattern='[a-zA-Z0-9]+', strip_accents='unicode', tokenizer=tokenizer)

    def load_ref_text(self, text_file):
        textfile = open(text_file, 'r')
        lines = textfile.readlines()
        textfile.close()
        lines = ' '.join(lines)
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [sent_tokenizer.tokenize(lines.strip())]
        sentences1 = [item.strip().strip('.') for sublist in sentences for item in sublist]
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())  # vectorizer is fitted in this step
        return sentences1, [chk2]

    def get_processed_data(self, data_loc):
        ref_sentences, ref_dataframes = self.load_ref_text(data_loc)
        loc = open("indexedData/vectorizer.obj", "w")
        pickle.dump(self.vectorizer, loc)  # getting error here
        loc.close()
        return ref_sentences, ref_dataframes

Answer


Firstly, it's better to leave the import at the top of your code instead of within your class:

from sklearn.feature_extraction.text import TfidfVectorizer
class changeToMatrix(object):
  def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()):
    ...


Next, StemTokenizer doesn't seem to be a canonical class. Possibly you got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or somewhere else, so we'll assume it returns a list of strings:

from nltk import word_tokenize
from nltk.corpus import wordnet as wn

class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)  # lemmatize with WordNet's morphy
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words
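
As a quick sanity check (a hypothetical snippet, assuming NLTK's punkt and WordNet data are installed), the tokenizer is a plain callable that maps a string to a list of strings, which is exactly the interface TfidfVectorizer's tokenizer parameter needs:

tok = StemTokenizer()
print(tok('The philosophers wrote many books.'))  # a list of lowercased, lemmatized word strings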


Now, to answer your actual question: you most likely need to open the file in binary mode ('wb') before dumping the pickle, i.e.:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
>>> vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents='unicode', sublinear_tf=False,
        token_pattern='[a-zA-Z0-9]+',
        tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
        vocabulary=None)
>>> with open('vectorizer.pk', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
... 
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk 
-rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk


Note: using the with idiom for file I/O automatically closes the file once you leave the with block.
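
To reuse the vectorizer later, load it back the same way, opening the file in binary read mode ('rb') to match the 'wb' used when dumping. A minimal sketch, assuming vectorizer.pk was saved as above:

>>> import cPickle as pickle
>>> with open('vectorizer.pk', 'rb') as fin:
...     vectorizer = pickle.load(fin)
... 
>>> # vectorizer is now the same (still unfitted) TfidfVectorizer that was dumped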


Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object while the stemming function is SnowballStemmer('english').stem.

Important:

  • TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings
  • But the Snowball stemmer's stem function takes a single word and returns its stem, not a list of strings.


So you will need to do this:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from nltk.stem import SnowballStemmer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> stemmer = SnowballStemmer('english').stem
>>> def stem_tokenize(text):
...     return [stemmer(i) for i in word_tokenize(text)]
... 
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
>>> with open('vectorizer.pk', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
...
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk 
-rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
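
From there, reuse is straightforward: fit the loaded vectorizer on your reference sentences once, then call transform on any new documents. One caveat: the pickle only stores a reference to the stem_tokenize function, so that function must be defined (or importable) in whichever session loads the pickle. A rough sketch with made-up example sentences:

>>> from nltk.stem import SnowballStemmer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> stemmer = SnowballStemmer('english').stem
>>> def stem_tokenize(text):
...     return [stemmer(i) for i in word_tokenize(text)]
... 
>>> with open('vectorizer.pk', 'rb') as fin:
...     vectorizer = pickle.load(fin)
... 
>>> tfidf_matrix = vectorizer.fit_transform(['the cats are running', 'a cat ran home'])
>>> new_matrix = vectorizer.transform(['another cat is running home'])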

