Save and reuse TfidfVectorizer in scikit learn


Problem description

I am using TfidfVectorizer in scikit learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following error.

loc=open('vectorizer.obj','w')
pickle.dump(self.vectorizer,loc)
*** TypeError: can't pickle instancemethod objects

I tried using joblib in sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later?

Here is my full object:

import pickle

import nltk
import pandas as pd

# StemTokenizer is a custom tokenizer class (see the answer below)

class changeToMatrix(object):
    def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word', lowercase=True,
                                          token_pattern='[a-zA-Z0-9]+', strip_accents='unicode', tokenizer=tokenizer)

    def load_ref_text(self, text_file):
        textfile = open(text_file, 'r')
        lines = textfile.readlines()
        textfile.close()
        lines = ' '.join(lines)
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [sent_tokenizer.tokenize(lines.strip())]
        sentences1 = [item.strip().strip('.') for sublist in sentences for item in sublist]
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())  # vectorizer is fitted in this step
        return sentences1, [chk2]

    def get_processed_data(self, data_loc):
        ref_sentences, ref_dataframes = self.load_ref_text(data_loc)
        loc = open("indexedData/vectorizer.obj", "w")
        pickle.dump(self.vectorizer, loc)  # getting error here
        loc.close()
        return ref_sentences, ref_dataframes

Solution

Firstly, it's better to keep the import at the top of your code instead of inside your class:

from sklearn.feature_extraction.text import TfidfVectorizer

class changeToMatrix(object):
    def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
        ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or somewhere else, so we'll assume it takes a document string and returns a list of strings.

from nltk import word_tokenize
from nltk.corpus import wordnet as wn

class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)  # map the word to its WordNet base form
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words

Now, to answer your actual question: it's possible that you need to open the file in byte mode before dumping a pickle, i.e.:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
>>> vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents='unicode', sublinear_tf=False,
        token_pattern='[a-zA-Z0-9]+',
        tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
        vocabulary=None)
>>> with open('vectorizer.pk', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
... 
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk 
-rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk
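
To reuse the pickled object later, load it back in byte mode too, e.g. in a new session:

>>> import cPickle as pickle
>>> with open('vectorizer.pk', 'rb') as fin:
...     vectorizer = pickle.load(fin)
... 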

Note: using the with idiom for file I/O automatically closes the file once you leave the with scope.

Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object while the stemming function is SnowballStemmer('english').stem.
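
For example, the stem function maps one word to one stemmed word:

>>> from nltk.stem import SnowballStemmer
>>> stem = SnowballStemmer('english').stem
>>> stem('running')   # a single string in, a single string out
'run'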

IMPORTANT:

  • TfidfVectorizer's tokenizer parameter expects a function that takes a string and returns a list of strings
  • But the Snowball stemmer's stem function takes a single word and returns a single stemmed word, not a list of strings.

So you will need to do this:

>>> from nltk.stem import SnowballStemmer
>>> from nltk import word_tokenize
>>> stemmer = SnowballStemmer('english').stem
>>> def stem_tokenize(text):
...     return [stemmer(i) for i in word_tokenize(text)]
... 
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
>>> with open('vectorizer.pk', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
...
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk 
-rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
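
One more point worth noting: the snippets above pickle the vectorizer before it has been fitted, so the file only stores its configuration. To reuse the learned vocabulary and IDF weights later (usually the goal, as in the question's fit_transform step), fit the vectorizer before dumping it. A minimal sketch, with made-up documents and a hypothetical fitted_vectorizer.pk filename:

from sklearn.feature_extraction.text import TfidfVectorizer
import cPickle as pickle

docs = ['the cat sat on the mat', 'the dog sat on the log']  # made-up example documents
vectorizer = TfidfVectorizer(analyzer='word', lowercase=True)
vectorizer.fit_transform(docs)  # fit first so the vocabulary and IDF weights are learned

with open('fitted_vectorizer.pk', 'wb') as fout:
    pickle.dump(vectorizer, fout)

# later, possibly in another session:
with open('fitted_vectorizer.pk', 'rb') as fin:
    vectorizer = pickle.load(fin)
matrix = vectorizer.transform(['the cat sat on the log'])  # reuses the learned vocabulary, no refitting

Also note that pickle stores a custom tokenizer function like stem_tokenize by reference, so the same function must be importable (or defined again under the same name) when you load the vectorizer back.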
