Save and reuse TfidfVectorizer in scikit-learn
Problem description
I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following error.
loc=open('vectorizer.obj','w')
pickle.dump(self.vectorizer,loc)
*** TypeError: can't pickle instancemethod objects
I tried using joblib in sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later?
Here is my full object:
import pickle
import nltk
import pandas as pd

class changeToMatrix(object):
    def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word', lowercase=True,
                                          token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
                                          tokenizer=tokenizer)

    def load_ref_text(self, text_file):
        textfile = open(text_file, 'r')
        lines = textfile.readlines()
        textfile.close()
        lines = ' '.join(lines)
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [sent_tokenizer.tokenize(lines.strip())]
        sentences1 = [item.strip().strip('.') for sublist in sentences for item in sublist]
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())  # vectorizer is fitted/transformed in this step
        return sentences1, [chk2]

    def get_processed_data(self, data_loc):
        ref_sentences, ref_dataframes = self.load_ref_text(data_loc)
        loc = open("indexedData/vectorizer.obj", "w")
        pickle.dump(self.vectorizer, loc)  # getting error here
        loc.close()
        return ref_sentences, ref_dataframes
Firstly, it's better to leave the import at the top of your code instead of within your class:
from sklearn.feature_extraction.text import TfidfVectorizer

class changeToMatrix(object):
    def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
        ...
Next, StemTokenizer doesn't seem to be a canonical class. Possibly you got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or somewhere else, so we'll assume it returns a list of strings:
from nltk import word_tokenize
from nltk.corpus import wordnet as wn

class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words
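As a quick sanity check (assuming NLTK's punkt and wordnet data are installed; the sample sentence is made up for illustration), this tokenizer takes one string and returns a list of strings, which is exactly the contract TfidfVectorizer's tokenizer parameter expects:

tokenizer = StemTokenizer()
tokens = tokenizer("The philosophers wrote many footnotes about knowledge.")
# tokens is a list of lowercased, morphy-normalized words; note that
# 'footnotes' is normalized to 'footnote' and then dropped, because
# 'footnote' is in ignore_set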
Now, to answer your actual question: it's likely that you need to open the file in binary ('wb') mode before dumping the pickle, i.e.:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
>>> vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents='unicode', sublinear_tf=False,
token_pattern='[a-zA-Z0-9]+',
tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
vocabulary=None)
>>> with open('vectorizer.pk', 'wb') as fin:
... pickle.dump(vectorizer, fin)
...
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk
-rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk
Note: using the with idiom for file I/O automatically closes the file once you leave the with scope.
Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the stemming function is SnowballStemmer('english').stem.
IMPORTANT: TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings. The Snowball stemmer is not such a callable: it takes a single word and returns a single stemmed word, not a list of strings.
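A quick illustration of that one-word-in, one-word-out behaviour:

>>> from nltk.stem import SnowballStemmer
>>> stem = SnowballStemmer('english').stem
>>> print(stem('running'))  # one token in, one stemmed token out
run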
So you will need to do this:
>>> from nltk.stem import SnowballStemmer
>>> from nltk import word_tokenize
>>> stemmer = SnowballStemmer('english').stem
>>> def stem_tokenize(text):
... return [stemmer(i) for i in word_tokenize(text)]
...
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
>>> with open('vectorizer.pk', 'wb') as fin:
... pickle.dump(vectorizer, fin)
...
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk
-rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
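One last gap worth closing: the sessions above pickle an unfitted vectorizer. To actually reuse it later, fit it on your reference corpus first, then load it back and call transform on new text. A minimal sketch of the round trip, continuing the session above (the toy corpus is made up for illustration; also note that pickle stores the stem_tokenize function by reference, so it must be defined or importable in the process that loads the pickle):

>>> corpus = ['the cat sat on the mat', 'the dog ate my homework']
>>> vectorizer = vectorizer.fit(corpus)  # learn vocabulary and idf weights
>>> with open('vectorizer.pk', 'wb') as fout:
...     pickle.dump(vectorizer, fout)
...
>>> with open('vectorizer.pk', 'rb') as fin:
...     loaded = pickle.load(fin)
...
>>> new_matrix = loaded.transform(['the cat ate the mat'])  # no re-fitting needed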
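As for the joblib attempt mentioned in the question: it presumably failed for the same root cause (joblib pickles under the hood), so once the tokenizer is a plain module-level function the object should dump cleanly with joblib as well. In this era of scikit-learn, joblib lived under sklearn.externals; in modern releases, import joblib directly:

>>> from sklearn.externals import joblib
>>> _ = joblib.dump(vectorizer, 'vectorizer.joblib')  # returns the list of files written
>>> loaded = joblib.load('vectorizer.joblib')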