tfidf vectorizer process shows error

Question

I am working on a non-English corpus analysis but am facing several problems. One of them involves tfidf_vectorizer. After importing the relevant libraries, I ran the following code to get results:

# raw string, so that "\t" in the Windows path is not read as a tab character
contents = [open(r"D:\test.txt", encoding='utf8').read()]
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words=stopwords,
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(contents) 

print(tfidf_matrix.shape)

After running the code above, I got the following error message:

ValueError                                Traceback (most recent call last)
<ipython-input-144-bbcec8b8c065> in <module>()
      5                                  use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(3,3))
      6 
----> 7 get_ipython().magic('time tfidf_matrix = tfidf_vectorizer.fit_transform(contents) #fit the vectorizer to synopses')
      8 
      9 print(tfidf_matrix.shape)

C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in magic(self, arg_s)
   2156         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2157         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158         return self.run_line_magic(magic_name, magic_arg_s)
   2159 
   2160     #-------------------------------------------------------------------------

C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py in run_line_magic(self, magic_name, line)
   2077                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2078             with self.builtin_trap:
-> 2079                 result = fn(*args,**kwargs)
   2080             return result
   2081 

<decorator-gen-60> in time(self, line, cell, local_ns)

C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magic.py in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

C:\Users\mazhar\Anaconda3\lib\site-packages\IPython\core\magics\execution.py in time(self, line, cell, local_ns)
   1178         else:
   1179             st = clock2()
-> 1180             exec(code, glob, local_ns)
   1181             end = clock2()
   1182             out = None

<timed exec> in <module>()

C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1303             Tf-idf-weighted document-term matrix.
   1304         """
-> 1305         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1306         self._tfidf.fit(X)
   1307         # X is already a transformed view of raw_documents so

C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
    836                                                        max_doc_count,
    837                                                        min_doc_count,
--> 838                                                        max_features)
    839 
    840             self.vocabulary_ = vocabulary

C:\Users\mazhar\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _limit_features(self, X, vocabulary, high, low, limit)
    731         kept_indices = np.where(mask)[0]
    732         if len(kept_indices) == 0:
--> 733             raise ValueError("After pruning, no terms remain. Try a lower"
    734                              " min_df or a higher max_df.")
    735         return X[:, kept_indices], removed_terms

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

If I change the min and max values, the error is

Recommended answer

Assuming your tokeniser works as expected, I see two problems with your code. First, TfidfVectorizer expects a list of documents (one string per document), whereas you are providing the entire file as a single string, so the corpus contains only one document. Every term then occurs in 100% of the documents, which exceeds max_df=0.8, so all terms are pruned. Second, min_df=0.2 is quite high: to be included, a term needs to occur in at least 20% of all documents, which is very unlikely for trigram features.
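The pruning rule behind this error can be sketched in plain Python. This is only an illustration, not scikit-learn's actual code: the helper name `surviving_terms` is hypothetical, and it splits on whitespace and uses single words where the real vectorizer would use its tokenizer and trigrams. It does follow the same convention that a float threshold is a fraction of the number of documents and an integer threshold is an absolute document count.

```python
from collections import Counter


def surviving_terms(docs, min_df, max_df):
    """Return the terms whose document frequency lies within [min_df, max_df].

    Hypothetical helper for illustration: floats are read as a fraction of
    the number of documents, integers as absolute document counts.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


# One document: every term has a document frequency of 1, which is above
# the upper limit of 0.8 * 1 = 0.8, so nothing survives.
single = ["the cat sat on the mat"]
print(surviving_terms(single, min_df=0.2, max_df=0.8))  # []

# Several documents with an absolute lower limit of 2 documents.
many = ["the cat sat", "the dog sat", "the bird flew", "a fish swam"]
print(surviving_terms(many, min_df=2, max_df=0.8))  # ['sat', 'the']
```

With a single document the upper limit is the one that empties the vocabulary, so lowering min_df alone would not help in that case.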

The following worked for me:

from sklearn.feature_extraction.text import TfidfVectorizer
with open("README.md") as infile:
    contents = infile.readlines() # Note: readlines() instead of read()

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=2, use_idf=True, ngram_range=(3,3))
# note: min_df=2 keeps terms that appear in at least 2 documents (here, lines),
# rather than in at least 20% of all documents (min_df=0.2)

tfidf_matrix = tfidf_vectorizer.fit_transform(contents) 

print(tfidf_matrix.shape)

Output: (155, 28)
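The shape is (number of documents, number of trigrams that survived pruning). To see which trigrams those are, the fitted vectorizer's `vocabulary_` attribute can be inspected. The sketch below uses a small made-up corpus instead of a file, so the documents and the resulting numbers are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: one string per document.
docs = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "one slow green turtle walks",
    "one slow green turtle swims",
    "nothing in common here today",
]

# Word trigrams only; keep those found in at least 2 documents
# and in at most 80% of the documents.
vectorizer = TfidfVectorizer(ngram_range=(3, 3), min_df=2, max_df=0.8)
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                   # (documents, surviving trigrams)
print(sorted(vectorizer.vocabulary_)) # the surviving trigrams themselves
```

If the vocabulary comes out empty or very small on real data, that is usually a sign that min_df is too strict for the chosen n-gram size.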
