Dealing with a large number of unique words for text processing/tf-idf etc


Problem description


I am using scikit to do some text processing, such as tf-idf. The number of filenames is handled fine (~40k), but as far as the number of unique words goes, I am not able to deal with the array/matrix, whether it is to print the number of unique words or to dump the numpy array to a file (using savetxt). Below is the traceback. It would be enough to get just the top tf-idf values, since I don't need one for every single word in every single document. Alternatively, I could exclude certain words from the calculation (not stop words, but a separate set of words in a text file that I could add to be excluded), though I don't know whether removing those words would alleviate the situation. Finally, if I could somehow grab pieces of the matrix, that could work too. Any example of dealing with this kind of thing would be helpful and give me some starting points. (P.S. I looked at and tried HashingVectorizer, but it doesn't seem that I can do tf-idf with it?)

Traceback (most recent call last):
  File "/sklearn.py", line 40, in <module>
    array = X.toarray()
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 790, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 239, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 699, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big.

Relevant code:

import os
from sklearn.feature_extraction.text import CountVectorizer

path = "/home/files/"

fh = open('output.txt', 'w')

# List the input documents and sort for a stable document order
filenames = os.listdir(path)
filenames.sort()

try:
    filenames.remove('.DS_Store')
except ValueError:
    pass  # '.DS_Store' was not in the list
except AttributeError:
    pass  # filenames is not behaving like a list

# Prepend the directory so CountVectorizer(input='filename') can open each file
filenames = [os.path.join(path, f) for f in filenames]

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)  # X is a sparse (CSR) term-document matrix
fh.write(str(vectorizer.vocabulary_))

# Densifying the sparse matrix is what raises "ValueError: array is too big."
array = X.toarray()
print array.size
print array.shape

If it helps:

print 'Array is:' + str(X.get_shape()[0])  + ' by ' + str(X.get_shape()[1]) + ' matrix.'


Get the dimensions of the too-large sparse matrix, in my case:

Array is: 39436 by 113214 matrix.

Answer


The traceback holds the answer here: when you call X.toarray() at the end, it converts the sparse matrix representation into a dense one. This means that instead of storing an entry only for the (word, document) pairs that actually occur, you are now storing a value for every word in the vocabulary for every document.
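For scale: a dense 39436 × 113214 array has roughly 4.5 billion cells, and at 8 bytes per cell that is on the order of 33 GB, which is why the np.zeros call in the traceback fails.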


Thankfully, most operations work with sparse matrices, or have sparse variants. Just avoid calling .toarray() or .todense() and you'll be good to go.
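If only the highest tf-idf scores per document are needed (as the question suggests), they can be read straight off the sparse rows without ever densifying. A minimal sketch, assuming the same filenames list and fh output file as in the question's code; TfidfVectorizer is scikit-learn's combination of CountVectorizer and tf-idf weighting, and the top_k cutoff is just an illustrative choice:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)          # stays a scipy.sparse CSR matrix

feature_names = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
top_k = 20  # illustrative: keep only the 20 highest-scoring terms per document

for i in range(X.shape[0]):
    row = X.getrow(i)                            # sparse 1 x n_features row
    if row.nnz == 0:
        continue
    order = np.argsort(row.data)[::-1][:top_k]   # sort only the nonzero entries of this row
    top_terms = feature_names[row.indices[order]]
    top_scores = row.data[order]
    fh.write('%s\t%s\n' % (filenames[i], zip(top_terms, top_scores)))

Because the sort runs only over row.data (the nonzero entries of one document), nothing remotely close to the full 39436 × 113214 array is ever allocated.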


For more information, check out the scipy sparse matrix documentation.
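On the related point of dumping the matrix to disk: np.savetxt needs a dense array, but the sparse matrix can be serialized as-is. A small sketch using scipy's Matrix Market writer (the file name here is just an example):

from scipy.io import mmwrite, mmread

mmwrite('tfidf_matrix.mtx', X)               # writes only the stored (nonzero) entries
X_back = mmread('tfidf_matrix.mtx').tocsr()  # mmread returns COO; convert back to CSR

As for the P.S. in the question: HashingVectorizer by itself only produces (hashed) token counts, but it is commonly chained with TfidfTransformer to get tf-idf weights over the hashed features; the exact options needed to keep the counts non-negative depend on the scikit-learn version.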

