Dealing with a large number of unique words for text processing/tf-idf etc


Problem description


I am using scikit to do some text processing, such as tf-idf. The number of filenames is handled fine (~40k), but as far as the number of unique words goes, I am not able to deal with the array/matrix, whether it is to print the number of unique words or to dump the numpy array to a file (using savetxt). Below is the traceback. It would be enough to get just the top tf-idf values, since I don't need one for every single word in every single document. Alternatively, I could exclude certain words from the calculation (not stop words, but a separate set of words in a text file that I could add to be excluded), though I don't know whether removing those words would alleviate the situation. Finally, if I could somehow grab pieces of the matrix, that could work too. Any example of dealing with this kind of thing would be helpful and give me some starting points. (P.S. I looked at and tried HashingVectorizer, but it doesn't seem that I can do tf-idf with it?)

Traceback (most recent call last):
  File "/sklearn.py", line 40, in <module>
    array = X.toarray()
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 790, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 239, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 699, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big.

Relevant code:

import os
from sklearn.feature_extraction.text import CountVectorizer

path = "/home/files/"

fh = open('output.txt', 'w')

# List the input documents and sort for a stable document order
filenames = os.listdir(path)
filenames.sort()

try:
    filenames.remove('.DS_Store')
except ValueError:
    pass  # '.DS_Store' was not in the list
except AttributeError:
    pass  # filenames is not behaving like a list

# Prepend the directory so CountVectorizer(input='filename') can open each file
filenames = [os.path.join(path, f) for f in filenames]

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)  # X is a sparse (CSR) term-document matrix
fh.write(str(vectorizer.vocabulary_))

# Densifying the sparse matrix is what raises "ValueError: array is too big."
array = X.toarray()
print array.size
print array.shape

If it helps:

print 'Array is:' + str(X.get_shape()[0])  + ' by ' + str(X.get_shape()[1]) + ' matrix.'


Get the dimensions of the too-large sparse matrix, in my case:

Array is: 39436 by 113214 matrix.

Answer


The traceback holds the answer here: when you call X.toarray() at the end, it converts the sparse matrix representation into a dense one. This means that instead of storing an entry only for the (word, document) pairs that actually occur, you are now storing a value for every word in the vocabulary for every document.
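For scale: a dense 39436 × 113214 array has roughly 4.5 billion cells, and at 8 bytes per cell that is on the order of 33 GB, which is why the np.zeros call in the traceback fails.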


Thankfully, most operations work with sparse matrices, or have sparse variants. Just avoid calling .toarray() or .todense() and you'll be good to go.
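If only the highest tf-idf scores per document are needed (as the question suggests), they can be read straight off the sparse rows without ever densifying. A minimal sketch, assuming the same filenames list and fh output file as in the question's code; TfidfVectorizer is scikit-learn's combination of CountVectorizer and tf-idf weighting, and the top_k cutoff is just an illustrative choice:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english')
X = vectorizer.fit_transform(filenames)          # stays a scipy.sparse CSR matrix

feature_names = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn
top_k = 20  # illustrative: keep only the 20 highest-scoring terms per document

for i in range(X.shape[0]):
    row = X.getrow(i)                            # sparse 1 x n_features row
    if row.nnz == 0:
        continue
    order = np.argsort(row.data)[::-1][:top_k]   # sort only the nonzero entries of this row
    top_terms = feature_names[row.indices[order]]
    top_scores = row.data[order]
    fh.write('%s\t%s\n' % (filenames[i], zip(top_terms, top_scores)))

Because the sort runs only over row.data (the nonzero entries of one document), nothing remotely close to the full 39436 × 113214 array is ever allocated.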


For more information, check out the scipy sparse matrix documentation.
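On the related point of dumping the matrix to disk: np.savetxt needs a dense array, but the sparse matrix can be serialized as-is. A small sketch using scipy's Matrix Market writer (the file name here is just an example):

from scipy.io import mmwrite, mmread

mmwrite('tfidf_matrix.mtx', X)               # writes only the stored (nonzero) entries
X_back = mmread('tfidf_matrix.mtx').tocsr()  # mmread returns COO; convert back to CSR

As for the P.S. in the question: HashingVectorizer by itself only produces (hashed) token counts, but it is commonly chained with TfidfTransformer to get tf-idf weights over the hashed features; the exact options needed to keep the counts non-negative depend on the scikit-learn version.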

