How to get a tf-idf matrix of a large corpus, where features are pre-specified?

Question
I have a corpus consisting of 3,500,000 text documents. I want to construct a tf-idf matrix of size (3,500,000 × 5,000), where the 5,000 columns are distinct pre-specified features (words).

I am using scikit-learn (`sklearn`) in Python, specifically `TfidfVectorizer`, to do this. I have constructed a dictionary of 5,000 entries (one per feature), and when initializing the `TfidfVectorizer` I set its `vocabulary` parameter to that dictionary of features. But when I call `fit_transform`, it shows some memory-map output and then "CORE DUMP".
- Does `TfidfVectorizer` perform well for a fixed vocabulary and a large corpus?
- If not, what are the other options?
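For reference, the setup described above looks like the following minimal sketch, with a toy three-document corpus and a three-word vocabulary standing in for the real 3,500,000 × 5,000 case. Note that `fit_transform` returns a scipy sparse matrix, so the result itself is compact; the document names and vocabulary here are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 3,500,000-document collection
docs = [
    "the cat sat on the mat",
    "the dog ran after the cat",
    "a dog and a cat played",
]

# Pre-specified feature set (5,000 words in the real case)
vocab = ["cat", "dog", "mat"]

# Passing `vocabulary` fixes the columns of the output matrix
vectorizer = TfidfVectorizer(vocabulary=vocab)

# Returns a scipy.sparse CSR matrix of shape (n_docs, len(vocab))
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, 3)
```

Because the output is sparse, memory pressure at this scale usually comes from tokenizing and holding the raw documents, not from the tf-idf matrix itself; passing a lazy iterator of documents (e.g. a generator reading files one at a time) instead of an in-memory list can help.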
Answer

Another option is gensim: it is very memory-efficient and very fast. Here is the link to its tf-idf tutorial for your corpus.