How to get the tf-idf matrix of a large corpus, where features are pre-specified?


Question

I have a corpus consisting of 3,500,000 text documents, and I want to construct a tf-idf matrix of size (3,500,000 × 5,000). Here I have 5,000 distinct features (words).

I am using scikit-learn in Python, specifically TfidfVectorizer, to do this. I constructed a dictionary of size 5,000 (one entry per feature) and passed it as the vocabulary parameter when initializing the TfidfVectorizer. But when I call fit_transform, it shows some memory-map output and then "CORE DUMP".

  1. Does TfidfVectorizer perform well for a fixed vocabulary and a large corpus?
  2. If not, what are the other options?
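For reference, a minimal sketch of the setup described above: a TfidfVectorizer initialized with a pre-specified vocabulary. The corpus and word list here are toy stand-ins for the real 3,500,000 documents and 5,000 features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins; in the real case the vocabulary has 5,000 entries
# and the corpus has 3,500,000 documents.
vocabulary = ["cat", "dog", "fish"]
corpus = ["the cat sat", "the dog ran", "cat and dog and fish"]

# Passing vocabulary= restricts the features to exactly these words,
# in this order; all other tokens are ignored.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (3, 3) - a sparse CSR matrix of shape (n_docs, n_features)
```

Because the result is a scipy sparse matrix, the output itself is compact; the memory pressure in the question comes from processing the full corpus in one call.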

Answer

Another option is gensim, which is very memory-efficient and very fast. See its tf-idf tutorial for applying it to your corpus.
