How to get a tf-idf matrix of a large corpus, where features are pre-specified?

Question
I have a corpus consisting of 3,500,000 text documents. I want to construct a tf-idf matrix of size (3,500,000 × 5,000), where the 5,000 columns are distinct pre-specified features (words).

I am using scikit-learn (`sklearn`) in Python, specifically `TfidfVectorizer`, to do this. I have constructed a dictionary of 5,000 entries (one per feature), and when initializing the `TfidfVectorizer` I set its `vocabulary` parameter to that dictionary of features. But when I call `fit_transform`, it shows some memory-map output and then "CORE DUMP".
- Does `TfidfVectorizer` perform well for a fixed vocabulary and a large corpus?
- If not, what are the other options?
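For reference, the setup described above looks like the following minimal sketch, with a toy three-document corpus and a three-word vocabulary standing in for the real 3,500,000 × 5,000 case. Note that `fit_transform` returns a scipy sparse matrix, so the result itself is compact; the document names and vocabulary here are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the 3,500,000-document collection
docs = [
    "the cat sat on the mat",
    "the dog ran after the cat",
    "a dog and a cat played",
]

# Pre-specified feature set (5,000 words in the real case)
vocab = ["cat", "dog", "mat"]

# Passing `vocabulary` fixes the columns of the output matrix
vectorizer = TfidfVectorizer(vocabulary=vocab)

# Returns a scipy.sparse CSR matrix of shape (n_docs, len(vocab))
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, 3)
```

Because the output is sparse, memory pressure at this scale usually comes from tokenizing and holding the raw documents, not from the tf-idf matrix itself; passing a lazy iterator of documents (e.g. a generator reading files one at a time) instead of an in-memory list can help.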
Answer

Another option is gensim: it is very memory-efficient and very fast. Here is the link to its tf-idf tutorial for your corpus.