TFIDF for Large Dataset


Question


I have a corpus of around 8 million news articles and need to get their TFIDF representation as a sparse matrix. I have been able to do that with scikit-learn for a relatively small number of samples, but I believe it can't be used for such a huge dataset, since it first loads the input matrix into memory, and that's an expensive process.


Does anyone know the best way to extract TFIDF vectors for large datasets?

Answer


Gensim has an efficient tf-idf model and does not need to hold everything in memory at once.


Your corpus only needs to be an iterable, so the whole corpus never has to be in memory at the same time.


According to the comments, the make_wiki script runs over all of Wikipedia in about 50 minutes on a laptop.

