Sklearn TFIDF on large corpus of documents
Question
In the context of an internship project, I have to perform a TF-IDF analysis over a large set of files (~18000). I am trying to use the TfidfVectorizer from sklearn, but I'm facing the following issue: how can I avoid loading all the files into memory at once? According to what I read in other posts, it seems to be feasible using an iterable, but if I use, for instance, [open(file) for file in os.listdir(path)] as the raw_documents input to the fit_transform() function, I get a 'too many open files' error. Thanks in advance for your suggestions! Cheers! Paul
Answer
Have you tried the input='filename' parameter of TfidfVectorizer? Something like this:
# Pass file paths, not open file handles
raw_docs_filepaths = [os.path.join(path, fname) for fname in os.listdir(path)]

tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)
This should work because, with input='filename', the vectorizer opens a single file at a time, only while processing it. This can be confirmed by cross-checking the decode() method in the source code here:
def decode(self, doc):
    ...
    if self.input == 'filename':
        with open(doc, 'rb') as fh:
            doc = fh.read()
    ...
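Putting the pieces together, here is a minimal end-to-end sketch (the helper name tfidf_from_directory and the assumption that all files sit in one flat directory are illustrative, not from the original post):

```python
import os

from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_from_directory(path):
    """Fit TF-IDF over every file in `path` without holding all contents in memory.

    Only file *paths* are kept in the list; the vectorizer itself opens and
    reads each file one at a time during fit_transform (see decode() above).
    """
    filepaths = [os.path.join(path, fname) for fname in sorted(os.listdir(path))]
    vectorizer = TfidfVectorizer(input='filename')
    matrix = vectorizer.fit_transform(filepaths)  # sparse (n_files, n_terms) matrix
    return vectorizer, matrix
```

The returned matrix is sparse, so even for ~18000 documents only the nonzero term weights are stored; the raw texts never need to be loaded simultaneously.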