TfidfVectorizer for a corpus that cannot fit in memory


Question

I want to build a tf-idf model based on a corpus that cannot fit in memory. I read the tutorial, but the corpus seems to be loaded all at once:

from sklearn.feature_extraction.text import TfidfVectorizer

# The entire corpus is held in memory as a list of strings
corpus = ["doc1", "doc2", "doc3"]
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)

I wonder if I can load the documents into memory one by one instead of loading them all at once.

Answer

Yes, you can: just make your corpus an iterator. For example, if your documents reside on disk, you can define a generator that takes a list of file names as an argument and yields the documents one by one, without loading everything into memory at once.

from sklearn.feature_extraction.text import TfidfVectorizer

def make_corpus(doc_files):
    # Yield one document at a time so the full corpus never sits in memory
    for doc in doc_files:
        yield load_doc_from_file(doc)  # load_doc_from_file is a custom function for loading a doc from a file

file_list = ...  # list of files you want to load
corpus = make_corpus(file_list)
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)
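
To make this runnable, here is a minimal sketch of what load_doc_from_file might look like, together with one caveat the answer doesn't mention: a generator is exhausted after a single pass, so if you also want to call transform, build a fresh generator for it. The file-reading helper and the UTF-8 encoding below are assumptions for illustration, not part of the original answer.

def load_doc_from_file(path):
    # Hypothetical helper (assumed, not from the original answer):
    # read one document from disk as a single string, assuming UTF-8 text files
    with open(path, encoding="utf-8") as f:
        return f.read()

vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(make_corpus(file_list))             # first pass: learn vocabulary and idf weights
X = vectorizer.transform(make_corpus(file_list))   # second pass needs a fresh generator; the first one is exhausted

Alternatively, scikit-learn can do the lazy loading for you: constructing TfidfVectorizer(input='filename') lets you pass the list of file names directly to fit, and each file is read only when its document is processed.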

