scikit-learn vectorizing with big dataset


Question

I have 9GB of segmented documents on my disk and my vps only has 4GB memory.

How can I vectorize the whole data set without loading the entire corpus into memory at initialization? Is there any sample code?

My code so far:

from sklearn.feature_extraction.text import CountVectorizer

# Reads every document into memory at once -- this is what exhausts RAM.
contents = [open('./seg_corpus/' + filename).read()
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(contents)

Answer

Try this: instead of loading all the texts into memory, pass only the file handles to the fit method. You must specify input='file' in the CountVectorizer constructor so that it reads from the handles itself.

from sklearn.feature_extraction.text import CountVectorizer

# Pass open file handles instead of strings; with input='file' the
# vectorizer calls .read() on each handle as it processes it.
contents = [open('./seg_corpus/' + filename)
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(contents)
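A related option, as a minimal sketch: CountVectorizer also accepts input='filename', in which case you pass a list of paths and the vectorizer opens, reads, and closes each file itself, so no file handles stay open and only the vocabulary and sparse counts live in memory. The tiny demo corpus written to a temp directory below is an assumption for illustration, not the question's real data.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus standing in for ./seg_corpus
corpus_dir = tempfile.mkdtemp()
docs = {'a.txt': 'big data set', 'b.txt': 'big memory limit'}
for name, text in docs.items():
    with open(os.path.join(corpus_dir, name), 'w') as f:
        f.write(text)

# input='filename': fit_transform receives paths, not text or handles
filenames = [os.path.join(corpus_dir, name) for name in sorted(docs)]
vectorizer = CountVectorizer(input='filename')
X = vectorizer.fit_transform(filenames)  # sparse document-term matrix

print(sorted(vectorizer.vocabulary_))  # ['big', 'data', 'limit', 'memory', 'set']
print(X.shape)  # (2, 5)
```

Note that with input='file' every handle in the list is open simultaneously, which can hit the OS open-file limit on very large corpora; input='filename' avoids that.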

