Memory efficient LDA training using gensim library


Question

Today I started writing a script that trains LDA models on large corpora (at least 30M sentences) using the gensim library. Here is the current code I am using:

import logging

import gensim
from gensim import corpora

def train_model(fname):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    # The dictionary is built from a generator, so the raw text is streamed.
    dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
    print "DOC2BOW"
    # This list comprehension materializes the entire bag-of-words corpus in RAM.
    corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]

    print "running LDA"
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                          num_topics=100, update_every=1,
                                          chunksize=10000, passes=1)

Running this script on a small corpus (2M sentences), I realized that it needs about 7GB of RAM, and when I try to run it on the larger corpora it fails with a memory error. The problem is clearly that I load the entire corpus into memory with this command:

corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]

But I don't see another way, because I need the corpus to call the LdaModel() method:

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, update_every=1, chunksize=10000, passes=1)

I searched for a solution to this problem but could not find anything helpful. I would imagine it is a common problem, since these models are mostly trained on very large corpora (usually Wikipedia documents), so there should already be a solution for it.

Any ideas about this issue and how to solve it?

Answer

Consider wrapping your corpus up as an iterable and passing that instead of a list. A generator will not work, because gensim may need to iterate over the corpus more than once, and a generator is exhausted after a single traversal.

From the tutorial:

class MyCorpus(object):
    def __iter__(self):
        # fname and dictionary come from the surrounding scope, as in the question
        for line in open(fname):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
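
A related option: if re-reading and re-tokenizing the raw text on every iteration becomes a bottleneck, the streamed corpus can be serialized to disk once in Matrix Market format and streamed back from there. This is a sketch assuming the MyCorpus class above is in scope; '/tmp/corpus.mm' is just a placeholder path:

# Write the bag-of-words vectors to disk once; only one document
# is held in memory at a time during serialization.
corpora.MmCorpus.serialize('/tmp/corpus.mm', MyCorpus())

# MmCorpus streams documents back from disk, so memory use stays flat
# even when the model iterates over the corpus several times.
mm_corpus = corpora.MmCorpus('/tmp/corpus.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm_corpus,
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)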

Additionally, gensim comes with several corpus formats ready to use, which can be found in the API reference. You might consider using TextCorpus, which should already fit your format nicely:

corpus = gensim.corpora.TextCorpus(fname)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                      id2word=corpus.dictionary, # TextCorpus can build the dictionary for you
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
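
Either way, once training finishes the model can be inspected and persisted with gensim's standard calls; 'lda.model' below is just an example filename:

# Print the ten most prominent topics as weighted word mixtures.
print lda.print_topics(10)

# Save the trained model to disk; reload it later with LdaModel.load('lda.model').
lda.save('lda.model')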

