IndexError when trying to update gensim's LdaModel


Problem description

I am facing the following error when trying to update my gensim LdaModel:

IndexError: index 6614 is out of bounds for axis 1 with size 6614

I checked why other people were having this issue in this thread, but I am using the same dictionary from beginning to end, which was their mistake.

As I have a big dataset, I am loading it chunk by chunk (using pickle.load). I am building the dictionary iteratively with this piece of code:

import pickle
from time import time
from gensim.corpora import Dictionary

fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
dictionary = Dictionary()
chunk_no = 0
while 1:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda)
        chunk_no += 1
        dictionary.add_documents(documents_lda)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1-t0))
    except EOFError:
        print("Finished going through pickle")
        break

Once the dictionary is built for the whole dataset, I train the model in the same iterative fashion:

from gensim.models import LdaModel

# dictionary comes from the loop above; no_topics is defined elsewhere
fr_documents_lda = open("documents_lda_40_rails_30_ruby_full.dat", 'rb')
first_iter = True
chunk_no = 0
lda_gensim = None
while 1:
    try:
        t0 = time()
        documents_lda = pickle.load(fr_documents_lda)
        chunk_no += 1
        corpus = [dictionary.doc2bow(text) for text in documents_lda]
        if first_iter:
            first_iter = False
            lda_gensim = LdaModel(corpus, num_topics=no_topics, iterations=100, offset=50., random_state=0, alpha='auto')
        else:
            lda_gensim.update(corpus)
        t1 = time()
        print("Chunk number {0} took {1:.2f}s".format(chunk_no, t1-t0))
    except EOFError:
        print("Finished going through pickle")
        break

I also tried updating the dictionary at every chunk, i.e. having

dictionary.add_documents(documents_lda)

right before

corpus = [dictionary.doc2bow(text) for text in documents_lda]

in the last piece of code. Finally, I tried setting the allow_update argument of doc2bow to True. Nothing works.
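
For reference, those attempted variants look roughly like this (a sketch of the changes described above, not the eventual fix):

# Variant 1: grow the dictionary with each chunk before vectorizing it
dictionary.add_documents(documents_lda)
corpus = [dictionary.doc2bow(text) for text in documents_lda]

# Variant 2: let doc2bow extend the dictionary on the fly
corpus = [dictionary.doc2bow(text, allow_update=True) for text in documents_lda]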

FYI, the size of my final dictionary is 85k. The size of the dictionary built only from the first chunk is 10k. The error occurs on the second iteration, when it enters the else branch and calls the update method.

The error is raised by the line expElogbetad = self.expElogbeta[:, ids], called by gamma, sstats = self.inference(chunk, collect_sstats=True), itself called by gammat = self.do_estep(chunk, other), itself called by lda_gensim.update(corpus).
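
A quick way to see the mismatch behind that traceback (a hypothetical check, assuming the variables from the loops above):

# The model's topic-term matrix was sized from the first chunk's corpus
# (axis 1 has size 6614 here), but doc2bow uses the full dictionary, so a
# later chunk can emit term ids >= 6614 and index past that axis.
print(lda_gensim.expElogbeta.shape)              # second dimension is 6614
print(max(i for doc in corpus for i, _ in doc))  # can be >= 6614 on later chunks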

Does anyone have an idea how to fix this, or what is happening?

Thanks.

Answer

The solution is simply to initialize the LdaModel with the argument id2word=dictionary.
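
Applied to the question's training loop, the first-iteration branch would look roughly like this (a sketch assuming the same variables; only id2word is new):

# Pass the full dictionary so the model's vocabulary covers every chunk,
# not just the terms seen in the first corpus it was created from.
lda_gensim = LdaModel(corpus,
                      num_topics=no_topics,
                      id2word=dictionary,   # full Dictionary built over all chunks
                      iterations=100, offset=50., random_state=0, alpha='auto')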

If you don't do that, it assumes that your vocabulary size is the vocabulary size of the first set of documents you train it on, and can't update it. In fact, it sets its num_terms value to the length of id2word once there, and never updates it afterwards (you can verify this in the update function).
