LDA model generates different topics every time I train on the same corpus


Question

I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.

Why do the same LDA parameters and corpus generate different topics every time?

How can I stabilize topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code:

from gensim import corpora
from gensim.models import ldamodel
import codecs

# Read the stopword list, skipping comment lines and blank lines.
stopwords = [i.strip() for i in codecs.open('stopmild', 'r', 'utf8').readlines()
             if i.strip() and not i.startswith("#")]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print(i)
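As an aside, the string handling in generateTopics assumes the old gensim show_topics format, where each topic arrives as a single string of "weight*word" terms joined by " + ". A minimal sketch of that parsing step in isolation (the example topic string is hypothetical; no gensim needed):

```python
def parse_topic(topic_str):
    """Split an old-style gensim topic string into (weight, word) pairs."""
    pairs = []
    for term in topic_str.split(" + "):
        weight, word = term.split("*")
        pairs.append((weight, word))
    return pairs

# Hypothetical topic string in the format the code above assumes.
print(parse_topic("0.083*computer + 0.022*human"))
```

Note that newer gensim releases quote the words (e.g. 0.083*"computer"), so the split would need adjusting there.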

Answer

Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

How do I stabilize topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed:

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
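To see why reseeding works: any model whose initialization draws from numpy.random becomes deterministic once the global seed is reset before each run. A minimal sketch with plain numpy standing in for the model training (no gensim required):

```python
import numpy as np

SOME_FIXED_SEED = 42

def pseudo_init():
    """Stand-in for a randomized model initialization."""
    np.random.seed(SOME_FIXED_SEED)  # reset the global seed before each run
    return np.random.rand(5)

run_a = pseudo_init()
run_b = pseudo_init()
print(np.allclose(run_a, run_b))  # the two "trainings" now produce identical draws
```

Newer gensim versions also accept a random_state argument on LdaModel (e.g. LdaModel(corpus, id2word=dictionary, num_topics=50, random_state=SOME_FIXED_SEED)), which scopes the seed to the model instead of mutating global state; check your installed gensim version before relying on it.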
