LDA model generates different topics every time I train on the same corpus


Problem description

I am using Python's gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.

Why do the same LDA parameters and corpus generate different topics every time?

And how do I stabilize the topic generation?

I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

####################################################################### 

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
             for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print i

Recommended answer

Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, using numpy.random.seed:

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)
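
To see why resetting the seed matters, here is a minimal sketch that uses only numpy. The function random_topic_init and the sizes NUM_TOPICS and NUM_TERMS are hypothetical stand-ins for a randomly initialized topic-word matrix; this is not gensim's actual training code. It only illustrates that draws made after the same seed are identical, while draws made without a reset are not.

```python
import numpy as np

SOME_FIXED_SEED = 42
NUM_TOPICS, NUM_TERMS = 3, 5  # hypothetical sizes, for illustration only


def random_topic_init():
    # Stand-in for a randomly initialized topic-word matrix;
    # this is not gensim's actual training code.
    return np.random.gamma(100.0, 1.0 / 100.0, (NUM_TOPICS, NUM_TERMS))


np.random.seed(SOME_FIXED_SEED)
first = random_topic_init()

np.random.seed(SOME_FIXED_SEED)  # reset before the second "run"
second = random_topic_init()

unseeded = random_topic_init()  # no reset: continues the random stream

print(np.array_equal(first, second))    # same seed, same draws
print(np.array_equal(first, unseeded))  # no reset, different draws
```

The same reasoning applies to a real model: if the seed is not reset before each training or inference call, every call starts from a different point in the random stream.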

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
