Is there a way to infer topic distributions on unseen documents from a pretrained gensim LDA model using matrix multiplication?

Problem Description

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model, without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, I could use the model in JavaScript.

For example, I tried the following:

import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')


stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def StemmAndLemmatize(token):
    # Stand-in for the helper omitted from the original post:
    # lemmatize the token (as a verb), then stem it
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def Preprocesser(text):

    smallestWordSize = 3
    processedList = []

    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))

    return processedList

lda_model = models.LdaModel.load(r'LDAModel\GoldModel')  #Load pretrained LDA model
dictionary = Dictionary.load(r"ModelTrain\ManDict")      #Load dictionary the model was trained on

#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()    #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)                #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)       #Create bow using dictionary
dictSize = len(termTopicMatrix[0])          #Number of terms in the dictionary
fullDict = np.zeros(dictSize)               #Initialize dense array of dictionary length
First = [first[0] for first in bowDoc]      #Indices of the terms in the bag of words
Second = [second[1] for second in bowDoc]   #Frequencies of the terms in the bag of words
fullDict[First] = Second                    #Scatter the word frequencies into the dense array


print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])

Output:
Matrix Multiplication: 
 [0.0283254  0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
 0.01558603 0.0370233  0.04648389 0.02887623 0.00776652 0.02147539
 0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
 0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
 0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
 0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax: 
 [(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]

In the pretrained model there are 35 topics and 1155 words.

In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.

For example, lda_model[unseenDoc] shows that topic 0 has a probability of 0.07, but the matrix multiplication method says that topic has a probability of 0.028. Am I missing a step here?

Recommended Answer

You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:

https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283

(It also makes use of the inference() method in the same file.)

It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and adjust yours until the steps match up.
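For illustration, here is a minimal Python sketch of the kind of variational update inference() performs for a single document, assuming the loaded lda_model and a bowDoc list of (word_id, count) tuples as in your code. It is not gensim's exact implementation (it omits the convergence check, among other details), but it shows the iterative re-weighting that a single dot product skips:

import numpy as np
from scipy.special import psi   # digamma function

def infer_doc_topics(lda_model, bow, iterations=100, epsilon=1e-100):
    ids = [idx for idx, _ in bow]
    cts = np.array([cnt for _, cnt in bow], dtype=float)

    # exp of the expected log topic-word distribution, restricted to
    # this document's words; shape: (num_topics, num_doc_words)
    expElogbeta = lda_model.expElogbeta[:, ids]

    # Random initialization of the document-topic variational parameter gamma
    gamma = np.random.gamma(100., 1. / 100., lda_model.num_topics)

    for _ in range(iterations):
        Elogtheta = psi(gamma) - psi(np.sum(gamma))   # dirichlet expectation
        expElogtheta = np.exp(Elogtheta)
        phinorm = expElogtheta @ expElogbeta + epsilon  # avoid division by zero
        gamma = lda_model.alpha + expElogtheta * ((cts / phinorm) @ expElogbeta.T)

    # get_document_topics() then normalizes gamma to sum to 1 and drops
    # topics whose probability falls below minimum_probability
    return gamma / gamma.sum()

print('Iterative inference: \n', infer_doc_topics(lda_model, bowDoc))

That final normalize-and-clip step is also why lda_model[bowDoc] returns only a few (topic, probability) pairs while your dense vector has a value for every topic.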

It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
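For example, here is a minimal sketch of exporting the state such a port would need, assuming the lda_model and dictionary loaded above (the file name lda_state.json is just an illustration):

import json
import numpy as np

# Save the pieces of model state a JavaScript re-implementation would need:
# the per-topic word weights, the document-topic prior alpha, and the
# word -> id mapping used to build bags of words.
state = {
    'expElogbeta': lda_model.expElogbeta.tolist(),
    'alpha': np.asarray(lda_model.alpha).tolist(),
    'token2id': dict(dictionary.token2id),
}

with open('lda_state.json', 'w') as f:
    json.dump(state, f)

With those three pieces, the loop sketched above can be re-implemented in JavaScript (the digamma function would need its own implementation or a numeric library).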
