Is there a way to infer topic distributions on unseen documents from a pretrained gensim LDA model using matrix multiplication?

Problem Description

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model, without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, I could use the model in JavaScript.

For example, I tried the following:

import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')


stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def StemmAndLemmatize(token):
    # Stand-in for the helper omitted from the original post:
    # lemmatize the token (as a verb), then stem it
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def Preprocesser(text):

    smallestWordSize = 3
    processedList = []

    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))

    return processedList

lda_model = models.LdaModel.load(r'LDAModel\GoldModel')  #Load pretrained LDA model
dictionary = Dictionary.load(r"ModelTrain\ManDict")      #Load dictionary the model was trained on

#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()    #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)                #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)       #Create bow using dictionary
dictSize = len(termTopicMatrix[0])          #Number of terms in the dictionary
fullDict = np.zeros(dictSize)               #Initialize dense array of dictionary length
First = [first[0] for first in bowDoc]      #Indices of the terms in the bag of words
Second = [second[1] for second in bowDoc]   #Frequencies of the terms in the bag of words
fullDict[First] = Second                    #Scatter the word frequencies into the dense array


print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])

Output:
Matrix Multiplication: 
 [0.0283254  0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
 0.01558603 0.0370233  0.04648389 0.02887623 0.00776652 0.02147539
 0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
 0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
 0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
 0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax: 
 [(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]

In the pretrained model there are 35 topics and 1155 words.

In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.

For example, lda_model[unseenDoc] shows that topic 0 has a probability of 0.07, but the matrix multiplication method says that topic has a probability of 0.028. Am I missing a step here?

Recommended Answer

You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:

https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283

(It also makes use of the inference() method in the same file.)

It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, and adjust yours until the steps match up.
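For illustration, here is a minimal Python sketch of the kind of variational update inference() performs for a single document, assuming the loaded lda_model and a bowDoc list of (word_id, count) tuples as in your code. It is not gensim's exact implementation (it omits the convergence check, among other details), but it shows the iterative re-weighting that a single dot product skips:

import numpy as np
from scipy.special import psi   # digamma function

def infer_doc_topics(lda_model, bow, iterations=100, epsilon=1e-100):
    ids = [idx for idx, _ in bow]
    cts = np.array([cnt for _, cnt in bow], dtype=float)

    # exp of the expected log topic-word distribution, restricted to
    # this document's words; shape: (num_topics, num_doc_words)
    expElogbeta = lda_model.expElogbeta[:, ids]

    # Random initialization of the document-topic variational parameter gamma
    gamma = np.random.gamma(100., 1. / 100., lda_model.num_topics)

    for _ in range(iterations):
        Elogtheta = psi(gamma) - psi(np.sum(gamma))   # dirichlet expectation
        expElogtheta = np.exp(Elogtheta)
        phinorm = expElogtheta @ expElogbeta + epsilon  # avoid division by zero
        gamma = lda_model.alpha + expElogtheta * ((cts / phinorm) @ expElogbeta.T)

    # get_document_topics() then normalizes gamma to sum to 1 and drops
    # topics whose probability falls below minimum_probability
    return gamma / gamma.sum()

print('Iterative inference: \n', infer_doc_topics(lda_model, bowDoc))

That final normalize-and-clip step is also why lda_model[bowDoc] returns only a few (topic, probability) pairs while your dense vector has a value for every topic.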

It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
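For example, here is a minimal sketch of exporting the state such a port would need, assuming the lda_model and dictionary loaded above (the file name lda_state.json is just an illustration):

import json
import numpy as np

# Save the pieces of model state a JavaScript re-implementation would need:
# the per-topic word weights, the document-topic prior alpha, and the
# word -> id mapping used to build bags of words.
state = {
    'expElogbeta': lda_model.expElogbeta.tolist(),
    'alpha': np.asarray(lda_model.alpha).tolist(),
    'token2id': dict(dictionary.token2id),
}

with open('lda_state.json', 'w') as f:
    json.dump(state, f)

With those three pieces, the loop sketched above can be re-implemented in JavaScript (the digamma function would need its own implementation or a numeric library).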
