Saved Gensim LdaMallet model not working in a different console
Problem description
I am training an LdaMallet model in Python and saving it. I am also saving the training dictionary so that I can later use it to create a corpus for unseen documents. If I perform every action (train a model, save the trained model, load the saved model, infer on an unseen corpus) within the same console, everything works fine. However, I want to use the trained model in a different console / on a different computer.
I passed a prefix while training so I could look at the temp files created by the model. The following files are created when the model is trained:
'corpus.mallet'
'corpus.txt'
'doctopics.txt'
'inferencer.mallet'
'state.mallet.gz'
'topickeys.txt'
Now, when I load the saved model in a different console and infer on an unseen corpus created with the saved dictionary, I can see that no additional temp files are created, and the following error is produced:
FileNotFoundError: No such file or directory: 'my_directory\\doctopics.txt.infer'
For some odd reason, if I load the saved model in the same console (the one it was trained in) and infer on an unseen corpus as above, 'corpus.txt' is updated and two new temp files are created:
'corpus.mallet.infer'
'doctopics.txt.infer'
Any idea why I might be having this issue?
I have tried using LdaModel instead of LdaMallet, and LdaModel works fine regardless of whether I perform the whole task in the same console or in a different one.
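That difference is consistent with how the two models work: LdaModel is pure Python, while the Mallet wrapper shells out to the Mallet binary and keeps absolute paths to its temp files (derived from the prefix) inside the saved object. The mechanism can be sketched with a toy stand-in class (`TinyWrapper` is hypothetical, not the real gensim class):

```python
import pickle

class TinyWrapper:
    """Toy stand-in for gensim's LdaMallet wrapper: it stores an absolute
    prefix and derives every temp-file path from it, the way the real
    wrapper's path helpers do."""
    def __init__(self, prefix):
        self.prefix = prefix

    def fdoctopics(self):
        # Path of the doc-topics file Mallet is expected to write and read.
        return self.prefix + 'doctopics.txt'

# "Save" on the training machine, "load" on another machine:
saved = pickle.dumps(TinyWrapper('C:\\train_machine\\mallet_temp\\'))
loaded = pickle.loads(saved)

# The training machine's path travels with the model; if that directory does
# not exist where the model is loaded, inference fails with FileNotFoundError
# on paths like '...\\doctopics.txt.infer'.
print(loaded.fdoctopics())  # C:\train_machine\mallet_temp\doctopics.txt
```

This is why saving and loading in the same console works: the prefix directory still exists there.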
Below is the snippet of the code I am using.
import os
import gensim
import pandas as pd
from gensim import corpora
from gensim.models import CoherenceModel

def find_optimum_model(self):
    lemmatized_words = self.lemmatization()
    id2word = corpora.Dictionary(lemmatized_words)
    all_corpus = [id2word.doc2bow(text) for text in lemmatized_words]
    # For the three lines below, update with your path to new_mallet
    os.environ['MALLET_HOME'] = r'C:\\users\\axk0er8\\Sentiment_Analysis_Working\\new_mallet\\mallet-2.0.8'
    mallet_path = r'C:\\users\\axk0er8\\Sentiment_Analysis_Working\\new_mallet\\mallet-2.0.8\\bin\\mallet.bat'
    prefix_path = r'C:\\users\\axk0er8\\Sentiment_Analysis_Working\\new_mallet\\mallet_temp\\'

    def compute_coherence_values(dictionary, all_corpus, texts, limit, start=2, step=4):
        coherence_values = []
        model_list = []
        num_topics_list = []
        for num_topics in range(start, limit, step):
            model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=all_corpus, num_topics=num_topics,
                                                     id2word=dictionary, random_seed=42)
            # model = gensim.models.ldamodel.LdaModel(corpus=all_corpus, num_topics=num_topics, id2word=dictionary,
            #                                         eval_every=1, alpha='auto', random_state=42)
            model_list.append(model)
            coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
            coherence_values.append(coherencemodel.get_coherence())
            num_topics_list.append(num_topics)
        return model_list, coherence_values, num_topics_list

    model_list, coherence_values, num_topics_list = compute_coherence_values(dictionary=id2word, all_corpus=all_corpus,
                                                                             texts=lemmatized_words, start=5, limit=40, step=6)
    model_values_df = pd.DataFrame({'model_list': model_list, 'coherence_values': coherence_values,
                                    'num_topics': num_topics_list})
    optimal_num_topics = model_values_df.loc[model_values_df['coherence_values'].idxmax()]['num_topics']
    optimal_model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=all_corpus, num_topics=optimal_num_topics,
                                                     id2word=id2word, prefix=prefix_path, random_seed=42)
    # joblib.dump(id2word, 'id2word_dictionary_mallet.pkl')
    # joblib.dump(optimal_model, 'optimal_ldamallet_model.pkl')
    id2word.save('id2word_dictionary.gensim')
    optimal_model.save('optimal_lda_model.gensim')
def generate_dominant_topic(self):
    lemmatized_words = self.lemmatization()
    id2word = corpora.Dictionary.load('id2word_dictionary.gensim')
    # id2word = joblib.load('id2word_dictionary_mallet.pkl')
    new_corpus = [id2word.doc2bow(text) for text in lemmatized_words]
    optimal_model = gensim.models.wrappers.LdaMallet.load('optimal_lda_model.gensim')
    # optimal_model = joblib.load('optimal_ldamallet_model.pkl')

    def format_topics_sentences(ldamodel, new_corpus):
        sent_topics_df = pd.DataFrame()
        for i, row in enumerate(ldamodel[new_corpus]):
            row = sorted(row, key=lambda x: (x[1]), reverse=True)
            for j, (topic_num, prop_topic) in enumerate(row):
                if j == 0:
                    wp = ldamodel.show_topic(topic_num)
                    topic_keywords = ", ".join([word for word, prop in wp])
                    sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                                                           ignore_index=True)
                else:
                    break
        sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
        return sent_topics_df
My expectation is to use the find_optimum_model function with the training data and save the optimum model and dictionary. Once saved, use the generate_dominant_topic function to load the saved model and dictionary, create a corpus for the unseen text, and run the model to get the desired topic-modeling output.
Recommended answer
After loading the model, you can specify the new prefix path like so:
ldamodel.prefix = 'path/to/new/prefix'
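Applied to the question's code, a sketch of loading on a second machine could look like the following (this assumes gensim 3.x, where the wrapper lives in gensim.models.wrappers, and that Mallet is installed on the new machine; the `C:\new_machine\...` paths are placeholders, not real paths):

```python
import gensim

# Load the model saved by find_optimum_model().
optimal_model = gensim.models.wrappers.LdaMallet.load('optimal_lda_model.gensim')

# Point the wrapper at a temp directory and Mallet binary that exist on
# THIS machine; the saved object still carries the training machine's paths.
optimal_model.prefix = 'C:\\new_machine\\mallet_temp\\'
optimal_model.mallet_path = 'C:\\new_machine\\mallet-2.0.8\\bin\\mallet.bat'

# Inference now writes corpus.mallet.infer / doctopics.txt.infer under the
# new prefix instead of raising FileNotFoundError:
# doc_topics = optimal_model[new_corpus]
```

The prefix directory must already exist and be writable, since Mallet creates its `.infer` files there during inference.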