Latent Dirichlet allocation (LDA) in Spark - replicate model


Problem description

I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training & test data-set after saving. However, the results diverge despite setting a seed. My code is the following:

1) Import packages

from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id

2) Prepare the data-set

countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())    
corpus = result_tfidf.select("id", "features")

3) Train the LDA model

lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)  
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)

4) Replicate the model

#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)   
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")

#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)  
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)

The following results are different despite my assumption that they would be equal: topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also, when inspecting the extracted topics they are different, as are the predicted classes on the dataset).
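For reference, a rough sketch of the kind of comparison being made, reusing the variables from the code above; "topicDistribution" is the default output column of LDA.transform in Spark ML:

# Join the two transform outputs on the document id and inspect the
# per-document topic distributions; with a fixed seed one would expect
# the two columns to match row by row.
comparison = (modelled_corpus.select("id", "topicDistribution")
              .join(modelled_corpus_new
                    .withColumnRenamed("topicDistribution", "topicDistribution_new")
                    .select("id", "topicDistribution_new"), on="id"))
comparison.show(5, truncate=False)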

So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?

Thanks :)

Recommended answer

I was facing a similar kind of problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.

After trying a multitude of things, I came up with the following solution:

  1. Save cv_model after fitting it once and load it in subsequent runs rather than re-fitting it, as sketched below.
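A minimal sketch of that first point, reusing the variable names from the question; the path "cv_model_saved" is an assumption, not from the original answer:

from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

# First run: fit the vectorizer once and persist the fitted model.
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete",
                               outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
cv_model.save("cv_model_saved")

# Subsequent runs: load the fitted model instead of re-fitting, so the
# vocabulary (and hence the feature indices fed to LDA) stays identical.
cv_model = CountVectorizerModel.load("cv_model_saved")
result_tf = cv_model.transform(tokenized_stopwords_sample_df)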

The second point is more related to my data set. Some of the documents in the corpus I was using were very small (around 3 words per document). I filtered out these documents and set a threshold so that only documents with at least 15 words (it may need to be higher in your case) are included in the corpus. I am not sure why this worked; it may be related to the underlying complexity of the model.
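A hedged sketch of that filtering step, reusing the token column from the question; the 15-word threshold follows the answer, and filtered_df is a hypothetical name for the result:

from pyspark.sql.functions import size

# Keep only documents with at least 15 tokens; shorter documents are dropped
# before fitting the CountVectorizer / LDA pipeline.
filtered_df = tokenized_stopwords_sample_df.filter(
    size("requester_instruction_words_filtered_complete") >= 15)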

All in all, my results are now the same even after several iterations. Hope this helps.

