Latent Dirichlet allocation (LDA) in Spark - replicate model


Problem description

I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training & test data-set after saving. However, the results diverge despite setting a seed. My code is the following:

1) Import packages

from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id

2) Prepare the data-set

countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())    
corpus = result_tfidf.select("id", "features")

3) Train the LDA model

lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)  
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)

4) Replicate the model

#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)   
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")

#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)  
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)

The following results are different despite my assumption that they would be equal: topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also, when inspecting the extracted topics they are different, as are the predicted classes on the dataset).
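For reference, a rough sketch of the kind of comparison being made, reusing the variables from the code above; "topicDistribution" is the default output column of LDA.transform in Spark ML:

# Join the two transform outputs on the document id and inspect the
# per-document topic distributions; with a fixed seed one would expect
# the two columns to match row by row.
comparison = (modelled_corpus.select("id", "topicDistribution")
              .join(modelled_corpus_new
                    .withColumnRenamed("topicDistribution", "topicDistribution_new")
                    .select("id", "topicDistribution_new"), on="id"))
comparison.show(5, truncate=False)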

So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?

Thanks :)

Recommended answer

I was facing a similar kind of problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.

After trying a multitude of things, I came up with the following solution:

  1. Save cv_model after fitting it once and load it in subsequent runs rather than re-fitting it, as sketched below.
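A minimal sketch of that first point, reusing the variable names from the question; the path "cv_model_saved" is an assumption, not from the original answer:

from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

# First run: fit the vectorizer once and persist the fitted model.
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete",
                               outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
cv_model.save("cv_model_saved")

# Subsequent runs: load the fitted model instead of re-fitting, so the
# vocabulary (and hence the feature indices fed to LDA) stays identical.
cv_model = CountVectorizerModel.load("cv_model_saved")
result_tf = cv_model.transform(tokenized_stopwords_sample_df)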

The second point is more related to my data set. Some of the documents in the corpus I was using were very small (around 3 words per document). I filtered out these documents and set a threshold so that only documents with at least 15 words (it may need to be higher in your case) are included in the corpus. I am not sure why this worked; it may be related to the underlying complexity of the model.
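A hedged sketch of that filtering step, reusing the token column from the question; the 15-word threshold follows the answer, and filtered_df is a hypothetical name for the result:

from pyspark.sql.functions import size

# Keep only documents with at least 15 tokens; shorter documents are dropped
# before fitting the CountVectorizer / LDA pipeline.
filtered_df = tokenized_stopwords_sample_df.filter(
    size("requester_instruction_words_filtered_complete") >= 15)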

All in all, my results are now the same even after several iterations. Hope this helps.

