Spark 2.1.1: How to predict topics in unseen documents on already trained LDA model in Spark 2.1.1?


Problem Description

I am training an LDA model in pyspark (Spark 2.1.1) on a dataset of customer reviews. Now, based on that model, I want to predict the topics in new, unseen text.

I am using the following code to build the model:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F


path = "D:/sparkdata/sample_text_LDA.txt"
sc = SparkContext("local[*]", "review")
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)

data = (df.select("Reviews").rdd
          .map(list)
          .map(lambda x: x[0])
          .zipWithIndex()
          .map(lambda words: Row(idd=words[1], words=words[0].split(" ")))
          .collect())

docDF = spark.createDataFrame(data)
remover = StopWordsRemover(inputCol="words", outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
Vector = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
model = Vector.fit(stopWordsRemoved_df)
result = model.transform(stopWordsRemoved_df)
# convert the ml vectors to mllib vectors, which pyspark.mllib's LDA.train expects
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0], Vectors.fromML(x[1])]).cache()

# Cluster the documents into topics using LDA
ldaModel = LDA.train(corpus, k=3, maxIterations=100, optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary
print(ldaModel.describeTopics())
wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic=wordNumbers))

def topic_render(topic):  # map term indices in a topic back to actual words
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        term = vocabArray[terms[i]]
        result.append(term)
    return result

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
    print("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print(term)
    print('\n')

Now I have a dataframe with a column containing new customer reviews, and I want to predict which topic cluster each of them belongs to. I have searched for answers; mostly the following way is recommended, as in Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

// Scala API from the linked answer
newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

However, I get the following error:

'LDAModel' object has no attribute 'toLocal'. Nor does it have a topicDistribution attribute.

So are these attributes not supported in Spark 2.1.1?

Is there any other way to infer topics from unseen data?

Recommended Answer

You're going to need to pre-process the new data:

import pandas as pd
import gensim

# import a new data set to be passed through the pre-trained LDA
data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index

documents_new = data_text_new

# run the new data set through the same lemmatization and stopword-removal
# function ('preprocess') that was used when the model was trained
processed_docs_new = documents_new['Your Target Column'].map(preprocess)

# create a dictionary of individual words and filter the dictionary
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]
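One caveat worth noting: for the word ids in bow_corpus_new to line up with what the model actually learned, the doc2bow step generally has to use the dictionary the model was trained on rather than a freshly built one. A minimal sketch, assuming the training-time dictionary is still available as dictionary:

# reuse the training-time dictionary so word ids match the trained model's vocabulary
bow_corpus_new = [dictionary.doc2bow(doc) for doc in processed_docs_new]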

Then you can just pass it through the trained LDA model; all you need is that bow_corpus:

ldamodel[bow_corpus_new]
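Each entry of that result is a list of (topic_id, probability) pairs, one per new document. A minimal sketch of how to inspect it, assuming ldamodel is the trained gensim model:

# print the inferred topic distribution for each new document
for doc_id, dist in enumerate(ldamodel[bow_corpus_new]):
    print(doc_id, dist)  # e.g. 0 [(0, 0.72), (1, 0.19), (2, 0.09)]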

If you want the output in a CSV, try this:

# collect the per-document topic distributions and write them out alongside the text
a = ldamodel[bow_corpus_new]
b = data_text_new

topic_0 = []
topic_1 = []
topic_2 = []

# each entry of 'a' is a list of (topic_id, probability) pairs;
# this assumes all three topics are reported for every document
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])

d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}

df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode='a')

I hope this helps :)
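For completeness: it is also possible to stay entirely inside Spark by switching to the DataFrame-based API. In Spark 2.1.1, pyspark.ml.clustering.LDA produces a model whose transform() adds a topicDistribution column, so unseen documents can be scored without toLocal. Below is a minimal sketch, not the asker's or answerer's code; new_docs_df is a hypothetical DataFrame of unseen reviews already split into a words column, and it reuses the remover and the fitted CountVectorizer model from the question:

from pyspark.ml.clustering import LDA

# train with the DataFrame-based API on the CountVectorizer output ('result' above)
lda = LDA(k=3, maxIter=100, optimizer="online", featuresCol="vectors")
lda_model = lda.fit(result)

# vectorize the unseen reviews with the same fitted pipeline stages
new_tokens = remover.transform(new_docs_df)    # new_docs_df: hypothetical DataFrame with a 'words' column
new_vectors = model.transform(new_tokens)

# transform() adds a 'topicDistribution' vector column with one weight per topic
predicted = lda_model.transform(new_vectors)
predicted.select("topicDistribution").show(truncate=False)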
