From TF-IDF to LDA clustering in Spark (PySpark)

Views: 314 Published: 2020/4/30 8:39:13 Tags: python apache-spark pyspark tf-idf lda


Problem description

I am trying to cluster tweets stored in the format key,listofwords.

My first step was to extract TF-IDF values for the list of words, using a DataFrame:

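from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.types import StringType, StructField, StructType

# sc and sqlContext are assumed to be provided by the PySpark shell
dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)
# Define data frame schema
fields = [StructField('key', StringType(), False), StructField('content', StringType(), False)]
schema = StructType(fields)
# Data in format <key>,<listofwords>; split on the first comma only so each row has exactly two fields
file_temp = file.map(lambda l: l.split(",", 1))
file_df = sqlContext.createDataFrame(file_temp, schema)
# Extract TF-IDF, following https://spark.apache.org/docs/1.5.2/ml-features.html
# Tokenize the content column into a list of lowercase words
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
# Hash each word into one of 1000 buckets and count occurrences (term frequencies)
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
# Fit IDF weights on the corpus and rescale the term frequencies
idf = IDF(inputCol='rawFeatures', outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)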
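
To make the two stages above concrete, here is a minimal pure-Python sketch of what a hashing term-frequency step followed by IDF rescaling computes. It is an illustration under assumptions, not Spark's implementation: `zlib.crc32` is a stand-in for Spark's own hash function, the helper names `hashed_tf`, `idf_weights` and `tf_idf` are hypothetical, and the sample documents are made up. The IDF formula log((m + 1) / (df + 1)) is the smoothed form given in the Spark MLlib feature extraction documentation.

```python
import math
import zlib
from collections import Counter


def hashed_tf(words, num_features=1000):
    """Map each word to a bucket index and count occurrences per bucket."""
    # zlib.crc32 is a stand-in hash chosen for reproducibility (an assumption);
    # Spark uses its own hash function, so bucket indices will differ.
    return Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)


def idf_weights(tf_docs):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)), m = number of documents."""
    m = len(tf_docs)
    # Document frequency: in how many documents each bucket is non-empty
    df = Counter(bucket for tf in tf_docs for bucket in tf)
    return {bucket: math.log((m + 1) / (n + 1)) for bucket, n in df.items()}


def tf_idf(tf_docs):
    """Rescale every term frequency by the IDF weight of its bucket."""
    idf = idf_weights(tf_docs)
    return [{b: c * idf[b] for b, c in tf.items()} for tf in tf_docs]


# Made-up sample corpus of two tokenized tweets
docs = [["spark", "lda", "spark"], ["spark", "tweets"]]
tf_docs = [hashed_tf(d) for d in docs]
weights = tf_idf(tf_docs)
```

A bucket that is non-empty in every document, such as the one for "spark" here, gets an IDF of log(1) = 0, so its rescaled weight is zero however often the word occurs.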

Following the suggestion in "Preparing data for LDA in spark", I tried to reformat the output into what I expect to be the input for LDA. Based on this example, I started with:

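from pyspark.ml.feature import StringIndexer
# Index the string keys as numeric ids, then drop every column except KeyIndex and features
indexer = StringIndexer(inputCol='key', outputCol='KeyIndex')
indexed_data = indexer.fit(rescaled_data).transform(rescaled_data).drop('key').drop('content').drop('words').drop('rawFeatures')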

But now I cannot find a good way to turn my DataFrame into the format proposed in the previous example or in this example.

I would be very grateful if someone could point me to the right place to look, or correct me if my approach is wrong.

I assumed that extracting TF-IDF vectors from a series of documents and clustering them would be a fairly classical thing to do, but I have failed to find an easy way to do it.
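
One possible shape for the missing reshaping step, sketched in plain Python so it can be run without a cluster. It assumes the MLlib LDA of Spark 1.5.x, which takes documents as pairs of an integer id and a term vector, and it assumes the `KeyIndex` values produced by `StringIndexer` are floats that need casting. The `Row` namedtuple, the helper `to_lda_document` and the sample values are hypothetical stand-ins, not part of the original code.

```python
from collections import namedtuple

# Stand-in for a Spark Row holding the two columns kept in indexed_data
# (hypothetical sample values; real rows would carry Spark vectors).
Row = namedtuple("Row", ["KeyIndex", "features"])


def to_lda_document(row):
    """Return [document_id, term_vector], casting the float index to an int."""
    return [int(row.KeyIndex), row.features]


rows = [Row(0.0, [0.0, 1.2, 0.0]), Row(1.0, [0.4, 0.0, 0.0])]
corpus = [to_lda_document(r) for r in rows]

# With Spark 1.5.x the same mapping would presumably be applied to the
# DataFrame's underlying RDD, for example:
#   corpus_rdd = indexed_data.rdd.map(to_lda_document)
#   model = pyspark.mllib.clustering.LDA.train(corpus_rdd, k=10)  # k=10 is arbitrary
```

The MLlib documentation describes the LDA input vectors as term counts, so whether TF-IDF weights or the raw counts in `rawFeatures` are the better input is worth checking separately.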
