Understanding Spark MLlib LDA input format
Question
I am trying to implement LDA using Spark MLlib.
But I am having difficulty understanding the input format. I was able to run its sample implementation, which takes input from a file containing only numbers, as shown below:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
I followed http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda and I understand the output format as explained there.
My use case is very simple: I have one data file with some sentences.
I want to convert this file into a corpus so that I can pass it to org.apache.spark.mllib.clustering.LDA.run().
My doubt is about what the numbers in the input represent before they are zipWithIndex'd and passed to LDA. Does the number 1, wherever it appears, represent the same word, or is it some kind of count?
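For reference, the Spark docs example parses each such line into a per-document term-count vector: position i stands for vocabulary term i, and the value is how many times that term occurs in the document. A minimal standalone sketch of that parsing (plain Scala, no Spark needed; the sample line is copied from the file above):

```scala
// Sketch: how one line of the numeric sample file is interpreted.
// Each line is one document; counts(i) is the count of vocabulary term i.
val line = "1 2 6 0 2 3 1 1 0 0 3"
val counts: Array[Double] = line.trim.split(' ').map(_.toDouble)
// e.g. counts(2) == 6.0 means vocabulary term 2 occurs six times here
```

So a 1 at two different positions counts two different words, not the same word.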
Accepted answer
First you need to convert your sentences into vectors.
// Assumes an existing SparkContext `sc`; required imports:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tokenize each line of the file into a sequence of words
val documents: RDD[Seq[String]] = sc.textFile("yourfile").map(_.split(" ").toSeq)
// Hash each word to a fixed index and count occurrences per document
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
// Re-weight the raw counts by inverse document frequency
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
// LDA expects (documentId, vector) pairs
val corpus = tfidf.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
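To see why equal numbers at different positions are unrelated: HashingTF maps each word to a vector index by hashing it, and the value stored at that index is a count. The sketch below (plain Scala, not Spark's actual hash function, just the idea) illustrates this:

```scala
// Illustrative only: buckets terms by hash code, similar in spirit to
// what HashingTF does (Spark uses its own hash function internally).
def termIndex(term: String, numFeatures: Int): Int =
  math.abs(term.##) % numFeatures

// Count how often each bucket index occurs in a document
def termFrequencies(doc: Seq[String], numFeatures: Int): Map[Int, Double] =
  doc.groupBy(termIndex(_, numFeatures)).map { case (i, ts) => (i, ts.size.toDouble) }
```

With this representation, the same word always lands at the same index, so a count of 1 at two different indices refers to two different words.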
Read more about TF-IDF vectorization here.
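For intuition on the IDF step, Spark MLlib's documented formula is idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. A one-function sketch:

```scala
// Spark MLlib's documented (smoothed) IDF formula:
// a term that appears in every document gets weight log(1) = 0
def idfWeight(numDocs: Long, docFreq: Long): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))
```

This is why TF-IDF down-weights words that occur in every sentence of your file.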