Convert Scala FP-growth RDD output to a DataFrame
Question
https://spark .apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth
sample_fpgrowth.txt can be found here: https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt
I ran the FP-growth example from the link above in Scala and it works fine, but what I need is how to convert the result, which is an RDD, into a DataFrame. Both of these RDDs:
model.freqItemsets and
model.generateAssociationRules(minConfidence)
Please explain that in detail using the example given in my question.
Answer
There are many ways to create a DataFrame once you have an RDD. One of them is to use the .toDF function, which requires sqlContext.implicits to be imported:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("udf testings")
  .master("local")
  .getOrCreate()

val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._
After that, read the fpgrowth text file and convert it into an RDD:
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)
The next step is to call the .toDF function. For the first DataFrame:
model.freqItemsets.map(itemset =>(itemset.items.mkString("[", ",", "]") , itemset.freq)).toDF("items", "freq").show(false)
which results in:
+---------+----+
|items |freq|
+---------+----+
|[z] |5 |
|[x] |4 |
|[x,z] |3 |
|[y] |3 |
|[y,x] |3 |
|[y,x,z] |3 |
|[y,z] |3 |
|[r] |3 |
|[r,x] |2 |
|[r,z] |2 |
|[s] |3 |
|[s,y] |2 |
|[s,y,x] |2 |
|[s,y,x,z]|2 |
|[s,y,z] |2 |
|[s,x] |3 |
|[s,x,z] |2 |
|[s,z] |2 |
|[t] |3 |
|[t,y] |3 |
+---------+----+
only showing top 20 rows
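Joining the items with mkString turns each itemset into a pre-formatted string. If you would rather keep the itemsets as a proper array column (so Spark SQL functions can still reach into them), you can drop the mkString. This is just an optional variation on the code above, reusing the same model:

```scala
import org.apache.spark.sql.functions._

// Variation: keep the itemsets as an array<string> column instead of a string
val freqDF = model.freqItemsets
  .map(itemset => (itemset.items.toSeq, itemset.freq))
  .toDF("items", "freq")

// For example, keep only itemsets with more than one item,
// most frequent first:
freqDF.where(size($"items") > 1).orderBy($"freq".desc).show(false)
```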
For the second DataFrame:
val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
.map(rule =>(rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
.toDF("antecedent", "consequent", "confidence").show(false)
which results in:
+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y] |[x] |1.0 |
|[t,s,y] |[z] |1.0 |
|[y,x,z] |[t] |1.0 |
|[y] |[x] |1.0 |
|[y] |[z] |1.0 |
|[y] |[t] |1.0 |
|[p] |[r] |1.0 |
|[p] |[z] |1.0 |
|[q,t,z] |[y] |1.0 |
|[q,t,z] |[x] |1.0 |
|[q,y] |[x] |1.0 |
|[q,y] |[z] |1.0 |
|[q,y] |[t] |1.0 |
|[t,s,x] |[y] |1.0 |
|[t,s,x] |[z] |1.0 |
|[q,t,y,z] |[x] |1.0 |
|[q,t,x,z] |[y] |1.0 |
|[q,x] |[y] |1.0 |
|[q,x] |[t] |1.0 |
|[q,x] |[z] |1.0 |
+----------+----------+----------+
only showing top 20 rows
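Once the rules are in a DataFrame, you can also register it as a temporary view and query it with plain SQL. A small sketch, reusing the column names defined above (the view name rules is just an illustration):

```scala
val rulesDF = model.generateAssociationRules(minConfidence)
  .map(rule => (rule.antecedent.mkString("[", ",", "]"),
                rule.consequent.mkString("[", ",", "]"),
                rule.confidence))
  .toDF("antecedent", "consequent", "confidence")

// Register the DataFrame as a temporary view and query it with SQL
rulesDF.createOrReplaceTempView("rules")
sqlContext
  .sql("SELECT * FROM rules WHERE confidence >= 0.9 ORDER BY confidence DESC")
  .show(false)
```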
I hope this is what you need.
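As a side note beyond the original answer: if you are on Spark 2.2 or later, the DataFrame-based FP-growth API in org.apache.spark.ml.fpm returns DataFrames directly, so no RDD conversion is needed at all. A sketch, reusing the transactions RDD built above:

```scala
import org.apache.spark.ml.fpm.FPGrowth

// Wrap each transaction in a single "items" column
val transactionsDF = transactions.map(Tuple1.apply).toDF("items")

val fpgModel = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.2)
  .setMinConfidence(0.8)
  .fit(transactionsDF)

fpgModel.freqItemsets.show(false)      // columns: items, freq
fpgModel.associationRules.show(false)  // columns: antecedent, consequent, confidence
```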