将scala FP增长RDD输出转换为数据帧 [英] Convert scala FP-growth RDD output to Data frame

查看:71
本文介绍了将scala FP增长RDD输出转换为数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

https://spark .apache.org/docs/2.1.0/mllib-frequent-pattern-mining.html#fp-growth

sample_fpgrowth.txt可以在这里找到, https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

sample_fpgrowth.txt can be found here, https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

我在上面的链接中以scala的形式运行了FP-growth示例,但我需要的是如何将RDD中的结果转换为数据帧. 这两个RDD

I ran the FP-growth example in the link above in scala its working fine, but what i need is, how to convert the result which is in RDD to data frame. Both these RDD

 model.freqItemsets and 
 model.generateAssociationRules(minConfidence)

用我的问题中给出的示例详细解释这一点.

explain that in detail with the example given in my question.

推荐答案

有很多方法可以在创建rdd后创建dataframe.其中之一是使用.toDF函数,该函数要求sqlContext.implicits库作为imported作为

There many ways to create a dataframe once you have a rdd. One of them is to use .toDF function which requires sqlContext.implicits library to be imported as

val sparkSession = SparkSession.builder().appName("udf testings")
  .master("local")
  .config("", "")
  .getOrCreate()
val sc = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
import sqlContext.implicits._

之后,您阅读fpgrowth文本文件并将其隐藏为rdd

After that you read the fpgrowth text file and covert into an rdd

    val data = sc.textFile("path to sample_fpgrowth.txt that you have used")
    val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

我使用了问题中提供的>频繁模式挖掘-基于RDD的API

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

下一步是调用.toDF函数

对于第一个dataframe

model.freqItemsets.map(itemset =>(itemset.items.mkString("[", ",", "]") , itemset.freq)).toDF("items", "freq").show(false)

这将导致

+---------+----+
|items    |freq|
+---------+----+
|[z]      |5   |
|[x]      |4   |
|[x,z]    |3   |
|[y]      |3   |
|[y,x]    |3   |
|[y,x,z]  |3   |
|[y,z]    |3   |
|[r]      |3   |
|[r,x]    |2   |
|[r,z]    |2   |
|[s]      |3   |
|[s,y]    |2   |
|[s,y,x]  |2   |
|[s,y,x,z]|2   |
|[s,y,z]  |2   |
|[s,x]    |3   |
|[s,x,z]  |2   |
|[s,z]    |2   |
|[t]      |3   |
|[t,y]    |3   |
+---------+----+
only showing top 20 rows

第二个dataframe

val minConfidence = 0.8
model.generateAssociationRules(minConfidence)
  .map(rule =>(rule.antecedent.mkString("[", ",", "]"), rule.consequent.mkString("[", ",", "]"), rule.confidence))
  .toDF("antecedent", "consequent", "confidence").show(false)

这将导致

+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
|[t,s,y]   |[x]       |1.0       |
|[t,s,y]   |[z]       |1.0       |
|[y,x,z]   |[t]       |1.0       |
|[y]       |[x]       |1.0       |
|[y]       |[z]       |1.0       |
|[y]       |[t]       |1.0       |
|[p]       |[r]       |1.0       |
|[p]       |[z]       |1.0       |
|[q,t,z]   |[y]       |1.0       |
|[q,t,z]   |[x]       |1.0       |
|[q,y]     |[x]       |1.0       |
|[q,y]     |[z]       |1.0       |
|[q,y]     |[t]       |1.0       |
|[t,s,x]   |[y]       |1.0       |
|[t,s,x]   |[z]       |1.0       |
|[q,t,y,z] |[x]       |1.0       |
|[q,t,x,z] |[y]       |1.0       |
|[q,x]     |[y]       |1.0       |
|[q,x]     |[t]       |1.0       |
|[q,x]     |[z]       |1.0       |
+----------+----------+----------+
only showing top 20 rows

我希望这是您所需要的

这篇关于将scala FP增长RDD输出转换为数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆