Efficiently using union in Spark


Problem description


I am new to Scala and Spark. I have two RDDs, say A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The thing is my data is large; suppose both A and B are 10 GB. I use sc.union(A, B), but it is slow. In the Spark UI I saw that this stage has 28308 tasks.

Is there a more efficient way to do this?

Recommended answer


Why don't you convert the two RDDs to DataFrames and use the union function?
Converting to a DataFrame is easy: you just need to import sqlContext.implicits._ and apply .toDF() with the column names.
For example:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()
    val sqlContext = sparkSession.sqlContext

    // Column names for the two DataFrames
    val firstTableColumns = Seq("col1", "col2")
    val secondTableColumns = Seq("col3", "col4")

    import sqlContext.implicits._

    var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)
    val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)

    // union concatenates rows by position; the result keeps firstDF's column names
    firstDF = firstDF.union(secondDF)


It should be much easier for you to work with DataFrames than with RDDs. Changing a DataFrame back to an RDD is quite easy too; just call the .rdd function:

    val rddData = firstDF.rdd
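Putting the pieces together, here is a minimal end-to-end sketch of the RDD → DataFrame → union → RDD round trip, assuming a local SparkSession (the object name UnionExample and the variable names are introduced here for illustration). Note that DataFrame.union matches columns by position, not by name, so the result carries the first DataFrame's column names.

```scala
import org.apache.spark.sql.SparkSession

object UnionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build the two small DataFrames from in-memory sequences
    val a = Seq((1, 2), (2, 3)).toDF("col1", "col2")
    val b = Seq((4, 5), (5, 6)).toDF("col3", "col4")

    // union is positional: the result has a's column names (col1, col2)
    val unioned = a.union(b)

    // Drop back to an RDD of Rows when RDD operations are needed
    val unionedRdd = unioned.rdd
    println(unionedRdd.count())  // 4 rows in total

    spark.stop()
  }
}
```

If the slowness comes from a very large number of partitions (the 28308 tasks mentioned in the question), it can also help to reduce them with coalesce after the union.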

