Spark: what's the best strategy for joining a 2-tuple-key RDD with a single-key RDD?


Question

I have two RDDs that I want to join, and they look like this:

val rdd1:RDD[(T,U)]
val rdd2:RDD[((T,W), V)]

It happens to be the case that the key values of rdd1 are unique and also that the tuple-key values of rdd2 are unique. I'd like to join the two data sets so that I get the following rdd:

val rdd_joined:RDD[((T,W), (U,V))]

What's the most efficient way to achieve this? Here are a few ideas I've thought of.

Option 1:

val m = rdd1.collectAsMap
val rdd_joined = rdd2.map({case ((t,w), u) => ((t,w), u, m.get(t))})

Option 2:

val distinct_w = rdd2.map({case ((t,w), u) => w}).distinct
val rdd_joined = rdd1.cartesian(distinct_w).join(rdd2)

Option 1 will collect all of the data to the master, right? So that doesn't seem like a good option if rdd1 is large (it's relatively large in my case, although an order of magnitude smaller than rdd2). Option 2 does an ugly distinct and cartesian product, which also seems very inefficient. Another possibility that crossed my mind (but that I haven't tried yet) is to do option 1 and broadcast the map, although it would be better to broadcast in a "smart" way so that the keys of the map are co-located with the keys of rdd2.

Has anyone come across this sort of situation before? I'd be happy to have your thoughts.

Thanks!

Recommended answer

One option is to perform a broadcast join by collecting rdd1 to the driver and broadcasting it to all mappers; done correctly, this will let us avoid an expensive shuffle of the large rdd2 RDD:

val rdd1 = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))
val rdd2 = sc.parallelize(Seq(((1, "Z"), 111), ((1, "ZZ"), 111), ((2, "Y"), 222), ((3, "X"), 333)))

val rdd1Broadcast = sc.broadcast(rdd1.collectAsMap())
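val joined = rdd2.mapPartitions({ iter =>
  // rdd1 as a local map, read once per partition
  val m = rdd1Broadcast.value
  for {
    ((t, w), v) <- iter
    if m.contains(t) // inner-join semantics: drop keys that are missing from rdd1
  } yield ((t, w), (v, m(t)))
}, preservesPartitioning = true)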

The preservesPartitioning = true tells Spark that this map function doesn't modify the keys of rdd2; this will allow Spark to avoid re-partitioning rdd2 for any subsequent operations that join based on the (t, w) key.
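
To make the effect of preservesPartitioning concrete, here is a minimal sketch that can be pasted into a Scala REPL with Spark on the classpath. It assumes rdd2 was hash-partitioned on its (t, w) key beforehand; the partition count of 4, the local[2] master, and the names lookup, kept and remapped are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a context already exists.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("preserves-partitioning-sketch")
  .getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))

// Assumption: rdd2 already carries a partitioner on its (t, w) key.
val rdd2 = sc.parallelize(Seq(((1, "Z"), 111), ((1, "ZZ"), 111), ((2, "Y"), 222), ((3, "X"), 333)))
  .partitionBy(new HashPartitioner(4))

val lookup = sc.broadcast(rdd1.collectAsMap())

// Keys are left untouched, so it is safe to keep rdd2's partitioner.
val kept = rdd2.mapPartitions({ iter =>
  val m = lookup.value
  iter.flatMap { case ((t, w), v) => m.get(t).map(u => ((t, w), (v, u))) }
}, preservesPartitioning = true)

// A plain map gives Spark no such guarantee, so the partitioner is dropped.
val remapped = rdd2.map(identity)

println(kept.partitioner == rdd2.partitioner) // the partitioner is carried over
println(remapped.partitioner.isEmpty)         // no partitioner after a plain map
```

Because kept still reports rdd2's partitioner, a later reduceByKey or join on the (t, w) key with that same partitioner should not need to shuffle it again.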

This broadcast could be inefficient since it involves a communications bottleneck at the driver. In principle, it's possible to broadcast one RDD to another without involving the driver; I have a prototype of this that I'd like to generalize and add to Spark.

Another option is to re-map the keys of rdd2 and use the Spark join method; this will involve a full shuffle of rdd2 (and possibly rdd1):

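rdd1.join(rdd2.map {
  // re-key rdd2 by t alone so that it lines up with rdd1's key
  case ((t, w), v) => (t, (w, v))
}).map {
  // u comes from rdd1, (w, v) from rdd2; restore the (t, w) key
  case (t, (u, (w, v))) => ((t, w), (v, u))
}.collect()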

On my sample input, both of these methods produce the same result:

res1: Array[((Int, java.lang.String), (Int, java.lang.String))] = Array(((1,Z),(111,A)), ((1,ZZ),(111,A)), ((2,Y),(222,B)), ((3,X),(333,C)))

A third option would be to restructure rdd2 so that t is its key, then perform the above join.
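
As a rough sketch of this third option, assuming rdd2 can be re-keyed by t once and reused across several joins, both sides can be given the same partitioner so that the join itself adds no further shuffle. The partitioner byT, its partition count of 4, and the name rdd2ByT are hypothetical choices for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell a context already exists.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("copartitioned-join-sketch")
  .getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))
val rdd2 = sc.parallelize(Seq(((1, "Z"), 111), ((1, "ZZ"), 111), ((2, "Y"), 222), ((3, "X"), 333)))

// Hypothetical partition count; tune it to the real data size.
val byT = new HashPartitioner(4)

// Re-key rdd2 by t once, partition it, and cache it for reuse.
val rdd2ByT = rdd2
  .map { case ((t, w), v) => (t, (w, v)) }
  .partitionBy(byT)
  .cache()

// Both inputs share byT, so the join can run without another shuffle.
val joined = rdd1
  .partitionBy(byT)
  .join(rdd2ByT, byT)
  .map { case (t, (u, (w, v))) => ((t, w), (u, v)) }

joined.collect().foreach(println)
```

Here the values come out as (u, v), matching the RDD[((T,W), (U,V))] shape asked for in the question. Note that the final map drops the partitioner, since the key changes from t to (t, w).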
