Spark: what's the best strategy for joining a 2-tuple-key RDD with a single-key RDD?

Published: 2021/11/12 5:31:41 | Tags: scala, apache-spark

Question

I have two RDDs that I want to join, and they look like this:

val rdd1:RDD[(T,U)]
val rdd2:RDD[((T,W), V)]

It happens that the key values of rdd1 are unique, and that the tuple-key values of rdd2 are also unique. I'd like to join the two data sets so that I get the following RDD:

val rdd_joined:RDD[((T,W), (U,V))]

What's the most efficient way to achieve this? Here are a few ideas I've thought of.

Option 1:

val m = rdd1.collectAsMap
val rdd_joined = rdd2.map({case ((t,w), u) => ((t,w), u, m.get(t))})

Option 2:

val distinct_w = rdd2.map({case ((t,w), u) => w}).distinct
val rdd_joined = rdd1.cartesian(distinct_w).join(rdd2)

Option 1 will collect all of rdd1 to the driver, right? So that doesn't seem like a good option if rdd1 is large (it's relatively large in my case, although an order of magnitude smaller than rdd2). Option 2 does an ugly distinct and a cartesian product, which also seems very inefficient. Another possibility that crossed my mind (but that I haven't tried yet) is to do option 1 and broadcast the map, although it would be better to broadcast in a "smart" way so that the keys of the map are co-located with the keys of rdd2.

Has anyone come across this sort of situation before? I'd be happy to hear your thoughts.

Thanks!

Recommended answer

One option is to perform a broadcast join by collecting rdd1 to the driver and broadcasting it to all mappers; done correctly, this lets us avoid an expensive shuffle of the large rdd2:

val rdd1 = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))
val rdd2 = sc.parallelize(Seq(((1, "Z"), 111), ((1, "ZZ"), 111), ((2, "Y"), 222), ((3, "X"), 333)))

// Collect the smaller RDD to the driver and broadcast it as a read-only map.
val rdd1Broadcast = sc.broadcast(rdd1.collectAsMap())
val joined = rdd2.mapPartitions({ iter =>
  val m = rdd1Broadcast.value
  // Inner-join semantics: rows of rdd2 whose t is absent from rdd1 are dropped.
  for {
    ((t, w), v) <- iter
    if m.contains(t)
  } yield ((t, w), (v, m(t)))
}, preservesPartitioning = true)

Setting preservesPartitioning = true tells Spark that this map function doesn't modify the keys of rdd2. If rdd2 already has a partitioner, Spark keeps it on the result and can avoid re-partitioning for any subsequent operations that join on the (t, w) key.
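The effect of the flag can be observed through the `partitioner` field of the resulting RDD. The following is a minimal sketch rather than part of the original answer; it assumes a local master, a hypothetical partition count of 4, and that rdd2 was hash-partitioned on its (t, w) key beforehand. The names `lookup`, `kept`, and `dropped` are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the shell's context if one exists; the local master is an assumption.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("preserves-partitioning-sketch").setMaster("local[2]"))

val rdd1 = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))

// Assumption: rdd2 is already partitioned by its (t, w) key.
val rdd2 = sc.parallelize(
  Seq(((1, "Z"), 111), ((1, "ZZ"), 111), ((2, "Y"), 222), ((3, "X"), 333))
).partitionBy(new HashPartitioner(4))

val lookup = sc.broadcast(rdd1.collectAsMap())

// Keys are left untouched and we declare it, so the partitioner is carried over.
val kept = rdd2.mapPartitions({ iter =>
  val m = lookup.value
  iter.collect { case ((t, w), v) if m.contains(t) => ((t, w), (v, m(t))) }
}, preservesPartitioning = true)

// A plain map gives Spark no such guarantee, so the partitioner is discarded.
val dropped = rdd2.map { case ((t, w), v) => ((t, w), (v, lookup.value.get(t))) }

println(kept.partitioner.isDefined)    // the HashPartitioner survives
println(dropped.partitioner.isDefined) // no partitioner after map
```

If rdd2 has no partitioner to begin with, as with a freshly parallelized collection, the flag has nothing to preserve and makes no practical difference.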

This broadcast could be inefficient, since it creates a communication bottleneck at the driver. In principle, it's possible to broadcast one RDD to another without involving the driver; I have a prototype of this that I'd like to generalize and add to Spark.

Another option is to re-map the keys of rdd2 and use Spark's join method; this involves a full shuffle of rdd2 (and possibly rdd1):

// Re-key rdd2 by t alone, join on t, then restore the (t, w) key.
rdd1.join(rdd2.map {
  case ((t, w), v) => (t, (w, v))
}).map {
  case (t, (u, (w, v))) => ((t, w), (v, u))
}.collect()

On my sample input, both of these methods produce the same result:

res1: Array[((Int, java.lang.String), (Int, java.lang.String))] = Array(((1,Z),(111,A)), ((1,ZZ),(111,A)), ((2,Y),(222,B)), ((3,X),(333,C)))

A third option would be to restructure rdd2 so that t is its key, then perform the above join.
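One way this restructuring might look is sketched below. It is an illustration under stated assumptions, not the answerer's code: rdd2 is stored keyed by t from the start, both RDDs are given the same `HashPartitioner` (the partition count of 4 is hypothetical), and the names `rdd1ByT`, `rdd2ByT`, and `joinedByT` are invented for the example.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the shell's context if one exists; the local master is an assumption.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("copartitioned-join-sketch").setMaster("local[2]"))

// Hypothetical partition count; tune it for the real data size.
val partitioner = new HashPartitioner(4)

// Partition rdd1 by t once and cache it so the layout can be reused.
val rdd1ByT = sc.parallelize(Seq((1, "A"), (2, "B"), (3, "C")))
  .partitionBy(partitioner).cache()

// rdd2 restructured so that t is the key and (w, v) is the value.
val rdd2ByT = sc.parallelize(
  Seq((1, ("Z", 111)), (1, ("ZZ", 111)), (2, ("Y", 222)), (3, ("X", 333)))
).partitionBy(partitioner).cache()

// Both sides share a partitioner, so the join can reuse the existing layout.
val joinedByT = rdd1ByT.join(rdd2ByT)

// Restore the (t, w) key for downstream code that expects it.
val joined = joinedByT.map { case (t, (u, (w, v))) => ((t, w), (v, u)) }

// Sort on the driver only to make the printed order deterministic.
val result = joined.collect().sortBy(_._1).toList
result.foreach(println)
```

The up-front `partitionBy` calls still shuffle each RDD once, so this layout mainly pays off when the partitioned RDDs are cached and joined repeatedly.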
