Spark RDD groupByKey + join vs join performance


Problem description

I am using Spark on a cluster that I share with other users, so it is not reliable to judge which version of my code is more efficient just from the running time: while I am running the more efficient code, someone else may be running huge jobs that make my code take longer to execute.

So can I ask two questions here:

  1. I was using the join function to join 2 RDDs, and I tried calling groupByKey() before the join, like this:

rdd1.groupByKey().join(rdd2)

It seems to take longer, but I remember that when I was using Hadoop Hive, a group by made my query run faster. Since Spark uses lazy evaluation, I am wondering whether groupByKey before join makes things faster.

  2. I have noticed that Spark has a SQL module; so far I really haven't had time to try it, but can I ask what the differences are between the SQL module and the RDD SQL-like functions?

Recommended answer

  1. There is no good reason for groupByKey followed by join to be faster than join alone. If rdd1 and rdd2 have no partitioner, or their partitioners differ, then the limiting factor is simply the shuffling required for HashPartitioning.

By using groupByKey you not only increase the total cost by keeping the mutable buffers required for grouping, but, more importantly, you use an additional transformation that results in a more complex DAG (the lineage comparison after the two snippets below illustrates this). groupByKey + join:

rdd1 = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])
rdd2 = sc.parallelize([("a", 5), ("c", 6), ("b", 7)])
rdd1.groupByKey().join(rdd2)

versus join alone:

rdd1.join(rdd2)
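
To see the extra stage for yourself, you can compare the lineage of both plans with RDD.toDebugString; a small sketch (the output format, and whether it comes back as bytes, varies across Spark versions):

# Print the lineage of each plan; the groupByKey variant shows an
# additional shuffle-backed stage compared to the plain join.
print(rdd1.groupByKey().join(rdd2).toDebugString().decode("utf-8"))
print(rdd1.join(rdd2).toDebugString().decode("utf-8"))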

Finally, these two plans are not even equivalent; to get the same results you have to add an additional flatMap to the first one, as sketched below.
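
For completeness, a minimal sketch (my own illustration, not code from the answer) of that extra flatMap, which flattens the grouped values so the first plan yields the same (key, (value1, value2)) pairs as the plain join:

# groupByKey().join() produces (key, (<grouped values>, v2)); flatten the
# grouped values so each one is paired with v2, matching plain join output.
rdd1.groupByKey().join(rdd2).flatMap(
    lambda kv: [(kv[0], (v, kv[1][1])) for v in kv[1][0]]
)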

  2. This is a quite broad question, but to highlight the main differences:

  • PairwiseRDDs are homogeneous collections of arbitrary Tuple2 elements. For the default operations you want the key to be hashable in a meaningful way; otherwise there are no strict requirements on the type. In contrast, DataFrames exhibit much more dynamic typing, but each column can only contain values from a supported set of defined types. It is possible to define UDTs, but they still have to be expressed in terms of the basic ones.

  • DataFrames use the Catalyst Optimizer, which generates logical and physical execution plans and can produce highly optimized queries without the need to apply manual low-level optimizations. RDD-based operations simply follow the dependency DAG. This means worse performance without custom optimization, but much better control over execution and some potential for fine-grained tuning (see the short sketch below).
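
As a rough illustration of the DataFrame side (a sketch only; the column names k, v1, v2 are my own, and it assumes a Spark 2.x style SparkSession), the same join expressed through the SQL module lets Catalyst plan and optimize the execution, which you can inspect with explain():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build DataFrames from the pair RDDs and join on the key column;
# explain() prints the plan produced by the Catalyst optimizer.
df1 = spark.createDataFrame(rdd1, ["k", "v1"])
df2 = spark.createDataFrame(rdd2, ["k", "v2"])
df1.join(df2, "k").explain()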

