Spark RDD groupByKey + join vs join performance
Problem description
I am using Spark on a cluster that I share with other users, so it is not reliable to tell which version of my code runs more efficiently based on running time alone: when I run the more efficient code, someone else may be running a huge job that makes my code take longer to execute.
So may I ask two questions here:
1. I was using the join function to join two RDDs, and I tried using groupByKey() before the join, like this: rdd1.groupByKey().join(rdd2). It seems to take longer; however, I remember that when I was using Hadoop Hive, a GROUP BY made my query run faster. Since Spark uses lazy evaluation, I am wondering whether groupByKey before join makes things faster.
2. I have noticed that Spark has a SQL module. So far I haven't had time to try it, but may I ask what the differences are between the SQL module and the SQL-like RDD functions?
There is no good reason for groupByKey followed by join to be faster than join alone. If rdd1 and rdd2 have no partitioner, or their partitioners differ, then the limiting factor is simply the shuffle required for HashPartitioning.
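To make the shuffle concrete, here is a pure-Python sketch of hash partitioning. Spark's HashPartitioner uses its own hash function, so the exact partition assignments below are illustrative only; the point is the routing idea.

```python
# Pure-Python sketch of hash partitioning: each record is routed to a
# partition by its key's hash, so equal keys always land together.
# (Spark's HashPartitioner uses its own hash function; this only
# illustrates the routing idea, not the exact partition numbers.)

def hash_partition(records, num_partitions):
    """Bucket (key, value) pairs into num_partitions lists by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

rdd1_data = [("a", 1), ("a", 3), ("b", 2)]
rdd2_data = [("a", 5), ("c", 6), ("b", 7)]

# Once both sides use the same partitioner, a join only has to match
# records within corresponding partitions; producing this layout is the
# shuffle that dominates the cost when partitioners are absent or differ.
p1 = hash_partition(rdd1_data, 2)
p2 = hash_partition(rdd2_data, 2)
for i in range(2):
    print("partition", i, "->", p1[i], "|", p2[i])
```

Because both datasets go through the same routing, records for key "a" in one dataset always end up in the partition matching key "a" in the other.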
By using groupByKey you not only increase the total cost by keeping the mutable buffers required for grouping but, more importantly, you use an additional transformation, which results in a more complex DAG.
groupByKey + join:

    rdd1 = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])
    rdd2 = sc.parallelize([("a", 5), ("c", 6), ("b", 7)])
    rdd1.groupByKey().join(rdd2)
vs. join alone:

    rdd1.join(rdd2)
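The difference in results is easy to see without a cluster. Below is a pure-Python sketch of both pipelines for the sample data; plain lists stand in for RDDs, and the helper names are illustrative, not Spark API.

```python
# Pure-Python sketch of the two pipelines (lists stand in for RDDs;
# the helper names are illustrative, not Spark API).

def join(left, right):
    """Inner join on key: (k, v1) x (k, v2) -> (k, (v1, v2))."""
    return [(k, (v1, v2)) for k, v1 in left for k2, v2 in right if k == k2]

def group_by_key(pairs):
    """(k, v) pairs -> (k, [all values for k])."""
    grouped = {}
    for k, v in pairs:
        grouped.setdefault(k, []).append(v)
    return list(grouped.items())

rdd1 = [("a", 1), ("a", 3), ("b", 2)]
rdd2 = [("a", 5), ("c", 6), ("b", 7)]

plain = join(rdd1, rdd2)
# -> [("a", (1, 5)), ("a", (3, 5)), ("b", (2, 7))]

grouped_then_joined = join(group_by_key(rdd1), rdd2)
# -> [("a", ([1, 3], 5)), ("b", ([2], 7))]
# The left-side values are now grouped, so the shape differs from plain join.

# An extra flatMap-style step "un-groups" the left values again:
flat = [(k, (v, w)) for k, (vs, w) in grouped_then_joined for v in vs]
assert sorted(flat) == sorted(plain)
```

This is exactly why the grouped variant needs the additional flatMap to produce the same output as the plain join.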
Finally, these two plans are not even equivalent, and to get the same results you have to add an additional flatMap to the first one.
This is quite a broad question, but to highlight the main differences:
- PairwiseRDDs are homogeneous collections of arbitrary Tuple2 elements. For the default operations you want the key to be hashable in a meaningful way; otherwise there are no strict requirements regarding the type. In contrast, DataFrames exhibit much more dynamic typing, but each column can only contain values from a supported set of defined types. It is possible to define a UDT, but it still has to be expressed using the basic ones.
- DataFrames use the Catalyst optimizer, which generates logical and physical execution plans and can produce highly optimized queries without the need to apply manual low-level optimizations. RDD-based operations simply follow the dependency DAG. This means worse performance without custom optimization but much better control over execution and some potential for fine-grained tuning.
Some other things to read:
- Difference between DataFrame and RDD in Spark
- Why spark.ml don't implement any of spark.mllib algorithms?