Spark RDD groupByKey + join vs join performance


Problem Description


I am using Spark on a cluster which I share with other users, so it is not reliable to tell which version of my code runs more efficiently based on running time alone: when I run the more efficient code, someone else may be running a huge job that makes my code take longer to execute.

So I have 2 questions:

  1. I was using the join function to join 2 RDDs, and I tried to use groupByKey() before the join, like this:

    rdd1.groupByKey().join(rdd2)
    

    It seems that this took longer; however, I remember that when I was using Hadoop Hive, a GROUP BY made my query run faster. Since Spark uses lazy evaluation, I am wondering whether groupByKey before join makes things faster.

  2. I have noticed that Spark has a SQL module; so far I really haven't had time to try it, but can I ask what the differences are between the SQL module and the RDD SQL-like functions?

Solution

  1. There is no good reason for groupByKey followed by join to be faster than join alone. If rdd1 and rdd2 have no partitioner, or their partitioners differ, then the limiting factor is simply the shuffle required for HashPartitioning (see the sketch below).
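    A minimal sketch of that shuffle point (illustrative only, not from the original answer): if both pair RDDs are pre-partitioned with the same partitioner and cached, the subsequent join can reuse the existing layout instead of shuffling both sides.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Co-partition both sides with the same hash partitioning (4 partitions)
    # and cache them so the partitioned layout is kept in memory.
    left = sc.parallelize([("a", 1), ("a", 3), ("b", 2)]).partitionBy(4).cache()
    right = sc.parallelize([("a", 5), ("c", 6), ("b", 7)]).partitionBy(4).cache()

    # With matching partitioners, join can proceed without reshuffling either side.
    joined = left.join(right)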

    By using groupByKey you not only increase the total cost by keeping the mutable buffers required for grouping but, more importantly, you use an additional transformation, which results in a more complex DAG. groupByKey + join:

    rdd1 = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])
    rdd2 = sc.parallelize([("a", 5), ("c", 6), ("b", 7)])
    rdd1.groupByKey().join(rdd2)
    

    vs. join alone:

    rdd1.join(rdd2)
    

    Finally, these two plans are not even equivalent, and to get the same results you have to add an additional flatMap to the first one, as in the sketch below.
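    A sketch of that extra step (the lambda is my own illustration): groupByKey().join() yields (key, (grouped_left_values, right_value)) pairs, so the grouped values have to be flattened back out to match what the plain join produces.

    # Flatten (key, (iterable_of_left_values, right_value)) back into one
    # (key, (left_value, right_value)) record per grouped left value.
    equivalent = (rdd1.groupByKey()
                      .join(rdd2)
                      .flatMap(lambda kv: [(kv[0], (v, kv[1][1]))
                                           for v in kv[1][0]]))

    # equivalent.collect() now matches rdd1.join(rdd2).collect()
    # up to ordering: [("a", (1, 5)), ("a", (3, 5)), ("b", (2, 7))].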

  2. This is quite a broad question, but to highlight the main differences:

    • PairwiseRDDs are homogeneous collections of arbitrary Tuple2 elements. For the default operations you want the key to be hashable in a meaningful way; otherwise there are no strict requirements regarding the type. In contrast, DataFrames exhibit much more dynamic typing, but each column can only contain values from a supported set of defined types. It is possible to define a UDT, but it still has to be expressed using the basic ones.

    • DataFrames use the Catalyst optimizer, which generates logical and physical execution plans and can produce highly optimized queries without the need to apply manual low-level optimizations. RDD-based operations simply follow the dependency DAG. This means worse performance without custom optimization, but much better control over execution and some potential for fine-grained tuning (see the sketch after this list).
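    A minimal sketch of that contrast (assumes a local SparkSession; the names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sc = spark.sparkContext

    # Pair RDD: values may be arbitrary Python objects; no schema is enforced
    # and execution simply follows the dependency DAG as written.
    pairs = sc.parallelize([("a", {"n": 1}), ("b", [1, 2, 3])])

    # DataFrame: every column has a declared type from the supported set,
    # and the query below goes through the Catalyst optimizer.
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
    df.filter(df.value > 1).explain()  # prints the optimized physical plan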

