In spark join, does table order matter like in pig?


Question

Related: Spark - joining 2 PairRDD elements


When doing a regular join in pig, the last table in the join is not brought into memory but streamed through instead, so if A has small cardinality per key and B large cardinality, it is significantly better to do join A, B than join B, A, from a performance perspective (avoiding spill and OOM).


Is there a similar concept in spark? I didn't see any such recommendation, and wonder how it is possible? The implementation looks to me pretty much the same as in pig: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala


Or am I missing something?
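For reference, the situation being asked about would look something like this in Spark. This is only an illustrative sketch: the RDD names a and b, the keys, and the data sizes are made up to mirror the "small cardinality per key" vs. "large cardinality per key" setup from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinOrderQuestion {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-order").setMaster("local[*]"))

    // a: few values per key (small cardinality per key)
    val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
    // b: many values per key (large cardinality per key)
    val b = sc.parallelize((1 to 100000).map(i => ("k" + (i % 2 + 1), i)))

    // The question: is a.join(b) any different from b.join(a) in terms of
    // what gets materialized in memory, the way join order matters in Pig?
    val ab = a.join(b)
    val ba = b.join(a)
    println(ab.count() + " " + ba.count())

    sc.stop()
  }
}
```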

Answer


It does not make a difference; in Spark an RDD is only brought into memory if it is cached. So in Spark, to achieve the same effect, you can cache the smaller RDD. Another thing you can do in Spark, which I'm not sure Pig does, is that if all the RDDs being joined have the same partitioner, no shuffle needs to be done.
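A minimal sketch of both suggestions from the answer: cache the smaller RDD, and give both sides the same partitioner so the join itself needs no shuffle. The RDD names small and large, the sample data, and the partition count of 8 are assumptions for illustration, not from the original answer.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CachedCoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join").setMaster("local[*]"))

    val small = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
    val large = sc.parallelize((1 to 100000).map(i => ("k" + (i % 2 + 1), i)))

    val partitioner = new HashPartitioner(8)

    // Cache the smaller RDD so it is kept in memory once computed.
    val smallByKey = small.partitionBy(partitioner).cache()
    // Give the larger RDD the same partitioner; when both sides of the
    // join share a partitioner, the join needs no further shuffle.
    val largeByKey = large.partitionBy(partitioner)

    val joined = smallByKey.join(largeByKey)
    println(joined.count())

    sc.stop()
  }
}
```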
