Apache Spark Handling Skewed Data

Problem description
I have two tables I would like to join together. One of them has a very bad skew of data. This is causing my spark job to not run in parallel as a majority of the work is done on one partition.
I have heard about, read about, and tried to implement salting my keys to increase the distribution. https://www.youtube.com/watch?v=WyfHUNnMutg at 12:45 is exactly what I would like to do.
Any help or tips would be appreciated. Thanks!
Yes, you should salt the keys on the larger table (via randomization) and then replicate the smaller one / cartesian-join it against the new salted keys:
Here are a couple of suggestions:
Tresata skew join RDD https://github.com/tresata/spark-skewjoin
python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
The tresata library looks like this:
import com.tresata.spark.skewjoin.Dsl._ // for the implicits
// skewJoin() method pulled in by the implicits
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2),
  DefaultSkewReplication(1)).sortByKey(true).collect.toList
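The salting trick itself is simple enough to sketch without any library. Below is a minimal illustration in plain Python over in-memory lists (not Spark; `salted_join`, `num_salts`, and the sample data are hypothetical names for the sketch): the skewed side gets a random salt appended to each key so its rows spread across `num_salts` buckets, while the small side is replicated once per salt value so every salted key still finds its match.

```python
import random

def salted_join(large, small, num_salts=4, seed=0):
    """Join two lists of (key, value) pairs by salting the skewed side.

    `large` is the skewed side: each of its keys gets a random salt in
    [0, num_salts), so a hot key fans out over num_salts buckets.
    `small` is replicated once per salt so every salted key can match.
    """
    rng = random.Random(seed)
    # Salt the large (skewed) side: key -> (key, salt)
    salted_large = [((k, rng.randrange(num_salts)), v) for k, v in large]
    # Replicate the small side across all salt values
    replicated_small = [((k, s), v) for k, v in small for s in range(num_salts)]
    # Ordinary hash join on the salted keys, then strip the salt
    lookup = {}
    for k, v in replicated_small:
        lookup.setdefault(k, []).append(v)
    return [(k, (lv, sv))
            for (k, salt), lv in salted_large
            for sv in lookup.get((k, salt), [])]

large = [("a", 1)] * 6 + [("b", 2)]   # key "a" is heavily skewed
small = [("a", 10), ("b", 20)]
print(sorted(salted_join(large, small)))
```

In Spark the same idea applies per partition: because the salted keys of the hot key hash to different partitions, the join work for that key is spread across `num_salts` tasks instead of landing on one.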