Apache Spark Handling Skewed Data


Problem Description

I have two tables I would like to join together. One of them has a very bad data skew. This causes my Spark job not to run in parallel, since the majority of the work is done on one partition.

I have heard about, read about, and tried to implement salting my keys to increase the distribution. https://www.youtube.com/watch?v=WyfHUNnMutg at 12:45 shows exactly what I would like to do.

Any help or tips would be appreciated. Thanks!

Solution

Yes, you should use salted keys on the larger table (via randomization) and then replicate the smaller one / cartesian-join it against the new salted keys.
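As an illustration, here is a minimal sketch of that manual salting approach. It assumes a spark-shell-style `sc` (SparkContext); the sample data, the `numSalts` factor, and all variable names are hypothetical:

import scala.util.Random

// hypothetical skewed input: key "a" dominates the large RDD
val large = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))
val small = sc.parallelize(Seq(("a", "x"), ("b", "y")))

val numSalts = 10  // replication factor; tune to the degree of skew

// salt the large side: append a random suffix to each key,
// spreading the hot key across up to numSalts partitions
val saltedLarge = large.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }

// replicate each row of the small side once per possible salt value,
// so every salted key on the large side finds its match
val saltedSmall = small.flatMap { case (k, w) => (0 until numSalts).map(s => ((k, s), w)) }

// join on the salted keys, then drop the salt to recover the original key
val joined = saltedLarge.join(saltedSmall).map { case ((k, _), (v, w)) => (k, (v, w)) }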

Here are a couple of suggestions:

Tresata skew join RDD: https://github.com/tresata/spark-skewjoin

Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Usage of the Tresata library looks like this:

import com.tresata.spark.skewjoin.Dsl._  // for the implicits

// the skewJoin() method is pulled in by the implicits
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2),
  DefaultSkewReplication(1)).sortByKey(true).collect.toList
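For context, a hypothetical setup for the snippet above (the names and sample data are illustrative, not from the library's docs; `sc` is again a spark-shell SparkContext):

// rdd1 is heavily skewed toward key "a"; rdd2 is the table joined against it
val rdd1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)))
val rdd2 = sc.parallelize(Seq(("a", 10), ("b", 20)))

The library handles the key replication internally, so, unlike the manual sketch earlier, no explicit salting step is needed.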
