Apache Spark Handling Skewed Data

Views: 26  Published: 2021/11/14 21:53:38  Tags: scala hadoop apache-spark spark-dataframe

Question

I have two tables that I would like to join. One of them is very badly skewed. As a result, my Spark job does not run in parallel, because most of the work is done on a single partition.

I have heard and read about salting my keys to spread the data more evenly, and I have tried to implement it. The video at https://www.youtube.com/watch?v=WyfHUNnMutg (at 12:45) shows exactly what I would like to do.

Any help or tips would be appreciated. Thanks!

Recommended answer

Yes, you should salt the keys of the larger table (via randomization) and then replicate the smaller table, or cartesian join it with the salt values, so that each of its rows can be joined to every salted version of its key.
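The salting approach described above can be sketched with the DataFrame API roughly as follows. This is a minimal sketch rather than a definitive implementation: the object name SaltedJoinSketch, the helper saltedJoin, the temporary column name "salt", and the parameter numSalts are all assumptions made for illustration, and the code assumes spark-sql is on the classpath and that neither table already has a column called "salt".

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, explode, lit, rand}

// Hypothetical helper object; the names here are illustrative, not from the original answer.
object SaltedJoinSketch {

  // Inner-joins `large` (skewed) with `small` on `key`, spreading each key over `numSalts` buckets.
  def saltedJoin(large: DataFrame, small: DataFrame, key: String, numSalts: Int): DataFrame = {
    // 1. Give every row of the large table a random salt in [0, numSalts).
    val saltedLarge = large.withColumn("salt", (rand() * numSalts).cast("int"))

    // 2. Replicate every row of the small table once per possible salt value.
    val allSalts = array((0 until numSalts).map(lit(_)): _*)
    val replicatedSmall = small.withColumn("salt", explode(allSalts))

    // 3. Join on (key, salt) so a hot key is split across up to numSalts partitions,
    //    then drop the helper column.
    saltedLarge.join(replicatedSmall, Seq(key, "salt")).drop("salt")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("salted-join-sketch").getOrCreate()
    import spark.implicits._

    // Toy data: the key "hot" dominates the large table.
    val large = (Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 2))).toDF("key", "value")
    val small = Seq(("hot", "a"), ("cold", "b")).toDF("key", "label")

    saltedJoin(large, small, "key", numSalts = 8).groupBy("key").count().show()
    spark.stop()
  }
}
```

The trade-off is that the smaller table grows by a factor of numSalts, so the salt count is usually kept just large enough to break up the hottest keys.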

Here are some suggestions:

Tresata skew join RDD: https://github.com/tresata/spark-skewjoin

Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

The Tresata library is used like this:

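import org.apache.spark.Partitioner.defaultPartitioner  // Spark's default partitioner helper
import com.tresata.spark.skewjoin.Dsl._  // implicits that add skewJoin() to pair RDDs

// rdd1 and rdd2 are pair RDDs with the same key type; DefaultSkewReplication comes from spark-skewjoin
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2), DefaultSkewReplication(1))
  .sortByKey(true)
  .collect
  .toList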
