How to physically partition data to avoid shuffle in Spark SQL joins


Question

I have a requirement to join 5 medium-size tables (~80 GB each) with a big input dataset of ~800 GB. All data resides in Hive tables. I am using Spark SQL 1.6.1 to achieve this. The join takes 40 minutes to complete with --num-executors 20 --driver-memory 40g --executor-memory 65g --executor-cores 6. All joins are sort-merge outer joins. I am also seeing a lot of shuffling happening.

I bucketed all the tables in Hive into the same number of buckets, so that matching keys from all tables would go to the same Spark partitions when the data is first loaded. But Spark does not seem to understand the bucketing.

Is there any other way I can physically partition and sort the data in Hive (in terms of part files), so that Spark knows about the partitioning keys while loading the data from Hive itself, and performs the join within the same partitioning without shuffling data around? This would avoid additional repartitioning after loading the data from Hive.
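For context, the bucketing described above is declared in Hive roughly as follows. The table name, column names, and bucket count here are hypothetical, and as the answer below notes, Spark SQL 1.6.1 reads this metadata but does not exploit it when planning joins:

```sql
-- Hypothetical DDL: each of the 6 tables would use the same
-- bucketing column and the same bucket count.
CREATE TABLE input_data (key1 STRING, value STRING)
CLUSTERED BY (key1) SORTED BY (key1 ASC) INTO 256 BUCKETS
STORED AS ORC;
```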

Answer

First of all, Spark SQL 1.6.1 does not support Hive buckets yet. So in this case we are left with Spark-level operations to ensure that matching keys from all tables go to the same Spark partitions while loading the data. The Spark API provides repartition and sortWithinPartitions to achieve this, e.g.

val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))

In the same way, you can generate partitions for the remaining tables and join them on the key that was sorted within partitions.
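Extended to several tables, the pattern above might look like the following sketch. The sqlContext handle, the table names, and the join key "key1" are assumptions for illustration, not names from the original question:

```scala
import org.apache.spark.sql.DataFrame

// Assumes an existing HiveContext named sqlContext (standard in Spark 1.6.1)
// and Hive tables big_input, table1, table2 sharing a join key "key1"
// (all names hypothetical).
def partitionByKey(df: DataFrame): DataFrame =
  df.repartition(df("key1")).sortWithinPartitions(df("key1"))

val big = partitionByKey(sqlContext.table("big_input"))
val t1  = partitionByKey(sqlContext.table("table1"))
val t2  = partitionByKey(sqlContext.table("table2"))

// Both sides of each join are now hash-partitioned and sorted on key1,
// so the sort-merge join does not need to shuffle them again.
val joined = big
  .join(t1, Seq("key1"), "outer")
  .join(t2, Seq("key1"), "outer")
```

Note that repartition(col) hash-partitions into spark.sql.shuffle.partitions partitions (200 by default), so all tables end up with the same partitioning as long as that setting is not changed between calls.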

This makes the join a "shuffle-free" operation, but it comes with a significant computational cost. Caching the DataFrames (you can cache the newly created partitioned DataFrames) performs better if the operation will be executed multiple times. Hope this helps.
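As a sketch of the caching suggestion, reusing the hypothetical df1 and "key1" names from the earlier snippet:

```scala
// Cache the repartitioned, sorted DataFrame so the expensive
// repartition + sort is not recomputed every time it is joined.
val part1 = df1.repartition(df1("key1"))
  .sortWithinPartitions(df1("key1"))
  .cache()

part1.count() // force materialization once, before the joins reuse it
```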

