How to physically partition data to avoid shuffle in Spark SQL joins

Question

I have a requirement to join 5 medium-sized tables (~80 GB each) with a large input dataset (~800 GB). All data resides in Hive tables. I am using Spark SQL 1.6.1 to achieve this. The join takes 40 minutes to complete with --num-executors 20 --driver-memory 40g --executor-memory 65g --executor-cores 6. All joins are sort-merge outer joins, and I am also seeing a lot of shuffle happening.
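
For context, here is a minimal Spark 1.6 sketch of the kind of join being described; the table names (big_input, dim1 .. dim5) and the join key (key1) are hypothetical:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: the existing SparkContext

val big  = sqlContext.table("big_input")                               // ~800 GB input
val dims = Seq("dim1", "dim2", "dim3", "dim4", "dim5").map(sqlContext.table)  // ~80 GB each

// Five sort-merge outer joins; without any co-partitioning information,
// each join shuffles both of its inputs across the cluster.
val joined = dims.foldLeft(big)((acc, d) => acc.join(d, Seq("key1"), "left_outer"))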

I bucketed all the tables in Hive into the same number of buckets, so that matching keys from all tables would land in the same Spark partitions when the data is first loaded. But it seems Spark does not understand the bucketing.
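
For reference, the bucketing attempt looks roughly like the following DDL issued through the HiveContext (table and column names are hypothetical); Spark 1.6 stores this metadata in the metastore but does not exploit it when planning joins:

// Hypothetical bucketed-table DDL; Spark SQL 1.6 ignores the
// CLUSTERED BY metadata when deciding whether a shuffle is needed.
sqlContext.sql("""
  CREATE TABLE dim1_bucketed (key1 STRING, value STRING)
  CLUSTERED BY (key1) SORTED BY (key1) INTO 256 BUCKETS
""")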

Is there any other way I can physically partition and sort the data in Hive (controlling the number of part files) so that Spark knows about the partitioning keys while loading the data from Hive itself, and does the join within the same partitioning without shuffling data around? This would avoid the additional repartitioning after loading the data from Hive.

Answer

First of all, Spark SQL 1.6.1 does not support Hive buckets yet. So in this case we are left with Spark-level operations that ensure matching keys from all tables go to the same Spark partitions while loading the data. The Spark API provides repartition and sortWithinPartitions to achieve this, e.g.

val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))

In the same way you can produce repartitioned versions of the remaining tables, then join them on the key that was sorted within partitions, as sketched below.
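
A minimal end-to-end sketch of this approach, assuming two DataFrames df1 and df2 with a common join key key1 (both names are illustrative):

// Repartition and sort both sides on the join key. With the same
// spark.sql.shuffle.partitions setting, both DataFrames end up
// hash-partitioned identically on key1.
val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))
val part2 = df2.repartition(df2("key1")).sortWithinPartitions(df2("key1"))

// The sort-merge join can now reuse the existing partitioning and
// ordering instead of inserting another exchange.
val joined = part1.join(part2, Seq("key1"), "outer")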

This makes the join a "shuffle-free" operation, but it comes with a significant computational cost. Caching the DataFrames (cache the newly repartitioned versions) performs better if the join will be executed multiple times. Hope this helps.
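
For example, if part1 from the sketch above feeds several joins, caching it amortizes the one-time repartitioning cost (again an illustrative sketch):

// Persist the repartitioned DataFrame so later joins reuse it
// without recomputing the repartition and sort.
part1.cache()
part1.count()  // forces materialization before the joins run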
