Spark behavior on Hive partitioned table


Problem description

I use Spark 2.

Actually, I am not the one executing the queries, so I cannot include the query plans. I have been asked this question by the data science team.

We have a Hive table partitioned into 2000 partitions and stored in Parquet format. When this table is used in Spark, exactly 2000 tasks are executed among the executors. But our block size is 256 MB, and we expected the number of partitions (total size / 256 MB) to be much smaller than 2000. Is there any internal logic by which Spark uses the physical structure of the data to create partitions? Any reference/help would be greatly appreciated.

UPDATE: It is the other way around. Our table is actually very large, around 3 TB, with 2000 partitions. 3 TB / 256 MB would actually come to roughly 11720, but we get exactly the same number of partitions as the table is physically partitioned into. I just want to understand how the tasks are generated based on data volume.
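For reference, a quick back-of-the-envelope check of that expectation (a sketch assuming decimal units; with binary units, 3 TiB / 256 MiB, the figure would be 12288 instead):

// rough check of the expected partition count, assuming decimal units
val totalBytes = 3.0e12                              // ~3 TB
val blockBytes = 256.0e6                             // 256 MB
println(math.ceil(totalBytes / blockBytes).toLong)   // 11719, i.e. roughly the 11720 above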

Answer

In general, Hive partitions are not mapped 1:1 to Spark partitions. One Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple Hive partitions.
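For example (a hedged illustration; yourtable, partition_col, and its value are placeholders), reading even a single Hive partition often yields more than one Spark partition:

// one Hive partition can still be read as several Spark partitions
val onePartition = spark.table("yourtable").where("partition_col = 'some_value'")
println(onePartition.rdd.getNumPartitions)  // often > 1 if the partition's files exceed the split size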

The number of Spark partitions when you load a Hive table depends on these parameters:

spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
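
Roughly, Spark derives a maximum split size from these settings and then packs file chunks into partitions of at most that size. A simplified sketch of that logic (following FilePartition.maxSplitBytes in the Spark source; the parallelism value here is illustrative):

// simplified sketch of how Spark picks the split size, not the exact source code
val maxPartitionBytes  = 128L * 1024 * 1024  // spark.sql.files.maxPartitionBytes
val openCostInBytes    = 4L * 1024 * 1024    // spark.sql.files.openCostInBytes
val defaultParallelism = 200L                // illustrative; taken from the cluster in reality
def maxSplitBytes(totalBytes: Long, fileCount: Long): Long = {
  // each file is padded with an "open cost" so many tiny files don't collapse into one huge task
  val bytesPerCore = (totalBytes + fileCount * openCostInBytes) / defaultParallelism
  math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
}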

You can check the partitions, e.g. using:

spark.table("yourtable").rdd.partitions

This will give you an Array of FilePartitions, which contain the physical paths of your files.
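A minimal sketch of inspecting them (FilePartition lives in an internal package, so this is version-dependent rather than a stable API):

import org.apache.spark.sql.execution.datasources.FilePartition
val parts = spark.table("yourtable").rdd.partitions
println(s"number of Spark partitions: ${parts.length}")
parts.take(3).foreach {
  case fp: FilePartition => fp.files.foreach(f => println(f.filePath))  // files packed into this task
  case other             => println(other)  // non-file sources expose different partition types
}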

Why you got exactly 2000 Spark partitions from your 2000 Hive partitions seems a coincidence to me; in my experience this is very unlikely to happen. Note that the situation in Spark 1.6 was different: there, the number of Spark partitions resembled the number of files on the filesystem (one Spark partition per file, unless the file was very large).
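One way to probe this is Dataset.inputFiles, a public API that returns a best-effort list of the underlying files:

// if this also prints 2000, each Hive partition likely holds a single file
// that is small enough not to be split, which would explain the 1:1 task count
val files = spark.table("yourtable").inputFiles
println(s"number of underlying files: ${files.length}")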
