Hive external table optimal partition size


Question

What is the optimal size for an external table partition? I am planning to partition the table by year/month/day, and we are getting about 2 GB of data daily.

Answer

Hive partition definitions are stored in the metastore, so too many partitions will take up a lot of space there.

Partitions are stored as directories in HDFS, so multiple partition keys produce a multi-level directory hierarchy, which makes scanning them slower.
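The difference shows up directly in the HDFS layout. The paths below are illustrative only (table location, key names, and dates are assumptions):

```
/warehouse/mytable/year=2014/month=01/day=15/   <- year/month/day keys: three directory levels to scan
/warehouse/mytable/dt=20140115/                 <- single combined yyyymmdd key: one directory level
```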

Your query will be executed as a MapReduce job, so there is no point in making partitions too small.

It depends on how your data will be queried. For your case I would prefer a single key defined as 'yyyymmdd': that gives 365 partitions per year, only one directory level under the table directory, and about 2 GB of data per partition, which is a good fit for a MapReduce job.
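The layout suggested above might be sketched in HiveQL roughly as follows. The table name, columns, file location, and sample date are assumptions for illustration, not part of the original answer:

```sql
-- Hypothetical external table with a single STRING partition key 'dt' ('yyyymmdd').
-- A STRING-typed key also avoids issues on Hive versions before 0.12.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  user_id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/warehouse/events';

-- Each day registers exactly one new partition (one directory, e.g.
-- /warehouse/events/dt=20140115) holding roughly 2 GB of data.
ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt = '20140115');
```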

For completeness of the answer: if you use Hive < 0.12, make your partition key string-typed, see here.

Useful blog here.

