Hive - Bucketing and Partitioning

Question

What should be the basis for deciding whether to use partitioning or bucketing on a set of columns in Hive?

Suppose we have a huge data set with two columns that are queried most often. My obvious choice might be to partition on these two columns. But if that would produce a huge number of small files spread across a huge number of directories, then partitioning on these columns would be the wrong decision, and bucketing might be the better option.

Can we define a methodology for deciding whether to go for bucketing or partitioning?

Answer

Bucketing and partitioning are not mutually exclusive; you can use both.

My short answer, from fairly long experience with Hive, is "you should ALWAYS use partitioning, and sometimes you may want to bucket too".

If you have a big table, partitioning helps reduce the amount of data each query reads. A partition is usually represented as a directory on HDFS. A common choice is to partition by year/month/day, since most people query by date. The one caveat is that you should not partition on columns with high cardinality. Cardinality is a fundamental concept in big data: it is the number of distinct values a column can take. 'US state', for instance, has low cardinality (around 50), while 'ip_number' has high cardinality (2^32 possible values). If you partition on a high-cardinality field, Hive will create a very large number of directories in HDFS, which is bad because it puts extra memory load on the NameNode.
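
To make the cardinality argument concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function name `partition_estimate`, the 1 TiB table size, and the three-year date range are all made-up assumptions, and it assumes that every combination of partition values occurs and that data is spread evenly, which real tables rarely satisfy.

```python
from math import prod

def partition_estimate(total_bytes, cardinalities):
    """Return (leaf_partition_count, avg_bytes_per_partition).

    Hypothetical helper: assumes every combination of partition values
    occurs and that data is spread evenly across partitions.
    """
    partitions = prod(cardinalities)
    return partitions, total_bytes / partitions

TIB = 1024 ** 4  # assumed table size: 1 TiB

# Partitioning by day over three years: 1,095 partitions, just under 1 GiB each.
by_day = partition_estimate(TIB, [3 * 365])

# Partitioning by a 32-bit IP address: 2**32 partitions of 256 bytes each.
by_ip = partition_estimate(TIB, [2 ** 32])

# Partitioning by US state and day multiplies the cardinalities: 54,750 partitions.
by_state_and_day = partition_estimate(TIB, [50, 3 * 365])
```

Under these assumptions, the date scheme leaves comfortably sized partitions, while the IP scheme produces billions of tiny directories, which is exactly the small-files situation the question worries about.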

Bucketing can be useful, but you have to be disciplined when inserting data into a bucketed table. Hive won't check that the data you're inserting is bucketed the way it's supposed to be. Loading a bucketed table requires a CLUSTER BY, which may add an extra step to your processing. But if you do lots of joins, they can be greatly sped up when both tables are bucketed the same way (on the same field and with the same number of buckets). Also, once you have chosen the number of buckets, you can't easily change it.
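
The reason matching buckets help joins can be sketched in a few lines of Python. This is a toy model, not Hive's implementation: `bucket_of` uses a plain modulo on non-negative integer keys as a stand-in for Hive's hash function, and `bucketize`, `bucketed_join`, and the sample rows are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def bucket_of(key, num_buckets):
    """Toy bucket function for non-negative integer keys.

    Assumption: a plain modulo stands in for Hive's real hash function.
    """
    return key % num_buckets

def bucketize(rows, num_buckets):
    """Group rows into buckets by their first column (the join key)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets

def bucketed_join(left, right, num_buckets):
    """Inner join on column 0, only comparing rows that share a bucket number."""
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build a lookup for this bucket only; other buckets cannot match.
        lookup = defaultdict(list)
        for row in right_buckets.get(b, []):
            lookup[row[0]].append(row)
        for row in left_buckets.get(b, []):
            for match in lookup.get(row[0], []):
                joined.append(row + match[1:])
    return joined

users = [(1, "ann"), (2, "bob"), (5, "eve")]
orders = [(1, "book"), (5, "lamp"), (5, "desk"), (7, "pen")]
result = bucketed_join(users, orders, num_buckets=4)
```

Because both sides use the same key and the same bucket count, a given key always lands in the same bucket number on each side, so each bucket pair can be joined independently. The same model hints at why the bucket count is hard to change later: in this sketch key 5 sits in bucket 1 with 4 buckets but in bucket 5 with 8, so every row would have to be redistributed.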
