Hive Buckets: understanding TABLESAMPLE(BUCKET X OUT OF Y)

Question

Hi, I am very new to Hive. I have gone through the buckets concept in Hadoop in Action, but failed to understand the lines below. Can anyone help me with this?

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);

The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y)

The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);

Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn’t matter.
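The quoted passage can be checked with a small row-level simulation. This is only a sketch under stated assumptions, not Hive's implementation: it assumes the hash of an integer userid is the userid itself, that a row lands in bucket (hash mod 32) + 1, and that TABLESAMPLE(BUCKET x OUT OF y) keeps rows where hash mod y equals x - 1. The helper names are hypothetical.

```python
from collections import Counter

NUM_BUCKETS = 32  # bucket count of the example table page_view


def bucket_of(userid: int) -> int:
    """1-based bucket number, ASSUMING hash(userid) == userid."""
    return userid % NUM_BUCKETS + 1


def in_sample(userid: int, x: int, y: int) -> bool:
    """Assumed row-level model of TABLESAMPLE(BUCKET x OUT OF y)."""
    return userid % y == x - 1


def sample_report(x: int, y: int, userids):
    """Return (fraction of rows sampled, rows sampled per bucket)."""
    picked = [u for u in userids if in_sample(u, x, y)]
    per_bucket = Counter(bucket_of(u) for u in picked)
    return len(picked) / len(userids), dict(sorted(per_bucket.items()))


userids = range(6400)  # synthetic data: 200 rows in each of the 32 buckets
frac16, buckets16 = sample_report(1, 16, userids)
frac64, buckets64 = sample_report(1, 64, userids)
print(frac16, buckets16)  # 0.0625 {1: 200, 17: 200}
print(frac64, buckets64)  # 0.015625 {1: 100}
```

With y = 16 the sample is 1/16 of the rows and comes entirely from buckets 1 and 17. With y = 64 it is 1/64 of the rows, which is half of the 200 rows in bucket 1.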

Recommended answer

Which part of it don't you understand?

When you create the table and bucket it into 32 buckets (as an example) using the CLUSTERED BY clause, Hive distributes your data across the 32 buckets using a deterministic hash function. Then, when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x-th bucket of each group. For example:

  • If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive divides your 32 buckets into groups of 8, giving 4 groups of 8 buckets, and picks the 6th bucket of each group. That selects buckets 6, 14, 22, and 30.

  • If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive divides your 32 buckets into groups of 32, giving a single group of 32 buckets, and picks the 23rd bucket as your result.

  • If you use TABLESAMPLE(BUCKET 3 OUT OF 64), Hive needs groups of 64 buckets but the table has only 32, so it treats the table as 1 group of 64 "half-buckets" and picks the half-bucket that belongs to the 3rd full bucket.
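The grouping rule from these three examples can be written as a short function. This is a sketch of the arithmetic the answer describes, not Hive's actual code. The function name buckets_read is hypothetical, and the rule that x may not exceed y is an assumption added for the sketch.

```python
from fractions import Fraction


def buckets_read(x: int, y: int, num_buckets: int = 32):
    """Model TABLESAMPLE(BUCKET x OUT OF y) on a table with num_buckets buckets.

    Returns (1-based bucket numbers, fraction of each listed bucket that is read).
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y == 0:
        # y is a factor of the bucket count: take the x-th bucket
        # from each group of y consecutive buckets, reading it in full.
        return list(range(x, num_buckets + 1, y)), Fraction(1)
    if y % num_buckets == 0:
        # y is a multiple of the bucket count: only a slice of one bucket
        # is read, e.g. half a bucket when y is twice the bucket count.
        bucket = (x - 1) % num_buckets + 1
        return [bucket], Fraction(num_buckets, y)
    raise ValueError("y must be a multiple or factor of the bucket count")


for x, y in [(6, 8), (23, 32), (3, 64), (1, 16)]:
    buckets, part = buckets_read(x, y)
    print(f"BUCKET {x} OUT OF {y}: buckets {buckets}, {part} of each")
```

The first three calls reproduce the examples in the answer, and the last one gives buckets 1 and 17, matching the passage quoted in the question.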
