Hive Buckets: understanding TABLESAMPLE(BUCKET X OUT OF Y)

Question

Hi, I am very new to Hive. I have gone through the buckets concept in Hadoop in Action, but I failed to understand the lines below. Can anyone help me with this?

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);

The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y)

The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);

Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn’t matter.

Recommended answer

Which part don't you understand?

When you create the table and bucket it into 32 buckets (as an example) using the CLUSTERED BY clause, Hive assigns each row to one of the 32 buckets using a deterministic hash function. Then, when you use TABLESAMPLE(BUCKET x OUT OF y), Hive divides your buckets into groups of y buckets and picks the x'th bucket of each group. For example:

  • If you use TABLESAMPLE(BUCKET 6 OUT OF 8), Hive divides your 32 buckets into groups of 8, giving 4 groups of 8 buckets, and then picks the 6th bucket of each group, that is, buckets 6, 14, 22 and 30.

  • If you use TABLESAMPLE(BUCKET 23 OUT OF 32), Hive divides your 32 buckets into groups of 32, giving only 1 group of 32 buckets, and then picks the 23rd bucket as your result.

  • If you use TABLESAMPLE(BUCKET 3 OUT OF 64), Hive divides your 32 buckets into groups of 64. Since there are only 32 buckets, each bucket counts as two "half-buckets", giving 1 group of 64 half-buckets, and Hive picks the half-bucket that corresponds to the 3rd full bucket. The query therefore reads about half of the data in bucket 3.
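
If it helps, the grouping rule from the three examples above can be sketched in a few lines of Python. This is only a model of the description in this answer, not Hive's actual implementation: the function name sampled_buckets and its return shape are made up for illustration, buckets are numbered from 1 as in the examples, and y is assumed to be a factor or a multiple of the bucket count.

```python
from fractions import Fraction


def sampled_buckets(num_buckets: int, x: int, y: int):
    """Return (buckets, share): the 1-based bucket numbers that are read
    and the share of each listed bucket that is used.

    Hypothetical helper that models the explanation above, not Hive code.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if y <= num_buckets:
        if num_buckets % y != 0:
            raise ValueError("y must be a factor of the bucket count")
        # Groups of y buckets: take the x'th bucket of every group, whole.
        return list(range(x, num_buckets + 1, y)), Fraction(1)
    if y % num_buckets != 0:
        raise ValueError("y must be a multiple of the bucket count")
    # More slices than buckets: each bucket is split into y / num_buckets
    # parts, and slice x falls in bucket ((x - 1) mod num_buckets) + 1.
    bucket = (x - 1) % num_buckets + 1
    return [bucket], Fraction(num_buckets, y)


if __name__ == "__main__":
    # The cases discussed in the question and in this answer, for 32 buckets.
    for x, y in [(6, 8), (1, 16), (23, 32), (3, 64)]:
        buckets, share = sampled_buckets(32, x, y)
        print(f"BUCKET {x} OUT OF {y}: buckets {buckets}, {share} of each")
```

In every case the sample works out to roughly 1/y of the table: 4 of 32 buckets for y = 8, 2 of 32 for y = 16, 1 of 32 for y = 32, and half of one bucket for y = 64.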
