Hive Buckets: understanding TABLESAMPLE(BUCKET X OUT OF Y)

Views: 59 Published: 2022/1/14 8:12:38 Tags: hadoop mapreduce hive

Question

Hi, I am very new to Hive. I have gone through the buckets concept in Hadoop in Action, but I failed to understand the lines below. Can anyone help me with this?

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32);

The general syntax for TABLESAMPLE is TABLESAMPLE(BUCKET x OUT OF y).

The sample size for the query is around 1/y. In addition, y needs to be a multiple or factor of the number of buckets specified for the table at table creation time. For example, if we change y to 16, the query becomes:

SELECT avg(viewTime)
 FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 16);

Then the sample size includes approximately 1 out of every 16 users (as the bucket column is userid). The table still has 32 buckets, but Hive tries to satisfy this query by processing buckets 1 and 17 together. On the other hand, if y is specified to be 64, Hive will execute the query on half of the data in one bucket. The value of x is only used to select which bucket to use. Under truly random sampling its value shouldn't matter.
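To make the arithmetic in the quoted passage concrete, here is a minimal Python sketch of which table buckets would be read for a given x and y. It is only an illustrative model, not Hive's actual implementation: the function name `buckets_to_scan` and its parameters are made up for this example, and it assumes buckets are numbered from 1 as in the text above.

```python
from fractions import Fraction


def buckets_to_scan(x, y, num_buckets):
    """Model which table buckets TABLESAMPLE(BUCKET x OUT OF y) reads.

    Returns (bucket_numbers, fraction_of_each_bucket), buckets numbered from 1.
    Hypothetical helper for illustration only; Hive does not expose this.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if y <= num_buckets:
        if num_buckets % y != 0:
            raise ValueError("y must be a factor or multiple of num_buckets")
        # y is a factor of the bucket count: every y-th bucket,
        # starting at bucket x, is read in full.
        return list(range(x, num_buckets + 1, y)), Fraction(1)
    if y % num_buckets != 0:
        raise ValueError("y must be a factor or multiple of num_buckets")
    # y is a multiple of the bucket count: only part of one bucket is read.
    return [(x - 1) % num_buckets + 1], Fraction(num_buckets, y)


# The three cases discussed above, for a table with 32 buckets.
for y in (16, 32, 64):
    buckets, fraction = buckets_to_scan(1, y, 32)
    print(f"BUCKET 1 OUT OF {y}: buckets {buckets}, fraction of each read: {fraction}")
```

In every case the number of buckets read times the fraction of each, divided by 32, comes out to 1/y, which is where the "sample size is around 1/y" statement comes from. Changing x only shifts which buckets are picked, not how much data is sampled.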
