BigQuery - Partitioning data according to some hash criteria

Views: 93 | Published: 2018/5/7 17:31:16 | Tags: hash, google-bigquery

Problem description

I have a table in BigQuery with a string column that holds a unique id (uid). I want to keep only a sample of this table by taking a fraction of the uids (say 1/100). My idea is to sample the data with something like this:
if(ABS(HASH(uid)) % 100 == 0) ...
The problem is that this filters at a 1/100 ratio only if the hash values are uniformly distributed. To check that, I would like to generate the following table:
(n goes from 0 to 99)
0  <number of rows in which ABS(HASH(uid)) % 100 == 0>
1  <number of rows in which ABS(HASH(uid)) % 100 == 1>
2  <number of rows in which ABS(HASH(uid)) % 100 == 2>
3  <number of rows in which ABS(HASH(uid)) % 100 == 3>
... etc.

If the numbers in each row are of the same magnitude, then my assumption is correct.

Any idea how to create such a query, or alternatively how to do the sampling another way?
Solution
Something like:
SELECT ABS(HASH(uid)) % 100 AS cluster, COUNT(*) AS cnt FROM yourtable GROUP EACH BY cluster
If the uid comes in different cases (upper, lower) or types, you can apply some string manipulation inside the hash. Something like:
SELECT ABS(HASH(UPPER(STRING(uid)))) % 100 AS cluster, COUNT(*) AS cnt FROM yourtable GROUP EACH BY cluster
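
To try the idea without running a query, the same bucket-and-count check can be sketched in Python. This is only an illustration under assumptions, not the answerer's code: SHA-256 from hashlib stands in for BigQuery's HASH(), so the bucket numbers will not match what BigQuery returns, and the names bucket_of, bucket_counts, sample and the "user-N" ids are made up for the example.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 100


def bucket_of(uid, num_buckets=NUM_BUCKETS):
    """Map a uid to a bucket in [0, num_buckets).

    Mirrors ABS(HASH(UPPER(STRING(uid)))) % 100 in spirit only: SHA-256 is a
    stand-in (assumption), so buckets will differ from BigQuery's HASH().
    """
    # Normalize type and case so "abc" and "ABC" land in the same bucket.
    normalized = str(uid).upper()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce modulo.
    return int.from_bytes(digest[:8], "big") % num_buckets


def bucket_counts(uids, num_buckets=NUM_BUCKETS):
    """Count rows per bucket, like SELECT bucket, COUNT(*) ... GROUP BY bucket."""
    counts = Counter(bucket_of(uid, num_buckets) for uid in uids)
    return [counts.get(b, 0) for b in range(num_buckets)]


def sample(uids, num_buckets=NUM_BUCKETS, keep_bucket=0):
    """Keep only the uids that fall in one bucket (about 1/num_buckets)."""
    return [uid for uid in uids if bucket_of(uid, num_buckets) == keep_bucket]


if __name__ == "__main__":
    uids = [f"user-{i}" for i in range(100_000)]  # hypothetical ids
    counts = bucket_counts(uids)
    print("expected per bucket:", len(uids) / NUM_BUCKETS)
    print("min / max per bucket:", min(counts), max(counts))
    print("sample size:", len(sample(uids)))
```

If the minimum and maximum counts sit close to the expected value, the buckets are of the same magnitude and the 1/100 filter should keep roughly 1% of the rows.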
