BigQuery - Partitioning data according to some hash criteria
Problem description
I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100). So my idea is to sample the data by doing something like this:
if(ABS(HASH(uid)) % 100 == 0) ...
The problem is that this keeps exactly 1/100 of the rows only if the hash values are uniformly distributed. So, in order to check that, I would like to generate the following table:
(n goes from 0 to 99)
0 <number of rows in which ABS(HASH(uid)) % 100 == 0>
1 <number of rows in which ABS(HASH(uid)) % 100 == 1>
2 <number of rows in which ABS(HASH(uid)) % 100 == 2>
3 <number of rows in which ABS(HASH(uid)) % 100 == 3>
... etc.
If I see the numbers in each row are of the same magnitude, then my assumption is correct.
Any idea how to create such a query, or alternatively do the sampling another way?
Something like this:
SELECT ABS(HASH(uid)) % 100 AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP EACH BY cluster
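To get a feel for what "the same magnitude" looks like in that result, here is a minimal local sketch of the same bucket count in Python. It is only an illustration under assumptions: SHA-256 stands in for BigQuery's HASH (a different function, so bucket assignments will not match), and the `user-<n>` ids and the `bucket` / `bucket_counts` helpers are made up for the example.

```python
import hashlib
from collections import Counter


def bucket(uid: str, buckets: int = 100) -> int:
    """Map a uid to a bucket in [0, buckets) with a stable hash.

    Assumption: SHA-256 is a stand-in here, not BigQuery's HASH.
    """
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then take the remainder.
    return int.from_bytes(digest[:8], "big") % buckets


def bucket_counts(uids, buckets: int = 100) -> Counter:
    """Count rows per bucket, like the GROUP BY query does."""
    return Counter(bucket(u, buckets) for u in uids)


# Hypothetical data: 100,000 synthetic uids.
uids = [f"user-{i}" for i in range(100_000)]
counts = bucket_counts(uids)

# With a uniform hash every bucket should hold roughly len(uids) / 100 rows.
expected = len(uids) / 100
worst = max(abs(counts[b] - expected) / expected for b in range(100))
print(f"largest relative deviation from {expected:.0f} rows: {worst:.1%}")
```

If the largest deviation is a few percent, the buckets are close enough to uniform that filtering on a single bucket gives roughly a 1/100 sample.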
If the uid values come in different cases (upper, lower) or types, you can normalize them with some string manipulation inside the hash, so that the same id always lands in the same bucket. Something like:
SELECT ABS(HASH(UPPER(STRING(uid)))) % 100 AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP EACH BY cluster
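The effect of normalizing before hashing can be sketched locally in Python. Again this is only an illustration: SHA-256 is an assumed stand-in for BigQuery's HASH, and `bucket` and `in_sample` are hypothetical helper names, not anything BigQuery provides.

```python
import hashlib


def bucket(uid, buckets: int = 100, normalize: bool = True) -> int:
    """Hash a uid into [0, buckets), optionally normalizing it first.

    Normalizing mirrors UPPER(STRING(uid)): cast to text, then upper-case.
    Assumption: SHA-256 is a stand-in here, not BigQuery's HASH.
    """
    text = str(uid)
    if normalize:
        text = text.upper()
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_sample(uid, buckets: int = 100) -> bool:
    """Keep a uid when it falls in bucket 0, i.e. about 1 in `buckets`."""
    return bucket(uid, buckets) == 0


# The same id written in different cases or types lands in one bucket.
print(bucket("abc123") == bucket("ABC123"))
print(bucket(42) == bucket("42"))

# Hypothetical data: sample roughly 1/100 of 100,000 synthetic uids.
sample = [u for u in (f"user-{i}" for i in range(100_000)) if in_sample(u)]
print(len(sample))
```

Because the bucket depends only on the normalized uid, all rows for a given uid are either kept or dropped together, which is usually what you want when sampling by user.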