BigQuery - Partitioning data according to some hash criteria


Problem description

I have a table in BigQuery with a string column that holds a unique id (uid). I want to take only a sample of this table by keeping a portion of the uids (say 1/100). My idea is to sample the data with something like this:

if(ABS(HASH(uid)) % 100 == 0) ...

The problem is that this only filters at a 1/100 ratio if the hash values are uniformly distributed. To check that, I would like to generate the following table:

(n goes from 0 to 99)

0    <number of rows in which ABS(HASH(uid)) % 100 == 0>
1    <number of rows in which ABS(HASH(uid)) % 100 == 1>
2    <number of rows in which ABS(HASH(uid)) % 100 == 2>
3    <number of rows in which ABS(HASH(uid)) % 100 == 3>

... etc.

If the counts in all rows are of the same order of magnitude, then my assumption is correct.

Any idea how to write such a query, or how to do the sampling another way?

Solution

Something like:

SELECT ABS(HASH(uid)) % 100 AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP EACH BY cluster
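
To get a feel for what the result of such a query should look like, the same bucket count can be sketched locally in Python. This is only an illustration under assumptions: hashlib's MD5 stands in for BigQuery's HASH (the two produce different values), and the uids are made-up placeholders, not real data.

```python
import hashlib
from collections import Counter


def bucket(uid: str, buckets: int = 100) -> int:
    """Map a uid to a bucket in [0, buckets) using a stable hash.

    MD5 is only a stand-in here; it is not the function BigQuery's HASH uses.
    """
    digest = hashlib.md5(uid.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the remainder.
    return int.from_bytes(digest[:8], "big") % buckets


# Hypothetical uids, for illustration only.
uids = [f"user-{i}" for i in range(100_000)]

# Equivalent of: SELECT bucket, COUNT(*) ... GROUP BY bucket
counts = Counter(bucket(u) for u in uids)

for n in range(100):
    print(n, counts[n])

print("min:", min(counts.values()), "max:", max(counts.values()))
```

If the hash spreads the uids evenly, every bucket should hold roughly 1/100 of the rows, so the smallest and largest counts stay close to each other.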

If the uid comes in different cases (upper, lower) or types, you can apply some string manipulation inside the hash, for example:

SELECT ABS(HASH(UPPER(STRING(uid)))) % 100 AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP EACH BY cluster
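
The effect of normalizing before hashing, and the 1/100 sample itself, can be sketched the same way in Python. Again this rests on assumptions: MD5 stands in for BigQuery's HASH, and the helper names bucket and in_sample as well as the uids are hypothetical.

```python
import hashlib


def bucket(uid, buckets: int = 100) -> int:
    """Normalize the uid (cast to string, upper-case) and map it to a bucket."""
    normalized = str(uid).upper()
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_sample(uid) -> bool:
    """Keep a uid only if it falls into bucket 0, i.e. roughly 1 in 100."""
    return bucket(uid) == 0


# Different spellings or types of the same id land in the same bucket.
print(bucket("abc123") == bucket("ABC123"))
print(bucket(42) == bucket("42"))

# Hypothetical uids, for illustration only.
uids = [f"user-{i}" for i in range(100_000)]
sample = [u for u in uids if in_sample(u)]
print(len(sample), "of", len(uids), "uids kept")
```

Because the bucket depends only on the normalized uid, the same uid is always either in or out of the sample, no matter how often it appears in the table.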
