Random sample groups instead of rows in SQL


Problem description

Let's say my table looks like this:

+------+------+------+
| ColA | ColB | ColC |
+------+------+------+
| AA   | C    |    5 |
| AA   | C    |    8 |
| AA   | C    |    9 |
| BB   | D    |    3 |
| BB   | D    |    4 |
| CC   | E    |    1 |
| CC   | E    |    2 |
| CC   | E    |    3 |
| CC   | E    |    5 |
+------+------+------+

I want to randomly sample groups from ColA instead of rows. Each group is a sequence of events, and I want all of the events for each sampled group, which is why I don't want to sample individual rows. I would like a sample like this:

+------+------+------+
| ColA | ColB | ColC |
+------+------+------+
| AA   | C    |    5 |
| AA   | C    |    8 |
| AA   | C    |    9 |
| CC   | E    |    1 |
| CC   | E    |    2 |
| CC   | E    |    3 |
| CC   | E    |    5 |
+------+------+------+

The query below obviously samples rows, so it doesn't do what I need:

-- Keeps each row independently with probability 1%, so groups get split up
SELECT * FROM t TABLESAMPLE BERNOULLI (1)

I also have a ColB with 10 groups, and I would like all of those groups to be equally represented in the sample with respect to their total number of rows. Thank you!

Recommended answer

There's no direct way of doing that in Presto / Trino. There are a couple of approaches you can take, though.

If you want deterministic sampling, you can convert the group key to a number by applying a uniform hash function and then select a percentage of the groups based on that number. For instance:

SELECT *
FROM t
WHERE bitwise_and(from_big_endian_64(xxhash64(cast(a as varbinary))), 1023) <= 0.5 * 1023

How it works:

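  • Compute a hash of the grouping column with xxhash64: xxhash64(cast(a as varbinary)). That produces a 64-bit binary value.
  • Convert the binary value to a number with from_big_endian_64 and pick, say, 10 bits out of it using bitwise_and(x, 1023). That produces a number between 0 and 1023.
  • Sample a fraction p of the groups by keeping those whose value is at most p * 1023. The query above uses p = 0.5, which keeps roughly half of the groups; to sample at about 1%, compare against 0.01 * 1023 instead. Because every row of a group has the same hash, a group is either kept in full or dropped in full.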
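To check how the groups land before committing to a threshold, it can help to look at the intermediate bucket value on its own. The following is a minimal sketch that reuses only the functions from the query above; the inline VALUES table and the column alias bucket are assumptions made so the example is self-contained, with the rows copied from the question.

```sql
-- Hypothetical inline copy of the question's table, so the query runs on its own.
WITH t (a, b, c) AS (
   VALUES
      ('AA', 'C', 5), ('AA', 'C', 8), ('AA', 'C', 9),
      ('BB', 'D', 3), ('BB', 'D', 4),
      ('CC', 'E', 1), ('CC', 'E', 2), ('CC', 'E', 3), ('CC', 'E', 5)
)
-- One row per group: the 10-bit bucket (0..1023) derived from the group key.
-- A group is kept by the sampling query when its bucket is <= p * 1023.
SELECT DISTINCT
   a,
   bitwise_and(from_big_endian_64(xxhash64(cast(a AS varbinary))), 1023) AS bucket
FROM t
ORDER BY bucket
```

With only a handful of groups the buckets will not be spread evenly, so the fraction of groups actually kept can differ noticeably from p; the approximation improves as the number of distinct groups grows.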

If you want non-deterministic random sampling of groups, you can first sample the groups and then select the rows matching those groups:

WITH
   groups AS (SELECT DISTINCT colA FROM t),
   sampled_groups AS (SELECT * FROM groups TABLESAMPLE BERNOULLI (1))
SELECT *
FROM t
WHERE colA IN (TABLE sampled_groups)
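BERNOULLI sampling keeps each group independently with the given probability, so the number of groups returned varies from run to run, and with only a few groups a 1% sample will usually be empty. If a fixed number of groups is wanted instead, one possible variation (not part of the answer above) is to shuffle the distinct groups with random() and take the first n. The sketch below assumes n = 2 and uses an inline VALUES table with the question's rows; the aliases g and s are illustrative.

```sql
WITH
   -- Hypothetical inline copy of the question's table.
   t (colA, colB, colC) AS (
      VALUES
         ('AA', 'C', 5), ('AA', 'C', 8), ('AA', 'C', 9),
         ('BB', 'D', 3), ('BB', 'D', 4),
         ('CC', 'E', 1), ('CC', 'E', 2), ('CC', 'E', 3), ('CC', 'E', 5)
   ),
   -- Shuffle the distinct group keys and keep exactly 2 of them.
   sampled_groups AS (
      SELECT g.colA
      FROM (SELECT DISTINCT colA FROM t) AS g
      ORDER BY random()
      LIMIT 2
   )
-- Return every row belonging to a sampled group.
SELECT t.*
FROM t
JOIN sampled_groups s ON t.colA = s.colA
ORDER BY t.colA, t.colC
```

Sorting by random() touches every distinct group, so this is a reasonable choice when the number of groups is modest; for very large group counts the hash-based or BERNOULLI approaches avoid that sort.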
