Random sample groups instead of rows in SQL
Question
Let's say my table looks like this:
+------+------+------+
| ColA | ColB | ColC |
+------+------+------+
| AA   | C    | 5    |
| AA   | C    | 8    |
| AA   | C    | 9    |
| BB   | D    | 3    |
| BB   | D    | 4    |
| CC   | E    | 1    |
| CC   | E    | 2    |
| CC   | E    | 3    |
| CC   | E    | 5    |
+------+------+------+
I want to randomly sample groups from ColA instead of rows. Each group is a sequence of events, and I want all of the events for every sampled group, which is why I don't want to sample individual rows. I would like a sample like this:
+------+------+------+
| ColA | ColB | ColC |
+------+------+------+
| AA   | C    | 5    |
| AA   | C    | 8    |
| AA   | C    | 9    |
| CC   | E    | 1    |
| CC   | E    | 2    |
| CC   | E    | 3    |
| CC   | E    | 5    |
+------+------+------+
The query below obviously samples rows, so it doesn't do what I want:
SELECT * FROM table TABLESAMPLE BERNOULLI(1)
GROUP BY ColA
I also have a ColB with 10 groups, and I would like all of those groups to be equally represented relative to their total number of rows. Thank you!
Recommended answer
There's no direct way of doing that in Presto / Trino. There are a couple of approaches you can take, though.
If you want deterministic sampling, you can convert each group to a number by applying a uniform hash function and then keep a percentage of the groups based on that number. For instance:
-- Keep every row whose group hashes into roughly the lower half of the 1024 buckets
SELECT *
FROM t
WHERE bitwise_and(from_big_endian_64(xxhash64(cast(a AS varbinary))), 1023) <= 0.5 * 1023
How it works:
- Compute a hash of the grouping column with xxhash64: xxhash64(cast(a as varbinary)). That produces a 64-bit binary value.
- Convert the binary value to a number with from_big_endian_64 and pick, say, 10 bits out of it using bitwise_and(x, 1023). That produces a number between 0 and 1023.
- Keep the groups whose value is at most fraction * 1023. The query above uses 0.5, which keeps roughly half of the groups; to sample at about 1%, use 0.01 * 1023 instead.
Because the value depends only on the grouping column, all rows of a group are kept or dropped together, and the same groups are selected on every run.
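The hash-bucket idea can also be tried outside the database. The sketch below is a minimal Python illustration, not the Trino implementation: it assumes hashlib.sha256 as a stand-in for xxhash64 (which is not in the Python standard library), so the selected groups will differ from what the SQL query picks. The names bucket and sample_groups are hypothetical.

```python
import hashlib


def bucket(group_key: str, bits: int = 10) -> int:
    """Map a group key to a stable integer in [0, 2**bits - 1]."""
    # Assumption: sha256 stands in for xxhash64; any uniform hash illustrates the idea.
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # Read the first 8 bytes as a big-endian integer, then keep the low bits.
    value = int.from_bytes(digest[:8], "big")
    return value & ((1 << bits) - 1)


def sample_groups(rows, fraction, key=lambda row: row[0]):
    """Keep every row whose group bucket is at most fraction * 1023."""
    threshold = fraction * 1023
    return [row for row in rows if bucket(key(row)) <= threshold]


rows = [
    ("AA", "C", 5), ("AA", "C", 8), ("AA", "C", 9),
    ("BB", "D", 3), ("BB", "D", 4),
    ("CC", "E", 1), ("CC", "E", 2), ("CC", "E", 3), ("CC", "E", 5),
]

# Roughly half of the groups are expected to survive; each survivor keeps all its rows.
sampled = sample_groups(rows, 0.5)
print(sampled)
```

With only three groups the realized fraction can be far from 50%; the proportion only approaches the target when there are many groups.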
If you want non-deterministic random sampling of groups, you can first sample the groups and then select the rows matching those groups:
WITH
    -- One row per group
    groups AS (SELECT DISTINCT colA FROM t),
    -- Keep each group with a probability of 1%
    sampled_groups AS (SELECT * FROM groups TABLESAMPLE BERNOULLI(1))
SELECT *
FROM t
WHERE colA IN (TABLE sampled_groups)
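The same two-step logic (sample the distinct groups, then keep the rows belonging to them) can be sketched in Python to see how it behaves. This is an illustrative sketch only; the function name sample_whole_groups and the optional rng parameter are hypothetical and are included so that a run can be made reproducible.

```python
import random


def sample_whole_groups(rows, fraction, key=lambda row: row[0], rng=None):
    """Bernoulli-sample the distinct groups, then keep all rows of the chosen ones."""
    rng = rng or random.Random()
    # Equivalent of SELECT DISTINCT; sorted so a seeded rng gives repeatable results.
    groups = sorted({key(row) for row in rows})
    # Each group is kept independently with probability `fraction`.
    chosen = {g for g in groups if rng.random() < fraction}
    # Equivalent of WHERE colA IN (sampled groups).
    return [row for row in rows if key(row) in chosen]


rows = [
    ("AA", "C", 5), ("AA", "C", 8), ("AA", "C", 9),
    ("BB", "D", 3), ("BB", "D", 4),
    ("CC", "E", 1), ("CC", "E", 2), ("CC", "E", 3), ("CC", "E", 5),
]

# Without a seed, a different set of groups may be chosen on every call.
print(sample_whole_groups(rows, 0.5))
```

As with BERNOULLI sampling in SQL, the number of groups returned varies from run to run, and a small fraction applied to few groups can easily return nothing.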