Random Sampling in Google BigQuery

Question

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset using:

SELECT word FROM
(SELECT rand() as random,word FROM [publicdata:samples.shakespeare] ORDER BY random)
LIMIT 10

My question is: are there any disadvantages to using this approach instead of the HASH() method defined in the "Advanced examples" section of the reference manual? https://developers.google.com/bigquery/query-reference

Recommended answer

For stratified sampling, see https://stackoverflow.com/a/52901452/132438

Good job finding it :). I requested the function recently, but it hasn't made it into the documentation yet.

I would say the advantage of RAND() is that the results vary from run to run, while HASH() keeps giving you the same results for the same values (this is not guaranteed over time, but you get the idea).

If you want the variability that RAND() brings while still getting consistent results, you can seed it with an integer, as in RAND(3).
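
The trade-off above can be illustrated outside BigQuery. The following is a minimal Python sketch, not BigQuery code: `hash_sample` and `rand_sample` are hypothetical helper names, SHA-256 merely stands in for a stable hash such as HASH(), and the word list is made-up stand-in data.

```python
import hashlib
import random


def hash_sample(words, buckets=10):
    """Keep words whose stable hash lands in bucket 0 (roughly 1 in `buckets`)."""
    kept = []
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        if int(digest, 16) % buckets == 0:
            kept.append(word)
    return kept


def rand_sample(words, fraction=0.1, seed=None):
    """Keep each word independently with probability `fraction`."""
    rng = random.Random(seed)  # seed=None draws fresh entropy on every call
    return [word for word in words if rng.random() < fraction]


words = [f"word{i}" for i in range(1000)]  # hypothetical stand-in data

# A hash depends only on the value, so the selection repeats exactly.
same_by_hash = hash_sample(words) == hash_sample(words)
# A fixed seed (3 here, echoing RAND(3)) makes the random selection repeatable.
same_by_seed = rand_sample(words, seed=3) == rand_sample(words, seed=3)
# Without a seed, two runs almost surely pick different words.
same_unseeded = rand_sample(words) == rand_sample(words)
print(same_by_hash, same_by_seed, same_unseeded)
```

The hash-based filter is tied to the values themselves, so it cannot produce a different sample on demand; the seeded generator can, simply by changing the seed.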

Note, though, that the example you pasted does a full sort on the random values. For sufficiently big inputs this approach won't scale.

A scalable approach to get around 10 random rows:

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656

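(where 10 is the approximate number of results I want to get, and 164656 is the number of rows in that table)

The standard SQL version: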

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656

Or even:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)
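
Why the filter returns "around" 10 rows rather than exactly 10 can be seen in a small simulation. This is a hedged Python sketch of the same idea, not BigQuery code: `bernoulli_sample` is a hypothetical helper name, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def bernoulli_sample(rows, target, rng):
    """Keep each row independently with probability target / len(rows)."""
    p = target / len(rows)
    return [row for row in rows if rng.random() < p]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = range(164656)    # row count quoted in the answer

# Repeat the filter several times and record how many rows survive each time.
sizes = [len(bernoulli_sample(rows, 10, rng)) for _ in range(30)]
mean_size = sum(sizes) / len(sizes)
print("min:", min(sizes), "mean:", mean_size, "max:", max(sizes))
```

Each row is kept or dropped independently, so the sample size follows a binomial distribution with mean 10 and a standard deviation of roughly 3. The filter needs only a single pass and no sort, which is what makes it scale; if exactly 10 rows are required, one option is to oversample slightly and apply LIMIT 10 to the result.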
