Random Sampling in Google BigQuery


Question

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset using:

SELECT word FROM
  (SELECT RAND() AS random, word
   FROM [publicdata:samples.shakespeare]
   ORDER BY random)
LIMIT 10

My question is: Are there any disadvantages to using this approach instead of the HASH() method defined in the "Advanced examples" section of the reference manual? https://developers.google.com/bigquery/query-reference

Solution

Good job finding it :). I requested the function recently, but it hasn't made it to documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).

If you want the variability that RAND() brings while still getting consistent results, you can seed it with an integer, as in RAND(3).
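To make the difference concrete, here is a minimal Python sketch that runs locally, not in BigQuery. The word list is a hypothetical stand-in for the table, and `hashlib.md5` and `random.Random` are only illustrative substitutes for HASH() and RAND(); nothing here reflects how BigQuery implements either function.

```python
import hashlib
import random

# Hypothetical stand-in for the rows of the table.
WORDS = [f"word{i}" for i in range(1000)]

def hash_sample(words, buckets=10):
    """Keep words whose hash lands in bucket 0: same input, same output."""
    def bucket(word):
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        return int(digest, 16) % buckets
    return [w for w in words if bucket(w) == 0]

def rand_sample(words, fraction=0.1, seed=None):
    """Keep each word with probability `fraction`; a seed makes it repeatable."""
    rng = random.Random(seed)
    return [w for w in words if rng.random() < fraction]

hash_a = hash_sample(WORDS)
hash_b = hash_sample(WORDS)            # identical to hash_a
seeded_a = rand_sample(WORDS, seed=3)
seeded_b = rand_sample(WORDS, seed=3)  # identical to seeded_a
unseeded_a = rand_sample(WORDS)
unseeded_b = rand_sample(WORDS)        # almost certainly differs from unseeded_a

print(hash_a == hash_b, seeded_a == seeded_b)
```

The hash-based sample depends only on the values themselves, the seeded sample depends on the seed and the row order, and the unseeded sample changes from run to run.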

Note, though, that the example you pasted does a full sort on the random values; for sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:

SELECT word
FROM [publicdata:samples.shakespeare]
WHERE RAND() < 10/164656

(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)
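The trade-off between the two queries can be sketched locally in Python. This is an illustration only, with a hypothetical list of row ids standing in for the table: `sample_by_sort` mirrors the ORDER BY on a random value followed by LIMIT 10, and `sample_by_filter` mirrors WHERE RAND() < 10/164656. The function names are made up for this example.

```python
import random

N_ROWS = 164656   # row count quoted in the answer
K = 10            # desired sample size

def sample_by_sort(rows, k, rng):
    """Tag every row with a random key, sort all rows, keep the first k."""
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed[:k]]

def sample_by_filter(rows, k, rng):
    """Keep each row independently with probability k / len(rows)."""
    threshold = k / len(rows)
    return [row for row in rows if rng.random() < threshold]

rows = range(N_ROWS)      # hypothetical row ids
rng = random.Random(3)    # seeded so repeated runs give the same samples

exact = sample_by_sort(rows, K, rng)
approx = sample_by_filter(rows, K, rng)

print(len(exact))    # always K
print(len(approx))   # close to K, but not guaranteed to equal it
```

The sort-based version returns exactly 10 rows but has to order all 164656 keys first. The filter makes a single pass with no sort, at the price of a row count that only averages 10 (with a standard deviation of roughly 3), which is why the answer calls the 10 an approximate number.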
