使用BigQuery进行分层随机抽样? [英] Stratified random sampling with BigQuery?

查看:73
本文介绍了使用BigQuery进行分层随机抽样?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何在BigQuery上进行分层抽样?

How can I do stratified sampling on BigQuery?

例如,我们希望使用category_id作为层次的比例为10%的分层样本.我们的某些表中最多有11000个category_id.

For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.

推荐答案

使用#standardSQL,我们定义表和该表的一些统计信息:

With #standardSQL, let's define our table and some stats over it:

WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total 
  FROM (
    SELECT category, COUNT(*) c 
    FROM table
    GROUP BY 1 
    HAVING c>1000000)
)

在此设置中:

  • subreddit将是我们的类别
  • 我们只希望带有超过1000000条评论的subreddit
  • subreddit will be our category
  • we only want subreddits with more than 1000000 comments

因此,如果我们希望样本中每个类别的1%:

So, if we want 1% of each category in our sample:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 1/100 
)
GROUP BY 2

或者说我们要〜80,000个样本-但要在所有类别中按比例选择:

Or let's say we want ~80,000 samples - but chosen proportionally through all categories:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 80000/total
)
GROUP BY 2

现在,如果您想从每个组中获取相同数量的样本(例如20,000个):

Now, if you want to get the ~same number of samples from each group (let's say, 20,000):

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 20000/c
)
GROUP BY 2

如果您想从每个类别中准确获取20,000个元素,则:

If you want exactly 20,000 elements from each category:

SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
)

如果您希望每个组的确切比例为2%:

If you want exactly 2% of each group:

SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2

如果最后一种方法是您想要的,那么您可能会在实际想要获取数据时注意到它失败了.类似于最大的组大小的早期LIMIT可以确保我们不会对所需的数据进行更多排序:

If this last approach is what you want, you might notice it failing when you actually want to get data out. An early LIMIT similar to the largest group size will make sure we don't sort more data than needed:

SELECT sample.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c

这篇关于使用BigQuery进行分层随机抽样?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆