Stratified random sampling with BigQuery?

Question

How can I do stratified sampling on BigQuery?

For example, we want a 10% proportionate stratified sample using category_id as the strata. Some of our tables have up to 11,000 distinct category_id values.

Recommended answer

With #standardSQL, let's define our table and some stats over it:

WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total 
  FROM (
    SELECT category, COUNT(*) c 
    FROM table
    GROUP BY 1 
    HAVING c>1000000)
)
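
The WITH clause above is not a complete statement by itself; each query below is meant to be appended to it. As a minimal sketch of that pattern, the following attaches a final SELECT that just lists the per-category counts. The share_pct column is an illustrative addition, not part of the original answer.

```sql
#standardSQL
WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  -- c = comments per category; total = sum of c over the categories kept
  SELECT *, SUM(c) OVER() total
  FROM (
    SELECT category, COUNT(*) c
    FROM table
    GROUP BY 1
    HAVING c>1000000)
)
-- share_pct is a hypothetical helper column added for illustration
SELECT category, c, total, ROUND(100*c/total, 2) share_pct
FROM table_stats
ORDER BY c DESC
```

Checking these numbers first makes it easier to judge whether the sample sizes reported by the later queries are plausible.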

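In this setup:

  • subreddit will be our category
  • we only want subreddits with more than 1,000,000 comments
  • c is the number of comments in each kept category, and total is the sum of c over those categories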

So, if we want about 1% of each category in our sample:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 1/100 
)
GROUP BY 2

Or let's say we want ~80,000 samples, chosen proportionally across all categories:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 80000/total
)
GROUP BY 2
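
Why this lands near 80,000: the filter keeps each joined row independently with the same probability, so the expected counts follow directly. The symbols below are notation introduced here, not from the original answer: c_k is the row count of category k (the c column), N is total, and S_k is the number of rows sampled from category k.

```latex
p = \frac{80000}{N}, \qquad
\mathbb{E}[S_k] = c_k \, p = 80000 \cdot \frac{c_k}{N}, \qquad
\sum_k \mathbb{E}[S_k] = \frac{80000}{N} \sum_k c_k = 80000
```

In BigQuery standard SQL the / operator returns FLOAT64 even for two integers, so 80000/total is a fraction rather than a truncated integer. Rows whose category was dropped by the HAVING filter never reach the RAND() test, because the inner JOIN on table_stats removes them.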

Now, if you want roughly the same number of samples from each group (let's say, 20,000):

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 20000/c
)
GROUP BY 2
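
The count per group is only approximately 20,000 because each row is kept by an independent random draw. As a rough guide, with S_k and c_k as notation introduced here (sampled rows and total rows of category k), the sample size is binomial:

```latex
S_k \sim \mathrm{Binomial}\!\left(c_k, \tfrac{20000}{c_k}\right), \qquad
\mathbb{E}[S_k] = 20000, \qquad
\sigma(S_k) = \sqrt{20000\left(1 - \tfrac{20000}{c_k}\right)} \le \sqrt{20000} \approx 141
```

This assumes c_k is at least 20,000, which the HAVING c>1000000 filter guarantees here. Under that assumption the per-group counts should typically fall within a few hundred of 20,000.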

If you want exactly 20,000 elements from each category:

SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
)
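
The query above only reports sample sizes. To return the sampled rows themselves, the per-category array can be flattened with UNNEST. A minimal sketch on toy data so that it is cheap to run: the inline table, its id and category columns, and the limit of 3 are all made up for illustration.

```sql
#standardSQL
WITH table AS (
  -- toy data: ids 1..1000 spread over categories 'a', 'b', 'c'
  SELECT id, CASE MOD(id, 3) WHEN 0 THEN 'a' WHEN 1 THEN 'b' ELSE 'c' END category
  FROM UNNEST(GENERATE_ARRAY(1, 1000)) id
)
SELECT sample.*
FROM (
  -- keep exactly 3 randomly chosen rows per category
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 3) cat_samples
  FROM table a
  GROUP BY category
), UNNEST(cat_samples) sample
```

Each toy category has more than 3 rows, so the result has 9 rows, 3 per category. Which ids are picked changes on every run because RAND() is not seeded.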

If you want exactly 2% of each group:

SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2

If this last approach is what you want, you might notice it failing when you actually want to get data out. An early LIMIT inside ARRAY_AGG, set to at least the largest per-group sample size (105000 here, which must be no smaller than 2% of the biggest group), makes sure we don't sort and keep more data than needed:

SELECT sample.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
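
As an alternative sketch (a swapped-in technique, not the answer's method), the same exact-fraction sample can be written with ROW_NUMBER() instead of building one array per group. Whether this copes better with very large groups is not something the source establishes, so treat it as an option to test. The toy table and its columns are made up for illustration.

```sql
#standardSQL
WITH table AS (
  -- toy data: 1000 rows in category 'a', 500 rows in category 'b'
  SELECT id, IF(id <= 1000, 'a', 'b') category
  FROM UNNEST(GENERATE_ARRAY(1, 1500)) id
), table_stats AS (
  SELECT category, COUNT(*) c
  FROM table
  GROUP BY 1
)
SELECT * EXCEPT(rn, c)
FROM (
  SELECT a.*, c,
    -- random rank of each row inside its category, starting at 1
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY RAND()) rn
  FROM table a
  JOIN table_stats b
  USING(category)
)
-- off<0.02*c on a 0-based offset keeps CEIL(0.02*c) rows; rn is 1-based
WHERE rn <= CEIL(0.02*c)
```

The threshold uses CEIL so that it keeps the same number of rows per group as the off<0.02*c condition on a zero-based offset.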
