BigQuery-复杂的相关查询 [英] BigQuery - Complex Correlated query

查看:99
本文介绍了BigQuery-复杂的相关查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试查询Google BigQuery公共Reddit数据集.我的目标是使用 Jaccards索引计算子索引的相似度,该索引的定义如下:

I am trying to query Google BigQuery public Reddit Dataset. My goal is to compute the similarity of subreddits using Jaccards' Index , which is defined by:

我的计划是在2016年8月按评论数选择前N个子reddit,然后计算其笛卡尔积,以subreddit1, subreddit2形状获得所有子reddit的组合.

My plan is to select the top N=1000 subreddits in terms of number of comments in August 2016. Then compute their cartesian product to get the combinations of all the subreddits in a subreddit1, subreddit2 shape.

然后使用这些组合的行查询subreddit1和subreddit2之间的用户并集以及交集.

Then use those rows of combinations to query the union of users between subreddit1 and subreddit 2 as well as the intersection.

到目前为止,我的查询是这样:

The query I have so far is this:

SELECT 
  subreddit1,
  subreddit2,
  (SELECT 
    COUNT(DISTINCT author)
  FROM `fh-bigquery.reddit_comments.2016_08`
  WHERE subreddit = subreddit1
    OR subreddit = subreddit2
  LIMIT 1
  ) as subreddits_union,

  (
    SELECT 
      COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
    AND author IN ( 
       SELECT author 
       FROM `fh-bigquery.reddit_comments.2016_08`
       WHERE subreddit= subreddit2
       GROUP BY author 
      ) as subreddits_intersection

FROM

(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2
 FROM  (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 1000
   ) a
 CROSS JOIN (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 1000
   ) b
 WHERE a.subreddit < b.subreddit
  )

理想情况下会给出结果:

Which ideally would give the results:

subreddit1, subreddit2, subreddits_union, subreddits_interception
-----------------------------------------------------------------
   Art     |  Politics |      50000      |      21000
   Art     |  Science  |      92320      |      15000
   ...     |  ...      |      ...        |      ...

但是,此查询给我以下BigQuery错误: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

However, this query gives me the following BigQuery error: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

我了解.但是,我认为此查询不能转换为有效的联接.鉴于BQ没有apply方法,是否有任何方法可以设置此查询而无需诉诸单个查询?也许是PARTITION BY吗?

Which I understand. However I don't think this query can be translated into an efficient join. Given that BQ does not have an apply method, is there any way this query could be set up without resorting to individual queries? Maybe with a PARTITION BY ?

推荐答案

感谢您的回答.此代码在返回subreddit联合中效果很好,但是,您将如何实现交集?

Thanks for your answer. This one works pretty well in returning the subreddit union , however, how would you implement the intersection ?

也许与

WITH top_most AS (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 20
),
authors AS (
  SELECT DISTINCT author, subreddit
  FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT 
count(DISTINCT a1.author),
subreddit1, subreddit2
FROM
(
  SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
  FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000
)
INNER JOIN authors a1 on a1.subreddit = subreddit1
INNER JOIN authors a2 on a2.subreddit = subreddit2
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2

这篇关于BigQuery-复杂的相关查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆