BigQuery-复杂的相关查询 [英] BigQuery - Complex Correlated query
问题描述
我正在尝试查询Google BigQuery公共Reddit数据集.我的目标是使用 Jaccards索引计算子索引的相似度,该索引的定义如下:
I am trying to query Google BigQuery public Reddit Dataset. My goal is to compute the similarity of subreddits using Jaccards' Index , which is defined by:
我的计划是在2016年8月按评论数选择前N个子reddit,然后计算其笛卡尔积,以subreddit1, subreddit2
形状获得所有子reddit的组合.
My plan is to select the top N=1000 subreddits in terms of number of comments in August 2016. Then compute their cartesian product to get the combinations of all the subreddits in a subreddit1, subreddit2
shape.
然后使用这些组合的行查询subreddit1和subreddit2之间的用户并集以及交集.
Then use those rows of combinations to query the union of users between subreddit1 and subreddit 2 as well as the intersection.
到目前为止,我的查询是这样:
The query I have so far is this:
SELECT
subreddit1,
subreddit2,
(SELECT
COUNT(DISTINCT author)
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit = subreddit1
OR subreddit = subreddit2
LIMIT 1
) as subreddits_union,
(
SELECT
COUNT(DISTINCT author)
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit = subreddit1
AND author IN (
SELECT author
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit= subreddit2
GROUP BY author
) as subreddits_intersection
FROM
(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2
FROM (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 1000
) a
CROSS JOIN (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 1000
) b
WHERE a.subreddit < b.subreddit
)
理想情况下会给出结果:
Which ideally would give the results:
subreddit1, subreddit2, subreddits_union, subreddits_interception
-----------------------------------------------------------------
Art | Politics | 50000 | 21000
Art | Science | 92320 | 15000
... | ... | ... | ...
但是,此查询给我以下BigQuery错误:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
However, this query gives me the following BigQuery error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
我了解.但是,我认为此查询不能转换为有效的联接.鉴于BQ没有apply方法,是否有任何方法可以设置此查询而无需诉诸单个查询?也许是PARTITION BY
吗?
Which I understand. However I don't think this query can be translated into an efficient join. Given that BQ does not have an apply method, is there any way this query could be set up without resorting to individual queries? Maybe with a PARTITION BY
?
推荐答案
感谢您的回答.此代码在返回subreddit联合中效果很好,但是,您将如何实现交集?
Thanks for your answer. This one works pretty well in returning the subreddit union , however, how would you implement the intersection ?
也许与
WITH top_most AS (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 20
),
authors AS (
SELECT DISTINCT author, subreddit
FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT
count(DISTINCT a1.author),
subreddit1, subreddit2
FROM
(
SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000
)
INNER JOIN authors a1 on a1.subreddit = subreddit1
INNER JOIN authors a2 on a2.subreddit = subreddit2
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2
这篇关于BigQuery-复杂的相关查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!