RANK or ROW_NUMBER in BigQuery over a large dataset
Question
I need to add row numbers to a large (roughly a billion rows) dataset in BigQuery. When I try:
SELECT
  *,
  ROW_NUMBER() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
I get "Resources exceeded during query execution." because an analytic/window function needs to fit in one node.
How can I add row numbers to a large dataset in BigQuery?
Recommended answer
You didn't give me a working query, so I had to create my own, and you'll need to translate it to your own problem space. I'm also not sure why you want to give a row number to each row in such a huge dataset, but challenge accepted:
SELECT a.enc, plarf, plarf+COALESCE(INTEGER(sumc), (0)) row_num
FROM (
  SELECT STRING(year)+STRING(month)+STRING(mother_age)+state enc,
    ROW_NUMBER() OVER (PARTITION BY year ORDER BY enc) plarf,
    year
  FROM [publicdata:samples.natality] ) a
LEFT JOIN (
  SELECT COUNT(*) c, year+1 year, SUM(c) OVER(ORDER BY year) sumc
  FROM [publicdata:samples.natality]
  GROUP BY year
) b
ON a.year=b.year
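- I wanted to do a ROW_NUMBER() OVER(), but I couldn't, because there are too many elements.
- Using OVER(PARTITION BY ...) fixed that, but now the numbering in each partition starts at 1.
- That's OK, though. In another subquery I count how many elements are in each partition and keep a running total (sumc).
- The surrounding query takes each row's number within its partition (plarf) and adds the running total of all earlier partitions to it.
- Ta-da.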
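The steps above can be sketched outside BigQuery to make the arithmetic easier to follow. This is a minimal plain-Python illustration, not BigQuery code. The function name `global_row_numbers` and the sample rows are hypothetical. It computes an exclusive running total per partition directly, where the query above gets the same effect by relabelling each year's cumulative count as `year+1`, which relies on the years being consecutive.

```python
from collections import Counter
from itertools import accumulate


def global_row_numbers(rows, partition_key, sort_key):
    """Return (row, local_number, global_number) tuples.

    Rows are numbered inside each partition first, then shifted by the
    number of rows in all earlier partitions.
    """
    # Count rows per partition (the COUNT(*) ... GROUP BY subquery).
    counts = Counter(partition_key(r) for r in rows)

    # Exclusive running total: rows that belong to earlier partitions.
    partitions = sorted(counts)
    running = [0] + list(accumulate(counts[p] for p in partitions))
    offsets = dict(zip(partitions, running))  # zip drops the final grand total

    numbered = []
    for p in partitions:
        # Local numbering, like ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...).
        members = sorted((r for r in rows if partition_key(r) == p), key=sort_key)
        for local, row in enumerate(members, start=1):
            # Global number = local number + rows in earlier partitions.
            numbered.append((row, local, offsets[p] + local))
    return numbered


# Hypothetical sample data: (year, enc) pairs.
rows = [(1970, "b"), (1969, "a"), (1970, "a"), (1971, "c"), (1969, "z")]
result = global_row_numbers(rows, lambda r: r[0], lambda r: r[1])
for row, local, global_number in result:
    print(row, local, global_number)
```

The point of the design is that no single sort ever has to cover the whole dataset: each partition is ordered on its own, and only the small per-partition count table needs a global ordering.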