RANK or ROW_NUMBER in BigQuery over a large dataset
Question
I need to add row numbers to a large (ca. billion rows) dataset in BigQuery. When I try:
SELECT
  *,
  ROW_NUMBER() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
I get "Resources exceeded during query execution.", because an analytic/window function needs to fit in one node.
How can I add row numbers to a large dataset in BigQuery?
Answer
You didn't give me a working query, so I had to create my own against a public dataset; you'll need to translate it to your own problem space. I'm also not sure why you want to give a row number to each row in such a huge dataset, but challenge accepted:
SELECT a.enc, plarf, plarf+COALESCE(INTEGER(sumc), (0)) row_num
FROM (
  SELECT STRING(year)+STRING(month)+STRING(mother_age)+state enc,
    ROW_NUMBER() OVER (PARTITION BY year ORDER BY enc) plarf,
    year
  FROM [publicdata:samples.natality]
) a
LEFT JOIN (
  SELECT COUNT(*) c, year+1 year, SUM(c) OVER(ORDER BY year) sumc
  FROM [publicdata:samples.natality]
  GROUP BY year
) b
ON a.year=b.year
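- I want to do a single ROW_NUMBER() OVER() across the whole table, but I can't, because there are too many rows to fit in one node.
- Adding PARTITION BY to the OVER() clause fixes this issue, but now the numbering restarts at 1 in each partition.
- But that's OK. In another subquery I count how many rows there are in each partition and keep a running total of those counts (sumc).
- The surrounding query takes each row's number within its partition (plarf) and adds the total number of rows in all earlier partitions. The year+1 in subquery b shifts each running total forward by one year, so every partition is joined to the total for the years before it; this assumes the years are consecutive. The first year has no match, hence the COALESCE to 0.
- Ta da.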
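The arithmetic behind those steps can be checked on a handful of rows outside BigQuery. The following is only a minimal Python sketch of the idea, not the query itself: the function name `global_row_numbers` and the sample rows are made up for illustration, and the in-memory sorting stands in for what BigQuery does per partition.

```python
from collections import Counter
from itertools import accumulate


def global_row_numbers(rows, partition_key, order_key):
    """Return (row, local_number, global_number) tuples.

    Hypothetical helper: mirrors the query's three steps on an in-memory list.
    """
    # Step 1: count the rows in each partition (COUNT(*) ... GROUP BY year).
    counts = Counter(partition_key(r) for r in rows)

    # Step 2: for each partition, the number of rows in all earlier partitions
    # (the running SUM(c) shifted forward by one partition; 0 for the first).
    partitions = sorted(counts)
    running = [0] + list(accumulate(counts[p] for p in partitions))
    offsets = dict(zip(partitions, running))

    # Step 3: number rows inside each partition, then add that partition's offset.
    numbered = []
    for p in partitions:
        local = sorted((r for r in rows if partition_key(r) == p), key=order_key)
        for i, r in enumerate(local, start=1):
            numbered.append((r, i, i + offsets[p]))
    return numbered


# Made-up sample data: three "year" partitions of sizes 2, 2 and 1.
rows = [
    {"year": 1970, "enc": "b"},
    {"year": 1969, "enc": "a"},
    {"year": 1970, "enc": "a"},
    {"year": 1971, "enc": "c"},
    {"year": 1969, "enc": "b"},
]

result = global_row_numbers(rows, lambda r: r["year"], lambda r: r["enc"])
for row, plarf, row_num in result:
    print(row["year"], row["enc"], plarf, row_num)
```

The local numbers restart at 1 for each year, while the global numbers run 1 through 5 without gaps. Unlike the year+1 join in the query, this sketch takes the offsets from the sorted list of partitions, so it does not depend on the partition keys being consecutive.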