Row number in BigQuery?


Problem description

Is there any way to get a row number for each record in BigQuery? (I haven't seen anything about it in the specs.) There is an NTH() function, but that applies to repeated fields.

There are some scenarios where a row number is not necessary in BigQuery, such as when using TOP() or LIMIT. However, I need it to simulate some analytic functions, such as a cumulative SUM(). For that purpose I need to identify each record with a sequential number. Is there any workaround for this?

Thanks in advance for your help!

Leo

Solution

2018 update: If all you want is a unique id for each row:

#standardSQL
SELECT GENERATE_UUID() uuid
 , * 
FROM table

2018 #standardSQL solution:

SELECT
  ROW_NUMBER() OVER() row_number,
  contributor_username,
  count
FROM (
  SELECT contributor_username, COUNT(*) count
  FROM `publicdata.samples.wikipedia`
  GROUP BY contributor_username
  ORDER BY count DESC
  LIMIT 5)
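
One caveat with the query above: an empty OVER() does not specify an ordering, so the numbers are not guaranteed to follow the subquery's ORDER BY. A minimal sketch of a variant that puts the ordering inside the window itself follows. The aliases `edits` and `rn` are illustrative names chosen here, not part of the original answer.

```sql
#standardSQL
SELECT
  -- Number the rows by descending edit count, so that 1 is the top contributor.
  ROW_NUMBER() OVER(ORDER BY edits DESC) AS rn,
  contributor_username,
  edits
FROM (
  SELECT contributor_username, COUNT(*) AS edits
  FROM `publicdata.samples.wikipedia`
  GROUP BY contributor_username
  ORDER BY edits DESC
  LIMIT 5)
ORDER BY rn
```

Because the subquery returns only five rows, the ordered window is cheap here. Ordering a window over a whole large table can hit the same memory limit discussed next.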


But what about "Resources exceeded during query execution: The query could not be executed in the allotted memory. OVER() operator used too much memory.."

Ok, let's reproduce that error:

SELECT *, ROW_NUMBER() OVER() 
FROM `publicdata.samples.natality` 

Yes, that happens because an OVER() clause with no partitioning needs to fit all the data into one VM. You can solve this with PARTITION BY:

SELECT *, ROW_NUMBER() OVER(PARTITION BY year, month) rn 
FROM `publicdata.samples.natality` 
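
To see what partitioning does to the numbering, here is a small self-contained sketch on inline data. The `births` CTE and its three rows are made up for illustration and only mimic the `year` and `month` columns of the natality table.

```sql
#standardSQL
WITH births AS (
  -- Hypothetical sample rows: two in (2000, 1) and one in (2000, 2).
  SELECT * FROM UNNEST([
    STRUCT(2000 AS year, 1 AS month, 7.1 AS weight_pounds),
    STRUCT(2000, 1, 6.4),
    STRUCT(2000, 2, 8.0)
  ])
)
SELECT *, ROW_NUMBER() OVER(PARTITION BY year, month) AS rn
FROM births
ORDER BY year, month, rn
```

The numbering restarts in every partition: the two rows for (2000, 1) get rn 1 and 2, and the single row for (2000, 2) gets rn 1 again. Which of the two January rows receives 1 is not defined, since the window has no ORDER BY.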


"But now many rows have the same row number and all I wanted was a different id for each row"

Ok, ok. Let's use partitions to give a row number to each row, and let's combine that row number with the partition fields to get a unique id per row:

SELECT *
  , FORMAT('%i-%i-%i', year, month, ROW_NUMBER() OVER(PARTITION BY year, month)) id
FROM `publicdata.samples.natality` 
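
As for the cumulative sum mentioned in the question: in standard SQL a row number may not be needed at all, because SUM() can be used as a window function with an ORDER BY and a frame. The following is a sketch under that assumption, reusing the Wikipedia sample from above. The aliases `edits` and `cumulative_edits` are illustrative names.

```sql
#standardSQL
SELECT
  contributor_username,
  edits,
  -- Running total from the most active contributor downward.
  -- The username is a tie-breaker so that equal counts get a stable order.
  SUM(edits) OVER(
    ORDER BY edits DESC, contributor_username
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_edits
FROM (
  SELECT contributor_username, COUNT(*) AS edits
  FROM `publicdata.samples.wikipedia`
  GROUP BY contributor_username
  ORDER BY edits DESC
  LIMIT 5)
ORDER BY edits DESC, contributor_username
```

The explicit ROWS frame makes each row's total include exactly the rows up to and including itself. On a large table, adding a PARTITION BY to the window keeps each running total within one group and avoids collecting all rows in a single place.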


The original 2013 solution (legacy SQL syntax):

Good news: BigQuery now has a row_number function.

Simple example:

SELECT [field], ROW_NUMBER() OVER()
FROM [table]
GROUP BY [field]

More complex, working example:

SELECT
  ROW_NUMBER() OVER() row_number,
  contributor_username,
  count,
FROM (
  SELECT contributor_username, COUNT(*) count,
  FROM [publicdata:samples.wikipedia]
  GROUP BY contributor_username
  ORDER BY COUNT DESC
  LIMIT 5)
