How to add a unique integer ID to query results - __efficiently__?
Question
Given a query, select * from ... (which might be part of a CTAS statement), the goal is to add an additional column, ID, where ID is a unique integer.
select ... as ID,* from ...
P.S.
- ID does not have to be sequential (there could be gaps)
- ID could be arbitrary (it doesn't have to represent a specific order within the result set)
row_number logically solves the problem:
select row_number() over () as ID,* from ...
The problem is that, at least for now, a global row_number (no partition by) is implemented using a single reducer (Hive) / task (Spark).
Recommended answer
If you are using Spark SQL, your best bet is the built-in function monotonically_increasing_id, which generates a unique 64-bit integer ID in a separate column. The IDs are monotonically increasing but not consecutive. Since you said the ID does not need to be sequential, this should meet your requirement.
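As a rough illustration of why this function avoids the single-task bottleneck: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, so every partition can number its own rows without coordinating with the others. The pure-Python sketch below only mimics that layout. The function name simulate_ids and the sample partitions are hypothetical, and this is not Spark's actual code.

```python
from typing import Iterable, List


def simulate_ids(partitions: Iterable[List[str]]) -> List[int]:
    """Mimic the documented ID layout of monotonically_increasing_id:
    partition index in the upper 31 bits, row position in the lower 33 bits."""
    ids = []
    for partition_index, rows in enumerate(partitions):
        # Each partition derives its IDs from its own index only,
        # so no global counter (and no single reducer/task) is needed.
        base = partition_index << 33
        for row_position, _ in enumerate(rows):
            ids.append(base + row_position)
    return ids


# Hypothetical data: three partitions of different sizes.
sample = [["a", "b", "c"], ["d"], ["e", "f"]]
ids = simulate_ids(sample)
print(ids)
```

The resulting IDs are unique and increasing but have large gaps between partitions, which matches the requirements stated in the question. In Spark SQL itself the call would look like select monotonically_increasing_id() as ID, * from ...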