How to add an integer unique id to query results - __efficiently__?


Question

Given a query, select * from ... (which might be part of a CTAS statement)

The goal is to add an additional column, ID, where ID is a unique integer.

select ... as ID,* from ...

P.S.

  • ID does not have to be sequential (there could be gaps)
  • The ID could be arbitrary (doesn't have to represent a specific order within the result set)

row_number logically solves the problem:

select row_number() over () as ID,* from ...

The problem is that, at least for now, a global row_number (no partition by) is implemented using a single reducer (Hive) / task (Spark), so it does not scale.

Solution

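If you are using Spark SQL, your best bet is the built-in function

monotonically_increasing_id

which generates a unique 64-bit integer ID for each row in a separate column. The IDs are not random: they are guaranteed to be unique and monotonically increasing, but not consecutive. Each partition computes its IDs independently, so no single task has to see all the rows. Since you said the ID does not need to be sequential, this should meet your requirement.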
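To see why such IDs can be unique without funnelling every row through one task, here is a minimal plain-Python sketch of the layout that the Spark documentation describes for the current implementation of monotonically_increasing_id: the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. This is an illustration only, not Spark code; the names `RECORD_BITS`, `partition_ids`, `assign_ids` and the sample data are assumptions made up for this example.

```python
from typing import Iterator, List, Tuple

# Assumption taken from the Spark docs: the lower 33 bits hold the record
# number within a partition, the upper bits hold the partition index.
RECORD_BITS = 33


def partition_ids(partition_index: int, num_rows: int) -> Iterator[int]:
    """Yield IDs for one partition using only local information."""
    base = partition_index << RECORD_BITS
    for record_number in range(num_rows):
        yield base + record_number


def assign_ids(partitions: List[List[str]]) -> List[Tuple[int, str]]:
    """Attach an ID to every row; each partition is handled independently."""
    out: List[Tuple[int, str]] = []
    for index, rows in enumerate(partitions):
        out.extend(zip(partition_ids(index, len(rows)), rows))
    return out


# Hypothetical data: three partitions of different sizes.
partitions = [["a", "b", "c"], ["d", "e"], ["f"]]
result = assign_ids(partitions)
ids = [row_id for row_id, _ in result]
```

The resulting IDs are unique and increasing but have large gaps between partitions, which matches the relaxed requirements in the question. In Spark SQL itself the function can be called directly in the query, along the lines of select monotonically_increasing_id() as ID,* from ...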
