How to add an integer unique id to query results - efficiently?

Views: 197 | Posted: 2018/6/12 14:06:38 | Tags: hadoop apache-spark hive apache-spark-sql hiveql

Problem description

Given a query, select * from ... (which might be part of a CTAS statement)

The goal is to add an additional column, ID, where ID is a unique integer.

select ... as ID,* from ...

P.s.

ID does not have to be sequential (there could be gaps)
The ID could be arbitrary (doesn't have to represent a specific order within the result set)

row_number logically solves the problem -

select row_number() over () as ID,* from ...

The problem is that, at least for now, a global row_number (no partition by) is implemented using a single reducer (Hive) / task (Spark).
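To illustrate why an unpartitioned row_number tends to end up on a single worker, here is a minimal sketch in plain Python (not Hive or Spark code). The `partitions` list and both helper functions are hypothetical and exist only for this illustration: a shared counter gives unique IDs but forces one sequential pass, while independent per-partition counters run in parallel but collide.

```python
from itertools import count

# Hypothetical data: three partitions, as if split across three tasks.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]


def number_globally(parts):
    """One shared counter: every row passes through a single sequence."""
    counter = count(1)
    return [(next(counter), row) for part in parts for row in part]


def number_per_partition(parts):
    """Each partition counts from 1 on its own, with no coordination."""
    return [(i, row) for part in parts for i, row in enumerate(part, start=1)]


global_ids = [i for i, _ in number_globally(partitions)]
local_ids = [i for i, _ in number_per_partition(partitions)]

print(global_ids)  # unique, but produced by one sequential pass
print(local_ids)   # could be computed in parallel, but IDs repeat across partitions
```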

Solution

If you are using Spark SQL, your best bet is the built-in function

monotonically_increasing_id

which generates a unique 64-bit integer ID for each row in a separate column. The IDs are monotonically increasing rather than random, and they are not consecutive. Since you said the ID does not need to be sequential, this should meet your requirement.
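As a rough picture of how such IDs can be unique without any coordination between tasks, the sketch below emulates in plain Python the layout that Spark's documentation describes for monotonically_increasing_id: the partition index goes in the upper bits and the position of the row within its partition goes in the lower 33 bits. This is not Spark code; `make_id` and the `partitions` list are hypothetical names used only for illustration.

```python
# Bits reserved for the position of a row inside its partition
# (the layout described in Spark's documentation for monotonically_increasing_id).
ROW_BITS = 33


def make_id(partition_index: int, row_index: int) -> int:
    """Pack the partition index into the upper bits, the row position into the lower 33."""
    if not 0 <= row_index < (1 << ROW_BITS):
        raise ValueError("row_index does not fit in 33 bits")
    return (partition_index << ROW_BITS) | row_index


# Hypothetical data: three partitions, each numbered independently.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]

ids = [
    make_id(p, r)
    for p, part in enumerate(partitions)
    for r, _ in enumerate(part)
]

print(ids)
```

In Spark SQL the query from the question would then look like select monotonically_increasing_id() as ID, * from ... Note that the generated values depend on how the data is partitioned, so they should not be expected to be stable across runs.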
