Spark Dataset unique ID performance - row_number vs monotonically_increasing_id


Question

I want to assign a unique ID to my dataset rows. I know that there are two implementation options:

  1. First option:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

ds.withColumn("id", row_number().over(Window.orderBy("a column")))

  2. Second option:

import org.apache.spark.sql.functions.monotonically_increasing_id

df.withColumn("id", monotonically_increasing_id())

The second option does not produce a sequential ID, and that doesn't really matter.

I'm trying to figure out whether there are any performance issues with these implementations, that is, whether one of the options is very slow compared to the other. Something more meaningful than: "monotonically_increasing_id is very fast compared to row_number because it's not sequential, or ..."

Answer

monotonically_increasing_id is distributed: it operates on each partition of the data independently, so no cross-partition coordination is needed.
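This works because the generated 64-bit ID encodes the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal Python sketch of that bit layout (an illustration, not Spark's actual implementation):

```python
def monotonically_increasing_id(partition_id: int, row_in_partition: int) -> int:
    """Sketch of how Spark builds the 64-bit ID: the partition ID goes into
    the upper 31 bits and the per-partition record number into the lower
    33 bits, so each partition can number its rows without talking to others."""
    return (partition_id << 33) | row_in_partition


# First row of partition 0 gets ID 0; first row of partition 1 gets 2^33.
print(monotonically_increasing_id(0, 0))  # 0
print(monotonically_increasing_id(1, 0))  # 8589934592
```

The IDs are guaranteed unique and monotonically increasing within each partition, but not consecutive across partitions, which is exactly why no shuffle is required.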

row_number() using a Window function without partitionBy (as in your case) is not distributed. When we don't define partitionBy, all the data is sent to one executor to generate the row numbers.
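To see why the missing partitionBy matters, here is a toy Python sketch (plain lists standing in for partitions, not real Spark code): the row_number path must first gather every row in one place to establish a total order, while the monotonically_increasing_id path lets each partition number its rows independently.

```python
partitions = [["b", "d"], ["a", "c"], ["e"]]  # toy dataset split into 3 partitions

# Option 2 style: each partition numbers its own rows in parallel,
# packing the partition ID into the upper bits -- no shuffle needed.
parallel_ids = {row: (pid << 33) | i
                for pid, part in enumerate(partitions)
                for i, row in enumerate(part)}

# Option 1 style: a global row_number over Window.orderBy forces every
# row onto a single worker so one total order can be established.
gathered = sorted(row for part in partitions for row in part)
sequential_ids = {row: i + 1 for i, row in enumerate(gathered)}
```

Both dictionaries assign unique IDs, but only the second required collecting the whole dataset on one worker first; that single-executor step is the bottleneck the answer describes.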

Thus, it is certain that monotonically_increasing_id() will perform better than row_number() without partitionBy defined.

