Spark Dataset unique id performance - row_number vs monotonically_increasing_id


Question

I want to assign a unique Id to my dataset rows. I know that there are two implementation options:

  1. First option:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

ds.withColumn("id", row_number().over(Window.orderBy("a column")))

  2. Second option:

import org.apache.spark.sql.functions.monotonically_increasing_id

df.withColumn("id", monotonically_increasing_id())

The second option does not produce a sequential ID, but that doesn't really matter here.

I'm trying to figure out whether there are any performance issues with these implementations; that is, whether one of the options is very slow compared to the other. I'm looking for something more meaningful than: "monotonically_increasing_id is very fast over row_number because it's not sequential or ..."

Answer

monotonically_increasing_id is distributed: it executes according to the partitioning of the data.

row_number() using a Window function without partitionBy (as in your case) is not distributed. When no partitionBy is defined, all the data is sent to a single executor to generate the row numbers.

Thus, it is certain that monotonically_increasing_id() will perform better than row_number() with no partitionBy defined.
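To see why monotonically_increasing_id stays fully distributed, it helps to look at how the IDs are composed. Per Spark's documentation, the generated ID puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below reproduces that layout in plain Python (illustrative only, not Spark's actual code):

```python
# Sketch of monotonically_increasing_id's documented bit layout:
# partition ID in the upper 31 bits, record number within the
# partition in the lower 33 bits.

def monotonic_ids(partition_sizes):
    """Yield the IDs Spark would generate for partitions of the given sizes."""
    for partition_id, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            yield (partition_id << 33) | row_in_partition

# Three partitions holding 2, 3 and 1 rows. Each partition needs only
# its own partition ID and a local counter, so no shuffle is required.
ids = list(monotonic_ids([2, 3, 1]))
print(ids)  # -> [0, 1, 8589934592, 8589934593, 8589934594, 17179869184]
```

The IDs are unique and monotonically increasing but not consecutive across partitions, which is exactly the trade-off in the question: each executor can assign IDs independently, whereas row_number() over an unpartitioned window must first collect every row into one partition.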

