Spark Dataset unique ID performance - row_number vs monotonically_increasing_id


Question

I want to assign a unique ID to my dataset rows. I know that there are two implementation options:

  1. First option:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

ds.withColumn("id", row_number().over(Window.orderBy("a column")))

  2. Second option:

import org.apache.spark.sql.functions.monotonically_increasing_id

df.withColumn("id", monotonically_increasing_id())

The second option does not produce a sequential ID, and that doesn't really matter.

I'm trying to figure out whether there are any performance issues with these implementations, that is, whether one of the options is very slow compared to the other. Something more meaningful than: "monotonically_increasing_id is very fast compared to row_number because it's not sequential, or ..."

Answer

monotonically_increasing_id is distributed: it operates on each partition of the data independently, so no cross-partition coordination is needed.
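This works because the generated 64-bit ID encodes the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal Python sketch of that bit layout (an illustration, not Spark's actual implementation):

```python
def monotonically_increasing_id(partition_id: int, row_in_partition: int) -> int:
    """Sketch of how Spark builds the 64-bit ID: the partition ID goes into
    the upper 31 bits and the per-partition record number into the lower
    33 bits, so each partition can number its rows without talking to others."""
    return (partition_id << 33) | row_in_partition


# First row of partition 0 gets ID 0; first row of partition 1 gets 2^33.
print(monotonically_increasing_id(0, 0))  # 0
print(monotonically_increasing_id(1, 0))  # 8589934592
```

The IDs are guaranteed unique and monotonically increasing within each partition, but not consecutive across partitions, which is exactly why no shuffle is required.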

row_number() using a Window function without partitionBy (as in your case) is not distributed. When we don't define partitionBy, all the data is sent to one executor to generate the row numbers.
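To see why the missing partitionBy matters, here is a toy Python sketch (plain lists standing in for partitions, not real Spark code): the row_number path must first gather every row in one place to establish a total order, while the monotonically_increasing_id path lets each partition number its rows independently.

```python
partitions = [["b", "d"], ["a", "c"], ["e"]]  # toy dataset split into 3 partitions

# Option 2 style: each partition numbers its own rows in parallel,
# packing the partition ID into the upper bits -- no shuffle needed.
parallel_ids = {row: (pid << 33) | i
                for pid, part in enumerate(partitions)
                for i, row in enumerate(part)}

# Option 1 style: a global row_number over Window.orderBy forces every
# row onto a single worker so one total order can be established.
gathered = sorted(row for part in partitions for row in part)
sequential_ids = {row: i + 1 for i, row in enumerate(gathered)}
```

Both dictionaries assign unique IDs, but only the second required collecting the whole dataset on one worker first; that single-executor step is the bottleneck the answer describes.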

Thus, it is certain that monotonically_increasing_id() will perform better than row_number() without partitionBy defined.

