Spark Dataset unique id performance - row_number vs monotonically_increasing_id


Question

I want to assign a unique Id to my dataset rows. I know that there are two implementation options:

  1. First option:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

ds.withColumn("id", row_number().over(Window.orderBy("a column")))

  2. Second option:

import org.apache.spark.sql.functions.monotonically_increasing_id

df.withColumn("id", monotonically_increasing_id())

The second option does not produce a sequential ID, but that doesn't really matter here.

I'm trying to figure out whether there are any performance issues with these implementations; that is, whether one of the options is very slow compared to the other. I'm looking for something more meaningful than: "monotonically_increasing_id is very fast over row_number because it's not sequential or ..."

Answer

monotonically_increasing_id is distributed: it executes according to the partitioning of the data.

row_number() using a Window function without partitionBy (as in your case) is not distributed. When no partitionBy is defined, all the data is sent to a single executor to generate the row numbers.

Thus, it is certain that monotonically_increasing_id() will perform better than row_number() with no partitionBy defined.
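To see why monotonically_increasing_id stays fully distributed, it helps to look at how the IDs are composed. Per Spark's documentation, the generated ID puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below reproduces that layout in plain Python (illustrative only, not Spark's actual code):

```python
# Sketch of monotonically_increasing_id's documented bit layout:
# partition ID in the upper 31 bits, record number within the
# partition in the lower 33 bits.

def monotonic_ids(partition_sizes):
    """Yield the IDs Spark would generate for partitions of the given sizes."""
    for partition_id, size in enumerate(partition_sizes):
        for row_in_partition in range(size):
            yield (partition_id << 33) | row_in_partition

# Three partitions holding 2, 3 and 1 rows. Each partition needs only
# its own partition ID and a local counter, so no shuffle is required.
ids = list(monotonic_ids([2, 3, 1]))
print(ids)  # -> [0, 1, 8589934592, 8589934593, 8589934594, 17179869184]
```

The IDs are unique and monotonically increasing but not consecutive across partitions, which is exactly the trade-off in the question: each executor can assign IDs independently, whereas row_number() over an unpartitioned window must first collect every row into one partition.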

