Calculate UDF once


Problem description

I want to have a UUID column in a pyspark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column.

Here is what I'm trying to do:

>>> import uuid
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())
>>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
>>> a = a.withColumn('id', uuid_udf())
>>> a.collect()
[Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50-bae2-0ced7d72ef4f')]
>>> b = a.select('id')
>>> b.collect()
[Row(id='12ec9913-21e1-47bd-9c59-6ddbe2365247')]  # Wanted this to be the same ID as above

Possible workaround: rand()

A possible workaround might be to use pyspark.sql.functions.rand() as my source of randomness. However, there are two problems:

1) I'd like to have letters, not just numbers, in the UUID, so that it doesn't need to be quite as long

2) Though it technically works, it produces ugly UUIDs:

>>> from pyspark.sql.functions import rand, round
>>> a = a.withColumn('id', round(rand() * 10e16))
>>> a.collect()
[Row(col1=1, col2=2, id=7.34745165108606e+16)]
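As an aside, a variant of this workaround (a sketch of my own, not from the original post) converts the random number to base 16 with conv(), which gives a shorter ID that contains letters, per point 1, though it is still not a real UUID:

>>> from pyspark.sql.functions import rand, floor, conv
>>> # Render the random number as a hex string: shorter, and it contains letters.
>>> a = a.withColumn('id', conv(floor(rand() * 10e16).cast('string'), 10, 16))
>>> a.collect()  # yields a hex-string id such as '10517fa98a3d0' (illustrative value)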

Recommended answer

Use Spark's built-in uuid function instead:

>>> from pyspark.sql.functions import expr
>>> a = a.withColumn('id', expr("uuid()"))
>>> b = a.select('id')
>>> b.collect()
[Row(id='da301bea-4927-4b6b-a1cf-518dea8705c4')]
>>> a.collect()
[Row(col1=1, col2=2, id='da301bea-4927-4b6b-a1cf-518dea8705c4')]
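One caveat worth noting (my addition, not part of the original answer): uuid() is also a non-deterministic expression; the values above stay stable across collects because, as I understand it, Spark fixes the expression's random seed when the plan is analyzed. For extra insurance against recomputation, caching the dataframe after adding the column is a common, if not bulletproof, safeguard:

>>> # Sketch: cache so later actions reuse the materialized rows
>>> # (not guaranteed if cached partitions are evicted).
>>> a = a.withColumn('id', expr("uuid()")).cache()
>>> a.count()  # force materialization
1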
