pySpark withColumn with a function


Problem description

I have a dataframe with 2 columns, account_id and email_address. Now I want to add one more column, updated_email_address, by calling a function on email_address to compute the updated value. Here is my code:

def update_email(email):
  print("== email to be updated: " + email)
  today = datetime.date.today()
  updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
  return updated

df.withColumn('updated_email_address', update_email(df.email_address))

But the result showed the updated_email_address column as null:

+---------------+--------------+---------------------+
|account_id     |email_address |updated_email_address|
+---------------+--------------+---------------------+
|123456gd7tuhha |abc@test.com  |null                 |
|djasevneuagsj1 |cde@test.com  |null                 |
+---------------+--------------+---------------------+

Printing email inside the update_email function gave:

Column<b'(email_address + == email to be udpated: )'>

It also showed the DataFrame's column data types as:

dfData:pyspark.sql.dataframe.DataFrame
account_id:string
email_address:string
updated_email_address:double

Why is the updated_email_address column of type double?

Answer

You're calling a plain Python function with a Column as its argument, so everything inside update_email builds Column expressions instead of operating on string values. In particular, + between a string column and a string literal is arithmetic addition in Spark SQL: both operands are cast to double, which is why the new column is typed double and evaluates to null for every row. You have to create a udf from update_email and then use it:

from pyspark.sql.functions import udf

update_email_udf = udf(update_email)
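
For reference, here is a minimal end-to-end sketch of the UDF route. Note that the body of update_email also has to be rewritten with plain Python string operations, since the Column-based substring function does not apply to the ordinary Python strings a UDF receives:

import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def update_email(email):
    # last 8 characters of the address + current month + day + "_updated"
    today = datetime.date.today()
    return email[-8:] + today.strftime('%m') + today.strftime('%d') + "_updated"

update_email_udf = udf(update_email, StringType())

df.withColumn('updated_email_address', update_email_udf(df.email_address)).show()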

However, I'd suggest not using a UDF for such a transformation; you can do it with Spark built-in functions alone (UDFs are known for poor performance):

from pyspark.sql.functions import col, concat, current_date, date_format, lit, substring

df.withColumn('updated_email_address',
              concat(substring(col("email_address"), -8, 8), date_format(current_date(), "ddMM"), lit("_updated"))
             ).show()
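
Here substring(col("email_address"), -8, 8) takes the last 8 characters of the address, date_format(current_date(), "ddMM") formats the current day and month (the original Python function used month then day, so use "MMdd" if you want to match it exactly), and lit("_updated") appends the literal suffix.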

You can find all the Spark SQL built-in functions in the pyspark.sql.functions documentation.
