Converting epoch to datetime in PySpark data frame using udf


Problem description

I have a PySpark dataframe with this schema:

root
 |-- epoch: double (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)

where epoch is in seconds and should be converted to a datetime. To do so, I define a user-defined function (udf) as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime(df.epoch)).show()

I get this error:

---> 21     return time.localtime(x)
    22     # return x * 0 + 1
    23 
    TypeError: a float is required

If I simply return x + 1 in the function, it works. Trying float(x), float(str(x)), or numpy.float(x) inside time.localtime(x) does not help, and I still get the error. Outside of the udf, time.localtime(1.514687216E9) or other numbers works fine. Using the datetime package to convert the epoch to a datetime results in similar errors.

It seems that the time and datetime packages do not like to be fed a DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.
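
(A working udf variant, for comparison: the failure above is consistent with the plain epoch_to_datetime being applied to the Column instead of the udf-wrapped version, and with the declared DoubleType not matching the struct_time the function returns. A minimal sketch that fixes both, assuming a formatted string is an acceptable result:)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time

def epoch_to_datetime(x):
    # time.localtime returns a struct_time, not a double; format it into a string
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))

# declare StringType to match what the function actually returns,
# and apply the udf-wrapped function rather than the plain Python one
epoch_to_datetime_udf = udf(epoch_to_datetime, StringType())
df.withColumn("datetime", epoch_to_datetime_udf(df.epoch)).show()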

Recommended answer

You don't need a udf function for that.

All you need is to cast the double epoch column to TimestampType() and then use the date_format function, as below:

from pyspark.sql import functions as f
from pyspark.sql import types as t

# cast the epoch seconds to a timestamp, then format that timestamp as a string
df.withColumn('epoch', f.date_format(df.epoch.cast(dataType=t.TimestampType()), "yyyy-MM-dd"))

This gives you the date as a string:

root
 |-- epoch: string (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)
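
If you want to keep the time of day instead of just the date, the same cast works with a fuller format pattern (a small variation on the snippet above, using the same df):

from pyspark.sql import functions as f
from pyspark.sql import types as t

# keep hours, minutes and seconds in the formatted string
df.withColumn('datetime', f.date_format(df.epoch.cast(t.TimestampType()), "yyyy-MM-dd HH:mm:ss"))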

Alternatively, you can use the to_date function as follows:

from pyspark.sql import functions as f
from pyspark.sql import types as t

# cast to a timestamp, then truncate it to a date
df.withColumn('epoch', f.to_date(df.epoch.cast(dataType=t.TimestampType())))

which gives you a date data type in the epoch column:

root
 |-- epoch: date (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)
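
As a quick end-to-end check, here is a minimal self-contained sketch (assuming a local SparkSession; the exact date printed depends on the Spark session's time zone):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import types as t

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.514687216E9, 1.0, 2.0)], ['epoch', 'var1', 'var2'])

# 1.514687216E9 seconds is 2017-12-31 02:26:56 UTC
df.withColumn('epoch', f.to_date(df.epoch.cast(t.TimestampType()))).show()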

Hope the answer helps.
