Converting epoch to datetime in PySpark data frame using udf

Problem description

I have a PySpark dataframe with this schema:

root
 |-- epoch: double (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)

Where epoch is in seconds and should be converted to datetime. In order to do so, I define a user defined function (udf) as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime(df.epoch)).show()

I get this error:

---> 21     return time.localtime(x)
     22     # return x * 0 + 1
     23

TypeError: a float is required

If I simply return x + 1 in the function, it works. Trying float(x), float(str(x)), or numpy.float(x) in time.localtime(x) does not help, and I still get an error. Outside of the udf, time.localtime(1.514687216E9) or other numbers work fine. Using the datetime package to convert the epoch to a datetime results in similar errors.

It seems that the time and datetime packages do not like to be fed DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.
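
For reference, the likely root cause: df.withColumn("datetime", epoch_to_datetime(df.epoch)) calls the plain Python function, so time.localtime receives a Column object rather than a float (returning x + 1 only appears to work because Column arithmetic is overloaded). Going through the registered udf, with a return type that matches what the function actually returns, does work; a minimal sketch, assuming a formatted string is the goal:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time

# The declared return type must match what the function actually
# returns: strftime yields a string, so declare StringType.
def epoch_to_datetime(x):
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))

epoch_to_datetime_udf = udf(epoch_to_datetime, StringType())

# Call the registered udf, not the plain Python function
df.withColumn("datetime", epoch_to_datetime_udf(df.epoch)).show()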

Recommended answer

You don't need a udf function for that.

All you need to do is cast the double epoch column to TimestampType() and then use the date_format function, as below:

from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.date_format(df.epoch.cast(dataType=t.TimestampType()), "yyyy-MM-dd"))

This will give you a string date:

root
 |-- epoch: string (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)
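
Since the question asks for a datetime rather than just a date, the same approach extends to a full date-and-time string; a small variation on the snippet above, assuming the session time zone is acceptable:

from pyspark.sql import functions as f
from pyspark.sql import types as t

# Same cast, but with a full date-and-time pattern
df.withColumn('datetime', f.date_format(df.epoch.cast(dataType=t.TimestampType()),
                                        "yyyy-MM-dd HH:mm:ss"))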

And you can use the to_date function as follows:

from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.to_date(df.epoch.cast(dataType=t.TimestampType())))

This will give the epoch column a date datatype:

root
 |-- epoch: date (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)
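
And if a true timestamp column is preferred over a string or a date, the cast alone is enough; a minimal sketch:

from pyspark.sql import types as t

# No formatting needed: the cast yields a TimestampType column
df.withColumn('epoch', df.epoch.cast(dataType=t.TimestampType()))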

I hope the answer helps.
