Converting epoch to datetime in PySpark data frame using udf
Problem description
I have a PySpark dataframe with this schema:
root
|-- epoch: double (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
Where epoch is in seconds and should be converted to datetime. In order to do so, I define a user defined function (udf) as follows:
from pyspark.sql.functions import udf
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime(df2.epoch)).show()
I get this error:
---> 21 return time.localtime(x)
22 # return x * 0 + 1
23
TypeError: a float is required
If I simply return x + 1 in the function, it works. Trying float(x) or float(str(x)) or numpy.float(x) in time.localtime(x) does not help and I still get an error. Outside of the udf, time.localtime(1.514687216E9) or other numbers works fine. Using the datetime package to convert epoch to datetime results in similar errors.
It seems that the time and datetime packages do not like to be fed with DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.
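One likely culprit (an assumption from reading the snippet, not something the traceback alone proves): the withColumn call invokes the plain Python function epoch_to_datetime instead of the wrapped epoch_to_datetime_udf, so time.localtime receives a pyspark Column object rather than a float. A minimal stdlib-only sketch of that failure mode, with FakeColumn as a hypothetical stand-in for pyspark.sql.Column:

```python
import time

# With an actual float, time.localtime works as expected
t = time.localtime(1.514687216E9)
print(t.tm_year)  # 2017 or 2018, depending on the local timezone

class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column: defines no __float__."""

# With a non-numeric object, time.localtime raises TypeError
# (Python 2 words it as "a float is required")
try:
    time.localtime(FakeColumn())
except TypeError as exc:
    print("TypeError:", exc)
```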
Recommended answer
You don't need a udf function for that.
All you need is to cast the double epoch column to TimestampType() and then use the date_format function as below:
from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.date_format(df.epoch.cast(dataType=t.TimestampType()), "yyyy-MM-dd"))
This will give you a string date:
root
|-- epoch: string (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
You can use the to_date function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.to_date(df.epoch.cast(dataType=t.TimestampType())))
which would give you date as the datatype of the epoch column:
root
|-- epoch: date (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
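To double-check what the cast computes, here is the equivalent conversion done outside Spark with the stdlib datetime module, using the sample epoch value from the question. Note one assumption: Spark's cast and to_date interpret the timestamp in the session timezone, while this sketch pins UTC for reproducibility.

```python
from datetime import datetime, timezone

epoch = 1.514687216E9  # sample value from the question, in seconds
dt = datetime.fromtimestamp(epoch, tz=timezone.utc)

# The string that date_format(..., "yyyy-MM-dd") would produce (in UTC)
print(dt.strftime("%Y-%m-%d"))  # 2017-12-31

# The date value that to_date would produce (in UTC)
print(dt.date())
```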
Hope the answer helps.