Converting epoch to datetime in a PySpark data frame using a udf
Question
I have a PySpark dataframe with this schema:
root
|-- epoch: double (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
where epoch is in seconds and should be converted to a datetime. To do so, I define a user defined function (udf) as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf = udf(epoch_to_datetime, DoubleType())
df.withColumn("datetime", epoch_to_datetime_udf(df.epoch)).show()
I get this error:
---> 21 return time.localtime(x)
22 # return x * 0 + 1
23
TypeError: a float is required
If I simply return x + 1 in the function, it works. Trying float(x), float(str(x)), or numpy.float(x) inside time.localtime(x) does not help, and I still get an error. Outside of the udf, time.localtime(1.514687216E9) or other numbers works fine. Using the datetime package to convert epoch to a datetime results in similar errors.
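The failing and working cases can be reproduced outside Spark (a minimal sketch; the exact TypeError message varies across Python versions, and the cause here is that the raw function receives a Spark Column object rather than a float):

```python
import time

# A plain float epoch works fine outside the udf:
ts = time.localtime(1.514687216E9)
print(ts.tm_year)

# But time.localtime raises TypeError for non-numeric input,
# which is what happens when a Column object reaches it directly:
try:
    time.localtime("1.514687216E9")
except TypeError as e:
    print("TypeError:", e)
```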
It seems that the time and datetime packages do not like being fed a DoubleType from PySpark. Any ideas how I can solve this issue? Thanks.
Answer
You don't need a udf function for that.
All you need is to cast the double epoch column to TimestampType() and then use the date_format function, as below:
from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.date_format(df.epoch.cast(dataType=t.TimestampType()), "yyyy-MM-dd"))
This will give you a string date:
root
|-- epoch: string (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
And you can use the to_date function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
df.withColumn('epoch', f.to_date(df.epoch.cast(dataType=t.TimestampType())))
This will give the epoch column a date datatype:
root
|-- epoch: date (nullable = true)
|-- var1: double (nullable = true)
|-- var2: double (nullable = true)
I hope the answer helps.