PySpark dataframe convert unusual string format to Timestamp
Question
I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(daytetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp?
Something that can eventually come along the lines of
df = df.withColumn("date_time",df.daytetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.
I was thinking I could split the column in two using substring, counting backward from the end for the time part, then apply regexp_replace to each half separately and concatenate the results. But this seems like too many operations. Is there an easier way?
Recommended answer
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))
## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
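The two-step shape of the older approach (parse to epoch seconds, then cast back to a timestamp) can be illustrated without a cluster. The following is only a plain-Python analogy, not Spark code: it assumes UTC for simplicity, whereas Spark interprets the string in its own configured time zone, and the variable names are made up for this sketch.

```python
from datetime import datetime, timezone

raw = "2016_08_21 11_31_08"

# Step 1: parse the string into seconds since the epoch.
# This plays the role of unix_timestamp; UTC is an assumption of this sketch.
parsed = datetime.strptime(raw, "%Y_%m_%d %H_%M_%S").replace(tzinfo=timezone.utc)
epoch_seconds = int(parsed.timestamp())

# Step 2: turn the numeric seconds back into a timestamp value.
# This plays the role of the cast to "timestamp".
restored = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(epoch_seconds)
print(restored.strftime("%Y-%m-%d %H:%M:%S"))
```

Because the intermediate value is a whole number of seconds, this route would drop any sub-second precision, which is harmless for the format in the question.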
In both cases the format string should be compatible with Java SimpleDateFormat.
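To make the pattern letters concrete, here is a small local sanity check in plain Python. It is a sketch only: java_to_strptime is a hypothetical helper written for this example (it is not part of PySpark or the standard library), and it covers just the six tokens used in this question, not the full SimpleDateFormat syntax.

```python
import re
from datetime import datetime

# Hypothetical mapping for illustration: only the tokens used in this question.
TOKEN_MAP = {
    "yyyy": "%Y",  # four-digit year
    "MM": "%m",    # month of year (upper-case M)
    "dd": "%d",    # day of month
    "HH": "%H",    # hour of day, 0-23
    "mm": "%M",    # minute of hour (lower-case m)
    "ss": "%S",    # second of minute
}


def java_to_strptime(pattern):
    """Translate the handful of pattern tokens above into strptime directives."""
    return re.sub("yyyy|MM|dd|HH|mm|ss", lambda m: TOKEN_MAP[m.group(0)], pattern)


py_format = java_to_strptime("yyyy_MM_dd HH_mm_ss")
value = datetime.strptime("2016_08_21 11_31_08", py_format)
print(py_format)
print(value)
```

The point worth noticing is case sensitivity: MM is the month while mm is the minute, so writing yyyy_mm_dd for the date part would make the parser read minutes where the month should be.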