PySpark dataframe convert unusual string format to Timestamp

Question

I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:

[Row(datetime='2016_08_21 11_31_08')]

Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp? Something that could eventually look along the lines of

df = df.withColumn("date_time",df.datetime.astype('Timestamp'))

I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.

I was thinking I could split the column in two using substring, counting backward from the end of the time part, then run regexp_replace on each half separately and concatenate the results. But this seems like too many operations. Is there an easier way?

Recommended answer

Spark >= 2.2

from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))

## +-------------------+-------------------+
## |dt                 |parsed             |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+

Spark < 2.2

It is nothing that unix_timestamp cannot handle:

from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
    # For Spark <= 1.5
    # See issues.apache.org/jira/browse/SPARK-11724 
    .cast("double")
    .cast("timestamp"))
    .show(1, False))

## +-------------------+---------------------+
## |dt                 |parsed               |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+

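In both cases the format string should be compatible with Java SimpleDateFormat. Note that Spark 3.0 and later document their own datetime pattern syntax instead, so check the pattern against the documentation for the version in use.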
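As a quick local sanity check that needs no Spark session, the same parsing can be sketched in plain Python. This is an illustration under assumptions, not part of the answer above: it uses the standard library's `datetime.strptime` and `re.sub`, whose syntax differs from Spark's. The `strptime` directives are not SimpleDateFormat letters, and Python's `re` refers to capture groups as `\1`, whereas Spark's `regexp_replace` uses Java regex and `$1`. The second half shows that the regex idea from the question can be done in one pass with capture groups, without splitting the column.

```python
import re
from datetime import datetime

raw = "2016_08_21 11_31_08"

# strptime counterpart of the pattern "yyyy_MM_dd HH_mm_ss".
# The letters differ: MM -> %m (month), mm -> %M (minute), ss -> %S (second).
parsed = datetime.strptime(raw, "%Y_%m_%d %H_%M_%S")

# One-pass regex rewrite into the conventional "yyyy-MM-dd HH:mm:ss" layout.
# Capture groups keep the digits; only the separators are replaced.
pattern = r"(\d{4})_(\d{2})_(\d{2}) (\d{2})_(\d{2})_(\d{2})"
rewritten = re.sub(pattern, r"\1-\2-\3 \4:\5:\6", raw)

print(parsed)     # 2016-08-21 11:31:08
print(rewritten)  # 2016-08-21 11:31:08
```

Passing the format string directly to the parser, as the answer does, remains the simpler route; the regex variant is mainly useful when a later step requires the conventional string layout.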
