Parse Date Format

Problem Description

I have the following DataFrame containing dates in the format yyyyMMddTHH:mm:ss followed by a UTC offset (e.g. +0530).

Data Preparation

sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",),
                               ("20211011T00:00:00+0530",),
                               ("20200212T00:00:00+0300",),
                               ("20211021T00:00:00+0530",),
                               ("20211021T00:00:00+0900",),
                               ("20211021T00:00:00-0500",)
                              ]
                              ,['timestamp'])


sparkDF.show(truncate=False)

+----------------------+
|timestamp             |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+

I am aware of the date format to parse, and I can convert the values to DateType:

Timestamp Parsed

sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()

+----------------+
|timestamp_parsed|
+----------------+
|      2020-10-21|
|      2021-10-11|
|            null|
|      2021-10-21|
|            null|
|            null|
+----------------+

As you can see, it is specific to +0530 strings. I am aware that I can use multiple patterns and coalesce the first non-null value.

Multiple Patterns & Coalesce

sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
       .withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
       .withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
       .withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
       .withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
       .drop(*['p1','p2','p3','p4'])\
       .show(truncate=False)

+----------------------+----------------+
|timestamp             |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21      |
|20211011T00:00:00+0530|2021-10-11      |
|20200212T00:00:00+0300|2020-02-12      |
|20211021T00:00:00+0530|2021-10-21      |
|20211021T00:00:00+0900|2021-10-21      |
|20211021T00:00:00-0500|2021-10-21      |
+----------------------+----------------+

Is there a better way to accomplish this, since there might be a bunch of other UTC offsets within the data source? Is there a standard UTC/TZ pattern available within Spark to parse all the cases?

Solution

I think you have got the second argument of your to_date function wrong, which is causing the null values in your output.

The +0530 in your timestamp is the zone offset, which just denotes how many hours and minutes the timestamp is ahead of (for +) or behind (for -) UTC. For example, 20211021T00:00:00+0530 is midnight local time at an offset of 5 hours 30 minutes ahead of UTC, i.e. 2021-10-20 18:30 UTC. Please refer to the response by Basil on "Java / convert ISO-8601 (2010-12-16T13:33:50.513852Z) to Date object", which has full details on this.

To answer your question: if you replace +0530 in the pattern with Z, it should solve your problem.
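
In PySpark, a minimal sketch of the same fix (reusing the question's sparkDF and the pyspark.sql.functions alias F) would be:

# Z in the pattern matches zone offsets such as +0530 or -0500
sparkDF.select(
    F.to_date(F.col('timestamp'), "yyyyMMdd'T'HH:mm:ssZ").alias('timestamp_parsed')
).show()

Note that to_date parses the offset and then truncates the resulting instant to a date in the Spark session time zone, so rows with large offsets can land on the neighbouring calendar day depending on spark.sql.session.timeZone.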

Here is the Spark code in Scala that I tried, and it worked:

val data = Seq("20201021T00:00:00+0530",
                               "20211011T00:00:00+0530",
                               "20200212T00:00:00+0300",
                               "20211021T00:00:00+0530",
                               "20211021T00:00:00+0900",
                               "20211021T00:00:00-0500")

import spark.implicits._
val sparkDF = data.toDF("custom_time")

import org.apache.spark.sql.functions._
// Z in the pattern matches zone offsets such as +0530 or -0500
val spark_DF2 = sparkDF.withColumn("new_timestamp", to_date($"custom_time", "yyyyMMdd'T'HH:mm:ssZ"))

spark_DF2.show(false)

In the resulting output there are no null values.
