将日期分为年,月和日,分隔符不一致 [英] Splitting Date into Year, Month and Day, with inconsistent delimiters

查看:96
本文介绍了将日期分为年,月和日,分隔符不一致的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将我的日期列(现在是字符串类型)拆分为3列年,月和日期.我使用(PySpark):

I am trying to split my Date Column which is a String Type right now into 3 columns Year, Month and Date. I use (PySpark):

split_date=pyspark.sql.functions.split(df['Date'], '-')     
df= df.withColumn('Year', split_date.getItem(0))
df= df.withColumn('Month', split_date.getItem(1))
df= df.withColumn('Day', split_date.getItem(2))

我遇到了一个问题,因为我的一半约会用'-'分隔,另一半用'/'分隔.我如何使用和或操作根据用例将日期分隔为-"或"/".另外,当其以'/'分隔时,格式为mm/dd/yyyy,而当以'-'分隔时,格式为yyyy-mm-dd.

I run into an issue, because half my dates are separated by '-' and the other half are separated by '/'. How can I use and or operation to split the Date by either '-' or '/' depending on the use case. Additionaly, when its separated by '/', the format is mm/dd/yyyy and when separated by '-', the format is yyyy-mm-dd.

我希望将日期"列分为日,月和年.

I want the Date column to be separated into Day, Month and Year.

推荐答案

您只需要一点点额外的编码即可识别日期格式的类型.例如,假设您的数据采用以下格式-

You just need little bit of extra coding to recognize type of date format . for example, lets say your data is in below format -

data = [("2008-05-01",1),("2018-01-01",2),("03/14/2017",3),("01/01/2018",4)]
df = spark.createDataFrame(data,schema=['date','key'])

df.show()

:

+----------+---+
|      date|key|
+----------+---+
|2008-05-01|  1|
|2018-01-01|  2|
|03/14/2017|  3|
|01/01/2018|  4|
+----------+---+

:

from pyspark.sql.functions import *
from pyspark.sql.types import *

# udf that recognise pattern and return list of year,month and day
def splitUDF(row):
    if "/" in row:
        mm,dd,yyyy = row.split("/")
    elif "-" in row:
        yyyy,mm,dd = row.split("-")

    return [yyyy,mm,dd]


datSplitterUDF = udf(lambda row : splitUDF(row),ArrayType(StringType()))
df\
.select(datSplitterUDF(df.date).alias("dt"))\
.withColumn('year',col('dt').getItem(0).cast('int'))\
.withColumn('month',col('dt').getItem(1).cast('int'))\
.withColumn('day',col('dt').getItem(2).cast('int'))\
.show()

输出:

+--------------+----+-----+---+
|            dt|year|month|day|
+--------------+----+-----+---+
|[2008, 05, 01]|2008|    5|  1|
|[2018, 01, 01]|2018|    1|  1|
|[2017, 03, 14]|2017|    3| 14|
|[2018, 01, 01]|2018|    1|  1|
+--------------+----+-----+---+

这篇关于将日期分为年,月和日,分隔符不一致的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆