Cast column containing multiple string date formats to DateTime in Spark

Problem Description

I have a date column in my Spark DataFrame that contains multiple string formats. I would like to cast these to DateTime.

The two formats in my column are:

  • mm/dd/yyyy; and
  • yyyy-mm-dd

My solution so far is to use a UDF to change the first date format to match the second as follows:

import re
from datetime import datetime

from pyspark.sql.functions import to_date, udf

def parseDate(dateString):
    # Rewrite mm/dd/yyyy values as yyyy-mm-dd; pass everything else through
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}', dateString) is not None:
        return datetime.strptime(dateString, '%m/%d/%Y').strftime('%Y-%m-%d')
    else:
        return dateString

# Create Spark UDF based on above function
dateUdf = udf(parseDate)

df = raw_transactions_df.select(to_date(dateUdf(raw_transactions_df['trans_dt'])))

This works, but is not all that fault-tolerant. I am specifically concerned about:

  • Date formats I am yet to encounter.
  • Distinguishing between mm/dd/yyyy and dd/mm/yyyy (the regex I'm using clearly doesn't do this at the moment).

Is there a better way to do this?

Recommended Answer

Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:

from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])

This will pick the first format that can successfully parse the input string.
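
The to_date(col, format) overload used here only exists in Spark 2.2 and later. For older versions, the comment in the function points at the unix_timestamp-and-cast route; a minimal sketch of that variant (the name to_date_pre22 is mine, not part of the original answer) might look like:

from pyspark.sql.functions import coalesce, unix_timestamp

def to_date_pre22(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # unix_timestamp returns null when a format does not match,
    # so coalesce still picks the first successful parse (Spark < 2.2 sketch)
    return coalesce(
        *[unix_timestamp(col, f).cast("timestamp").cast("date") for f in formats]
    )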

Usage:

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()

+---+----------+----------+
| id|        dt|       pdt|
+---+----------+----------+
|  1|01/22/2010|2010-01-22|
|  2|2018-12-01|2018-12-01|
+---+----------+----------+

It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter.
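
To find values that still need a new entry in formats, one option (my own addition, not part of the original answer) is to look at rows where the coalesce above came back null, i.e. where none of the known formats matched:

from pyspark.sql.functions import col

# Distinct raw values that none of the known formats could parse;
# extend the formats tuple as new patterns show up here.
unparsed = df.withColumn("pdt", to_date_("dt")).where(
    col("pdt").isNull() & col("dt").isNotNull()
)
unparsed.select("dt").distinct().show()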

However, it won't help you with format ambiguities. In the general case, it might not be possible to resolve these without manual intervention and cross-referencing with external data.

The same thing can of course be done in Scala:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}

def to_date_(col: Column, 
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}
