Cast column containing multiple string date formats to DateTime in Spark


Question

I have a date column in my Spark DataFrame that contains multiple string formats. I would like to cast these to DateTime.

The two formats in my column are:

  • mm/dd/yyyy; and
  • yyyy-mm-dd

My solution so far is to use a UDF to change the first date format to match the second as follows:

import re
from datetime import datetime

from pyspark.sql.functions import to_date, udf

def parseDate(dateString):
    # Rewrite mm/dd/yyyy dates to yyyy-mm-dd; pass anything else through unchanged
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}', dateString) is not None:
        return datetime.strptime(dateString, '%m/%d/%Y').strftime('%Y-%m-%d')
    else:
        return dateString

# Create Spark UDF based on above function
dateUdf = udf(parseDate)

df = df.select(to_date(dateUdf(df['trans_dt'])))

This works, but is not all that fault-tolerant. I am specifically concerned about:

  • Date formats I have yet to encounter.
  • Distinguishing between mm/dd/yyyy and dd/mm/yyyy (the regex I'm using clearly doesn't do this at the moment).

Is there a better way to do this?

Answer

Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:

from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])
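
For Spark versions before 2.2, where to_date does not accept a format argument, a minimal sketch of the unix_timestamp-and-cast variant mentioned in the comment could look like this (the name to_date_pre22 is just for illustration):

from pyspark.sql.functions import coalesce, unix_timestamp

def to_date_pre22(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # unix_timestamp returns NULL for strings that don't match the format,
    # so coalesce still picks the first successful parse
    return coalesce(*[unix_timestamp(col, f).cast("timestamp").cast("date")
                      for f in formats])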

This will choose the first format that can successfully parse the input string.

Usage:

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()

+---+----------+----------+
| id|        dt|       pdt|
+---+----------+----------+
|  1|01/22/2010|2010-01-22|
|  2|2018-12-01|2018-12-01|
+---+----------+----------+

It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter.
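
For example, accepting one more format is a single change at the call site (the extra MM-dd-yyyy pattern here is purely illustrative):

# Hypothetical third format added for illustration
df.withColumn("pdt", to_date_("dt", formats=("MM/dd/yyyy", "yyyy-MM-dd", "MM-dd-yyyy"))).show()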

However, it won't help you with format ambiguities. In the general case it may not be possible to resolve these without manual intervention and cross-referencing with external data.
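
To see why, consider a value such as 01/02/2010: it parses successfully under both MM/dd/yyyy and dd/MM/yyyy, yielding different dates, and coalesce will silently keep whichever format comes first. A small sketch, reusing the spark session from above:

from pyspark.sql.functions import to_date

ambiguous = spark.createDataFrame([("01/02/2010",)], ["dt"])
ambiguous.select(
    to_date("dt", "MM/dd/yyyy").alias("as_mm_dd"),  # 2010-01-02
    to_date("dt", "dd/MM/yyyy").alias("as_dd_mm"),  # 2010-02-01
).show()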

The same thing can of course be done in Scala:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}

def to_date_(col: Column,
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")): Column = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}

