Cast column containing multiple string date formats to DateTime in Spark
Question
I have a date column in my Spark DataFrame that contains multiple string formats. I would like to cast these to DateTime.
The two formats in my column are:
- mm/dd/yyyy
- yyyy-mm-dd
My solution so far is to use a UDF to change the first date format to match the second as follows:
import re
from datetime import datetime

from pyspark.sql.functions import to_date, udf

def parseDate(dateString):
    # Reformat mm/dd/yyyy strings to yyyy-mm-dd; pass everything else through.
    # Note: %m is the month directive (%M would be minutes).
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}', dateString) is not None:
        return datetime.strptime(dateString, '%m/%d/%Y').strftime('%Y-%m-%d')
    else:
        return dateString

# Create Spark UDF based on above function
dateUdf = udf(parseDate)

df = df.select(to_date(dateUdf(raw_transactions_df['trans_dt'])))
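Outside Spark, the reformatting step this UDF is meant to perform can be checked in plain Python. One pitfall worth calling out: the strptime directive for the month is lower-case %m, while %M means minutes.

```python
from datetime import datetime

# Reformat a mm/dd/yyyy string to yyyy-mm-dd, as the UDF intends;
# %m parses the month field (%M would silently parse minutes instead)
reformatted = datetime.strptime("01/22/2010", "%m/%d/%Y").strftime("%Y-%m-%d")
print(reformatted)  # 2010-01-22
```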
This works, but it is not very fault-tolerant. I am specifically concerned about:
- Date formats I have yet to encounter.
- Distinguishing between mm/dd/yyyy and dd/mm/yyyy (the regex I'm using clearly doesn't do this at the moment).
Is there a better way to do this?
Answer
Personally I would recommend using SQL functions directly, without expensive and inefficient reformatting:
from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])
This will choose the first format that can successfully parse the input string.
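For intuition, this first-match-wins behaviour can be sketched in plain Python. This is a rough analogue only (my sketch, not Spark's implementation): Spark evaluates coalesce per row over columns, where each to_date yields null when its format does not match.

```python
from datetime import datetime

def parse_first_match(s, formats=("%m/%d/%Y", "%Y-%m-%d")):
    # Try each format in order and return the first successful parse.
    # Mirrors coalesce(to_date(col, f1), to_date(col, f2), ...):
    # a format that fails contributes null, so the next one is tried.
    for f in formats:
        try:
            return datetime.strptime(s, f).date()
        except ValueError:
            continue
    return None  # analogous to a null DateType value

print(parse_first_match("01/22/2010"))  # 2010-01-22
print(parse_first_match("2018-12-01"))  # 2018-12-01
```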
Usage:
df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()
+---+----------+----------+
| id| dt| pdt|
+---+----------+----------+
| 1|01/22/2010|2010-01-22|
| 2|2018-12-01|2018-12-01|
+---+----------+----------+
It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter.
However, it won't help you with format ambiguities. In the general case it might not be possible to resolve these without manual intervention and cross-referencing with external data.
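One pragmatic option for the ambiguity problem (an illustrative sketch of mine, not part of the original answer) is to flag slash dates that are plausible under both month-first and day-first readings, so those rows can be routed for manual review:

```python
import re

def is_ambiguous(s):
    # A slash date is ambiguous when both leading fields are valid
    # months (1-12) and differ, so mm/dd/yyyy and dd/mm/yyyy disagree
    m = re.match(r"(\d{1,2})/(\d{1,2})/\d{4}$", s)
    if not m:
        return False
    a, b = int(m.group(1)), int(m.group(2))
    return 1 <= a <= 12 and 1 <= b <= 12 and a != b

print(is_ambiguous("01/02/2010"))  # True: Jan 2 or Feb 1
print(is_ambiguous("01/22/2010"))  # False: 22 cannot be a month
```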
The same thing can of course be done in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}

def to_date_(col: Column,
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}