PySpark - to_date format from column
Question
I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
df = sc.parallelize([('a', '2018-01-01', 'yyyy-MM-dd'),
                     ('b', '2018-02-02', 'yyyy-MM-dd'),
                     ('c', '02-02-2018', 'dd-MM-yyyy')]).toDF(
    ["col_name", "value", "format"])
I am currently trying to add a new column where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with:
df = df.withColumn("test1", F.to_date(F.col("value"), "yyyy-MM-dd"))\
       .withColumn("test2", F.to_date(F.col("value"), "dd-MM-yyyy"))
This, however, gives me two new columns, but I want a single column containing both results. Passing the format column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
This throws the error "TypeError: 'Column' object is not callable".
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
Answer
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3", expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
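To make the row-wise behavior concrete, here is a plain-Python sketch of what to_date(value, format) computes for each row above. The PATTERN_MAP that translates Java/Spark date patterns to Python strptime codes is an illustrative assumption, not a Spark API:

```python
from datetime import datetime

# Hypothetical illustration: map the Java/Spark date patterns used above
# to their Python strptime equivalents.
PATTERN_MAP = {"yyyy-MM-dd": "%Y-%m-%d", "dd-MM-yyyy": "%d-%m-%Y"}

def to_date_py(value, fmt):
    # Parse a string with its per-row pattern, like to_date(value, format)
    return datetime.strptime(value, PATTERN_MAP[fmt]).date()

rows = [("a", "2018-01-01", "yyyy-MM-dd"),
        ("b", "2018-02-02", "yyyy-MM-dd"),
        ("c", "02-02-2018", "dd-MM-yyyy")]

for name, value, fmt in rows:
    print(name, to_date_py(value, fmt))
```

Each row is parsed with its own pattern, which is exactly what the expr approach achieves inside Spark without a Python udf.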
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark 1.5 and above
Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr

df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value, format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value, format)) as date) as test3 from df"
).show()
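For intuition, the three-step chain above can be sketched in plain Python. The helper below is hypothetical and parses in UTC, whereas Spark's unix_timestamp uses the session timezone, so treat this as an illustration of the transformation, not an exact reimplementation:

```python
import calendar
import time
from datetime import datetime, timezone

# Hypothetical Java-pattern -> strptime translation for the two formats above.
JAVA_TO_PY = {"yyyy-MM-dd": "%Y-%m-%d", "dd-MM-yyyy": "%d-%m-%Y"}

def to_date_via_unixtime(value, fmt):
    # unix_timestamp(value, format): parse the string to epoch seconds
    epoch = calendar.timegm(time.strptime(value, JAVA_TO_PY[fmt]))
    # from_unixtime(epoch): render as 'yyyy-MM-dd HH:mm:ss'
    ts = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    # cast(... as date): keep only the date portion
    return ts[:10]

print(to_date_via_unixtime("02-02-2018", "dd-MM-yyyy"))  # 2018-02-02
```

The round trip through epoch seconds is why this older approach still lets the per-row format column drive the parsing.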