PySpark - to_date format from column

Question

I am currently trying to figure out how to pass the string format argument to the to_date pyspark function via a column parameter.

Specifically, I have the following setup:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # toDF on a bare RDD needs an active SparkSession
sc = spark.sparkContext
df = sc.parallelize([('a', '2018-01-01', 'yyyy-MM-dd'),
                     ('b', '2018-02-02', 'yyyy-MM-dd'),
                     ('c', '02-02-2018', 'dd-MM-yyyy')]).toDF(
                    ["col_name", "value", "format"])

I am currently trying to add a new column in which each of the dates from the column F.col("value"), which is a string value, is parsed to a date.

Separately for each format, this can be done with:

df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
        withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))

This, however, gives me two new columns, but I want a single column containing both results. Calling to_date with a column as the format argument does not seem to be possible:

df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))

This throws a "Column object is not callable" error.

Is it possible to have a generic approach for all possible formats (so that I do not have to manually add a new column for each format)?

Answer

You can use a column value as a parameter without a udf by using the spark-sql syntax:

Spark 2.2 and later

from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name|     value|    format|     test3|
#+--------+----------+----------+----------+
#|       a|2018-01-01|yyyy-MM-dd|2018-01-01|
#|       b|2018-02-02|yyyy-MM-dd|2018-02-02|
#|       c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+

Or equivalently, using pyspark-sql:

df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show() 

Spark 1.5 and later

Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:

from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()

Or equivalently, using pyspark-sql:

df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show() 
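
For completeness, the same per-row format logic can also be written with a udf, although it will be slower than the built-in expression because every row round-trips through Python. The following is only a sketch, not part of the original answer; it assumes the data contains just the two Java date patterns from the example, mapped by hand to their Python strptime equivalents:

from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

# Hand-written mapping from Java SimpleDateFormat patterns to Python
# strptime codes (assumption: only these two patterns occur in the data)
PATTERNS = {"yyyy-MM-dd": "%Y-%m-%d", "dd-MM-yyyy": "%d-%m-%Y"}

@udf(DateType())
def parse_date(value, fmt):
    # Look up the Python equivalent of this row's Java pattern and parse
    return datetime.strptime(value, PATTERNS[fmt]).date()

df.withColumn("test3", parse_date("value", "format")).show()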
