Changing the date format of the column values in a Spark dataframe


Problem description

I am reading an Excel sheet into a Dataframe in Spark 2.0 and then trying to convert some columns with date values in MM/DD/YY format into YYYY-MM-DD format. The values are strings. Below is a sample:

+---------------+--------------+
|modified       |      created |
+---------------+--------------+
|           null| 12/4/17 13:45|
|        2/20/18|  2/2/18 20:50|
|        3/20/18|  2/2/18 21:10|
|        2/20/18|  2/2/18 21:23|
|        2/28/18|12/12/17 15:42| 
|        1/25/18| 11/9/17 13:10|
|        1/29/18| 12/6/17 10:07| 
+---------------+--------------+

I would like to convert it to:

+---------------+-----------------+
|modified       |      created    |
+---------------+-----------------+
|           null| 2017-12-04 13:45|
|     2018-02-20| 2018-02-02 20:50|
|     2018-03-20| 2018-02-02 21:10|
|     2018-02-20| 2018-02-02 21:23|
|     2018-02-28| 2017-12-12 15:42| 
|     2018-01-25| 2017-11-09 13:10|
|     2018-01-29| 2017-12-06 10:07| 
+---------------+-----------------+

So I tried:

 df.withColumn("modified",date_format(col("modified"),"yyyy-MM-dd"))
   .withColumn("created",to_utc_timestamp(col("created"),"America/New_York"))

But it gives me all NULL values in the result, and I am not sure where I am going wrong. I know that to_utc_timestamp on created will convert the whole timestamp into UTC. Ideally I would like to keep the time unchanged and only change the date format. Is there a way to achieve this, and where am I going wrong?

Any help would be appreciated. Thank you.

Recommended answer

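Spark >= 2.2.0

The values are strings in a non-ISO format, so date_format and to_utc_timestamp cannot cast them to a timestamp implicitly and return null. Parse them first with the built-in to_date and to_timestamp functions, passing the input pattern: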

import org.apache.spark.sql.functions._
df.withColumn("modified",date_format(to_date(col("modified"), "MM/dd/yy"), "yyyy-MM-dd"))
  .withColumn("created",to_utc_timestamp(to_timestamp(col("created"), "MM/dd/yy HH:mm"), "UTC"))

You should get:

+----------+-------------------+
|modified  |created            |
+----------+-------------------+
|null      |2017-12-04 13:45:00|
|2018-02-20|2018-02-02 20:50:00|
|2018-03-20|2018-02-02 21:10:00|
|2018-02-20|2018-02-02 21:23:00|
|2018-02-28|2017-12-12 15:42:00|
|2018-01-25|2017-11-09 13:10:00|
|2018-01-29|2017-12-06 10:07:00|
+----------+-------------------+

Using the UTC time zone didn't alter the time for me.
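The result above leaves created as a timestamp with seconds, while the question asks for strings such as 2017-12-04 13:45. One way to get that layout is to wrap to_timestamp in date_format instead of to_utc_timestamp. The following is a minimal sketch, not part of the original answer: the object name DateFormatSketch, the helper reformat, the local session, and the two sample rows are made up for illustration. It also assumes the single-letter patterns M/d/yy and H:mm, because Spark 3.x parses patterns more strictly than Spark 2.x and may reject single-digit months or days under MM/dd.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, to_date, to_timestamp}

// Hypothetical helper object, for illustration only.
object DateFormatSketch {

  // Parse the M/d/yy strings, then render them back as strings in the target layout.
  // No time zone conversion is involved, so the wall-clock time is kept as-is.
  def reformat(df: DataFrame): DataFrame =
    df.withColumn("modified",
        date_format(to_date(col("modified"), "M/d/yy"), "yyyy-MM-dd"))
      .withColumn("created",
        date_format(to_timestamp(col("created"), "M/d/yy H:mm"), "yyyy-MM-dd HH:mm"))

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; in the question the dataframe comes from an Excel sheet.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("date-format-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two sample rows taken from the question's table; None stands for the null value.
    val rows: Seq[(Option[String], String)] = Seq(
      (None, "12/4/17 13:45"),
      (Some("2/20/18"), "2/2/18 20:50")
    )

    reformat(rows.toDF("modified", "created")).show(false)
    spark.stop()
  }
}
```

Both columns come out as strings here. If a real timestamp column is needed for later date arithmetic, keep the to_timestamp result and apply date_format only when writing or displaying the data.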

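On Spark versions before 2.2.0, where to_date does not accept a format argument and to_timestamp is not available, unix_timestamp and from_unixtime can be used instead:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType  // needed for the cast below
val temp = df.withColumn("modified", from_unixtime(unix_timestamp(col("modified"), "MM/dd/yy"), "yyyy-MM-dd"))
  .withColumn("created", to_utc_timestamp(unix_timestamp(col("created"), "MM/dd/yy HH:mm").cast(TimestampType), "UTC"))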

The output dataframe is the same as above.
