Changing the date format of the column values in a Spark dataframe
Question
I am reading an Excel sheet into a Dataframe in Spark 2.0 and then trying to convert some columns with date values in MM/DD/YY format into YYYY-MM-DD format. The values are in string format. Below is a sample:
+---------------+--------------+
|modified | created |
+---------------+--------------+
| null| 12/4/17 13:45|
| 2/20/18| 2/2/18 20:50|
| 3/20/18| 2/2/18 21:10|
| 2/20/18| 2/2/18 21:23|
| 2/28/18|12/12/17 15:42|
| 1/25/18| 11/9/17 13:10|
| 1/29/18| 12/6/17 10:07|
+---------------+--------------+
I would like to convert this to:
+---------------+-----------------+
|modified | created |
+---------------+-----------------+
| null| 2017-12-04 13:45|
| 2018-02-20| 2018-02-02 20:50|
| 2018-03-20| 2018-02-02 21:10|
| 2018-02-20| 2018-02-02 21:23|
| 2018-02-28| 2017-12-12 15:42|
| 2018-01-25| 2017-11-09 13:10|
| 2018-01-29| 2017-12-06 10:07|
+---------------+-----------------+
So I tried this:
df.withColumn("modified",date_format(col("modified"),"yyyy-MM-dd"))
.withColumn("created",to_utc_timestamp(col("created"),"America/New_York"))
But it gives me all NULL values in my result. I am not sure where I am going wrong. I know that to_utc_timestamp on created will convert the whole timestamp into UTC. Ideally, I would like to keep the time unchanged and only change the date format. Is there a way to achieve what I am trying to do? Where am I going wrong?
Any help would be appreciated. Thank you.
Answer
Spark >= 2.2.0
You need the additional to_date and to_timestamp built-in functions, as:
import org.apache.spark.sql.functions._
df.withColumn("modified",date_format(to_date(col("modified"), "MM/dd/yy"), "yyyy-MM-dd"))
.withColumn("created",to_utc_timestamp(to_timestamp(col("created"), "MM/dd/yy HH:mm"), "UTC"))
You should get:
+----------+-------------------+
|modified |created |
+----------+-------------------+
|null |2017-12-04 13:45:00|
|2018-02-20|2018-02-02 20:50:00|
|2018-03-20|2018-02-02 21:10:00|
|2018-02-20|2018-02-02 21:23:00|
|2018-02-28|2017-12-12 15:42:00|
|2018-01-25|2017-11-09 13:10:00|
|2018-01-29|2017-12-06 10:07:00|
+----------+-------------------+
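What makes this version work is the parse-then-format order: the string is first parsed into a real date/timestamp with to_date/to_timestamp, and only then reformatted; calling date_format directly on an unparsed MM/dd/yy string is what produced the NULLs. The same principle can be illustrated outside Spark with a minimal plain-Python sketch (note Python's strptime uses %-codes, whereas Spark patterns such as MM/dd/yy follow Java's SimpleDateFormat letters):

```python
from datetime import datetime

# Parse the raw string first (the step the failing query skipped)...
parsed = datetime.strptime("12/4/17 13:45", "%m/%d/%y %H:%M")

# ...then format the parsed value into the desired representation.
print(parsed.strftime("%Y-%m-%d %H:%M"))  # 2017-12-04 13:45
```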
Using the UTC timezone didn't alter the time for me.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val temp = df.withColumn("modified", from_unixtime(unix_timestamp(col("modified"), "MM/dd/yy"), "yyyy-MM-dd"))
  .withColumn("created", to_utc_timestamp(unix_timestamp(col("created"), "MM/dd/yy HH:mm").cast(TimestampType), "UTC"))
The output dataframe is the same as above.
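This pre-2.2 variant goes through epoch seconds instead: unix_timestamp parses the string into seconds since the epoch, and from_unixtime (or a cast to TimestampType) renders those seconds back in the target format. A minimal plain-Python sketch of that round trip (no Spark; pinned to UTC so the epoch arithmetic is unambiguous):

```python
from datetime import datetime, timezone

# Parse the date string into epoch seconds (roughly what unix_timestamp does)...
epoch = int(datetime.strptime("2/20/18", "%m/%d/%y")
            .replace(tzinfo=timezone.utc).timestamp())

# ...then render the epoch back in the target format (roughly what from_unixtime does).
print(datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d"))  # 2018-02-20
```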