Creating datetime from string column in PySpark


Question

Suppose I have the datetime column shown below. I want to convert the column from a string to a datetime type so that I can extract the month, day, year, and so on.

+---+------------+
|agg|    datetime|
+---+------------+
|  A|1/2/17 12:00|
|  B|        null|
|  C|1/4/17 15:00|
+---+------------+

I have tried the code below, but the values returned in the datetime column are all null, and I don't understand why.

df.select(df['datetime'].cast(DateType())).show()

I have also tried the following code:

df = df.withColumn('datetime2', from_unixtime(unix_timestamp(df['datetime']), 'dd/MM/yy HH:mm'))

But they both produce this dataframe:

+---+------------+---------+
|agg|    datetime|datetime2|
+---+------------+---------+
|  A|1/2/17 12:00|     null|
|  B|        null|     null|
|  C|1/4/17 15:00|     null|
+---+------------+---------+

I have already read and tried the solution in this post, to no avail: PySpark dataframe convert unusual string format to Timestamp

Recommended answer

// Imports for the functions used below
import org.apache.spark.sql.functions.{dayofmonth, from_unixtime, month, unix_timestamp, year}

// It is not clear whether the column is a datetime or a string;
// this assumes it is a string and converts it.
// unix_timestamp parses the string with the given pattern (day/month/two-digit year);
// from_unixtime then writes it to datetime2 as a yyyy-MM-dd HH:mm:ss string.
val df2 = df.withColumn("datetime2", from_unixtime(unix_timestamp(df("datetime"), "dd/MM/yy HH:mm")))

+---+------------+-------------------+
|agg|    datetime|          datetime2|
+---+------------+-------------------+
|  A|1/2/17 12:00|2017-02-01 12:00:00|
|  B|        null|               null|
|  C|1/4/17 15:00|2017-04-01 15:00:00|
+---+------------+-------------------+
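
The output above reads 1/2/17 as 1 February 2017, that is, day first. As a quick check outside Spark, the same reading can be reproduced with Python's standard library. This is only an illustration: strptime directives (%d, %m, %y) are not Spark pattern letters, and the variable names are made up for the example.

```python
from datetime import datetime

# Sample values copied from the question's datetime column (the null row is left out).
samples = ["1/2/17 12:00", "1/4/17 15:00"]

# Day first, then month, then a two-digit year, matching the dd/MM/yy HH:mm idea.
parsed = [datetime.strptime(s, "%d/%m/%y %H:%M") for s in samples]

for s, p in zip(samples, parsed):
    print(s, "->", p.isoformat(sep=" "))
```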


// Extract the month, year, and day
val df3 = df2.withColumn("month", month(df2("datetime2")))
  .withColumn("year", year(df2("datetime2")))
  .withColumn("day", dayofmonth(df2("datetime2")))

+---+------------+-------------------+-----+----+----+
|agg|    datetime|          datetime2|month|year| day|
+---+------------+-------------------+-----+----+----+
|  A|1/2/17 12:00|2017-02-01 12:00:00|    2|2017|   1|
|  B|        null|               null| null|null|null|
|  C|1/4/17 15:00|2017-04-01 15:00:00|    4|2017|   1|
+---+------------+-------------------+-----+----+----+

Thanks
