How to sort a column with Date and time values in Spark?


Question

Note: I have this as a DataFrame in Spark. These time/date values constitute a single column in the DataFrame.

Input:

04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected output:

05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Answer

As this format is not standard, you need to use the unix_timestamp function to parse the string and convert it into a timestamp type:

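import org.apache.spark.sql.functions._
// Needed for toDF outside spark-shell; assumes a SparkSession named spark
import spark.implicits._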

// Example data
val df = Seq(
  Tuple1("04-NOV-16 03.36.13.000000000 PM"),
  Tuple1("06-NOV-15 03.42.21.000000000 PM"),
  Tuple1("05-NOV-15 03.32.05.000000000 PM"),
  Tuple1("06-NOV-15 03.32.14.000000000 AM")
).toDF("stringCol")

// Timestamp pattern found in string
val pattern = "dd-MMM-yy hh.mm.ss.S a"

// Creating new DataFrame and ordering
val newDF = df
  .withColumn("timestampCol", unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
  .orderBy("timestampCol")

newDF.show(false)

Result:

+-------------------------------+---------------------+
|stringCol                      |timestampCol         |
+-------------------------------+---------------------+
|05-NOV-15 03.32.05.000000000 PM|2015-11-05 15:32:05.0|
|06-NOV-15 03.32.14.000000000 AM|2015-11-06 03:32:14.0|
|06-NOV-15 03.42.21.000000000 PM|2015-11-06 15:42:21.0|
|04-NOV-16 03.36.13.000000000 PM|2016-11-04 15:36:13.0|
+-------------------------------+---------------------+
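
For comparison, here is a minimal sketch in plain Scala (no Spark needed) of why the raw strings cannot simply be sorted as text. The name `raw` is only illustrative. Lexicographic comparison starts at the day-of-month characters, so the 2016 row ends up first:

```scala
// The four sample values from the question, kept as plain strings
val raw = Seq(
  "04-NOV-16 03.36.13.000000000 PM",
  "06-NOV-15 03.42.21.000000000 PM",
  "05-NOV-15 03.32.05.000000000 PM",
  "06-NOV-15 03.32.14.000000000 AM"
)

// Plain string ordering compares character by character from the left
val lexicographic = raw.sorted

// The 2016 value sorts before every 2015 value because "04" < "05" < "06"
lexicographic.foreach(println)
```

This is why the answer converts the column to a real timestamp before ordering.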

More about unix_timestamp and other utility functions can be found in the org.apache.spark.sql.functions API documentation.

For building the timestamp format, one can refer to the SimpleDateFormat docs.
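
To make the pattern letters concrete, the following sketch parses one of the sample strings with java.text.SimpleDateFormat directly, outside Spark. The names `samplePattern`, `parser` and `rendered` are illustrative, and passing Locale.ENGLISH is an assumption added here so that "NOV" and "AM" parse regardless of the JVM's default locale:

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// dd  -> day of month               ("06")
// MMM -> abbreviated month name     ("NOV", matched case-insensitively)
// yy  -> two-digit year             ("15")
// hh  -> hour on a 12-hour clock, resolved using the AM/PM marker "a"
// mm, ss -> minutes and seconds
// S   -> the fraction digits, read by SimpleDateFormat as milliseconds
val samplePattern = "dd-MMM-yy hh.mm.ss.S a"

// Locale.ENGLISH is an assumption, not part of the original answer
val parser = new SimpleDateFormat(samplePattern, Locale.ENGLISH)
val parsed = parser.parse("06-NOV-15 03.32.14.000000000 AM")

// Render the parsed value in a 24-hour layout to check the result
val rendered = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ENGLISH).format(parsed)
println(rendered) // 2015-11-06 03:32:14
```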

Edit 1: as said by pheeleeppoo, you could order directly by the expression instead of creating a new column, assuming you want to keep only the string-typed column in your DataFrame:

val newDF = df.orderBy(unix_timestamp(df("stringCol"), pattern).cast("timestamp"))

Edit 2: Please note that the precision of the unix_timestamp function is in seconds, so if the milliseconds are really important, a UDF can be used:

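import java.text.SimpleDateFormat

// Builds a UDF that parses strings with the given SimpleDateFormat pattern
def myUDF(p: String) = udf(
  (value: String) => {
    // Created per call because SimpleDateFormat is not thread-safe
    val dateFormat = new SimpleDateFormat(p)
    val parsedDate = dateFormat.parse(value)
    new java.sql.Timestamp(parsedDate.getTime())
  }
)

val pattern = "dd-MMM-yy hh.mm.ss.S a"
val newDF = df.withColumn("timestampCol", myUDF(pattern)(df("stringCol"))).orderBy("timestampCol")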
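
One caveat with the UDF above: SimpleDateFormat reads the digits matched by S as a count of milliseconds, so a nine-digit fraction such as 123000000 would be treated as 123,000,000 ms rather than 0.123 s. The sample data only contains zeros, so it is unaffected. If non-zero fractions can occur, one possible alternative (a sketch, not part of the original answer) is java.time, which has a nanosecond-aware pattern. The names `nanoFormatter` and `parseNanos` are hypothetical:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import java.util.Locale

// Nine S letters mean "fraction of second, nine digits" in java.time.
// parseCaseInsensitive lets the upper-case "NOV" match the month name "Nov".
val nanoFormatter = new DateTimeFormatterBuilder()
  .parseCaseInsensitive()
  .appendPattern("dd-MMM-yy hh.mm.ss.SSSSSSSSS a")
  .toFormatter(Locale.ENGLISH)

// Hypothetical helper: parse one value into a java.sql.Timestamp
def parseNanos(value: String): java.sql.Timestamp =
  java.sql.Timestamp.valueOf(LocalDateTime.parse(value, nanoFormatter))

// A made-up input with a non-zero fraction, for illustration only
val ts = parseNanos("06-NOV-15 03.42.21.123000000 PM")
println(ts.getNanos) // 123000000
```

Within Spark, a function like this could be wrapped with udf in the same way as myUDF, building the formatter inside the function as that example does.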
