Convert string with nanosecond into timestamp in Spark


Question


Is there a way to convert a timestamp value with nanoseconds to a timestamp in Spark? I get the input from a CSV file, and the timestamp value is of the format 12-12-2015 14:09:36.992415+01:00. This is the code I tried:

import org.apache.spark.sql.functions._
import spark.implicits._  // spark-shell: provides $"..." and .toDF

val date_raw_data = List((1, "12-12-2015 14:09:36.992415+01:00"))

val dateraw_df = sc.parallelize(date_raw_data).toDF("ID", "TIMESTAMP_VALUE")

// "ffffff" is not a valid SimpleDateFormat fraction pattern, so the parse
// fails and the result is null
val ts = unix_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.ffffffz").cast("double").cast("timestamp")

val date_df = dateraw_df.withColumn("TIMESTAMP_CONV", ts).show(false)

The output is:

+---+-----------------------+---------------------+
|ID |TIMESTAMP_VALUE        |TIMESTAMP_CONV       |
+---+-----------------------+---------------------+
|1  |12-12-2015 14:09:36.992|null                 |
+---+-----------------------+---------------------+


I was able to convert a timestamp with milliseconds using the format MM-dd-yyyy HH:mm:ss.SSS. The trouble is with the nanosecond and timezone formats.
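For reference, that working millisecond variant was presumably along these lines (the question doesn't show it, so the trimmed input is an assumption on my part):

// assumed sketch, using the imports above: input trimmed to three
// fractional digits and no zone offset, so the SSS pattern matches
val ts_ms = unix_timestamp(lit("12-12-2015 14:09:36.992"), "MM-dd-yyyy HH:mm:ss.SSS")
  .cast("double").cast("timestamp")
// yields a non-null timestamp, but only to whole seconds: the .992 is gone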

Answer


unix_timestamp won't do here. Even if you could parse the string (AFAIK SimpleDateFormat doesn't provide the required formats), unix_timestamp has only second precision (emphasis mine):

def unix_timestamp(s: Column, p: String): Column


Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail.
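
To make the second-precision point concrete, here is a small sketch (mine, not part of the quoted docs): even when the pattern matches, the function returns a whole number of seconds as a Long, so the fraction is silently dropped.

Seq("12-12-2015 14:09:36.992").toDF("v")
  .select(unix_timestamp($"v", "MM-dd-yyyy HH:mm:ss.SSS").alias("seconds"))
  .show(false)
// prints a whole number of seconds (its exact value depends on the session
// time zone used for parsing); the .992 does not survive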


You have to create your own function to parse this data. A rough idea:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

def to_nano(c: Column) = {
  // group 1: date and time down to whole seconds,
  // group 2: fractional seconds (e.g. ".992415"),
  // group 3: zone offset (e.g. "+01:00")
  val r = "([0-9]{2}-[0-9]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2})(\\.[0-9]*)(.*)$"
  // seconds part: glue date-time and offset back together (dropping the
  // fraction) and parse with second precision; XXX matches the +01:00 offset
  (unix_timestamp(
    concat(
      regexp_extract(c, r, 1),
      regexp_extract(c, r, 3)
    ), "MM-dd-yyyy HH:mm:ssXXX"
  ).cast("decimal(38, 9)") +
  // subsecond part: ".992415" casts to the decimal 0.992415; add it back
  regexp_extract(c, r, 2).cast("decimal(38, 9)")).alias("value")
}

Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_nano($"TIMESTAMP_COLUMN").cast("timestamp"))
  .show(false)

// +--------------------------+
// |value                     |
// +--------------------------+
// |2014-12-28 14:09:36.992415|
// +--------------------------+
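
As an aside that goes beyond the original answer: Spark 3.x replaced SimpleDateFormat with java.time DateTimeFormatter-style patterns, so to_timestamp can parse the fractional digits and the offset directly. A minimal sketch, assuming Spark 3.x (and note that Spark's TimestampType stores microseconds, so true nanosecond precision cannot be kept either way):

Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.SSSSSSXXX").alias("value"))
  .show(false)
// +--------------------------+
// |value                     |
// +--------------------------+
// |2015-12-12 14:09:36.992415|
// +--------------------------+
// (again displayed in the session time zone; UTC+01:00 shown)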
