Convert string with nanoseconds into timestamp in Spark
Question
Is there a way to convert a timestamp value with nanoseconds to a timestamp in Spark? I get the input from a CSV file and the timestamp value is of the format 12-12-2015 14:09:36.992415+01:00. This is the code I tried.
val date_raw_data = List((1, "12-12-2015 14:09:36.992415+01:00"))
val dateraw_df = sc.parallelize(date_raw_data).toDF("ID", "TIMESTAMP_VALUE")
val ts = unix_timestamp($"TIMESTAMP_VALUE", "MM-dd-yyyy HH:mm:ss.ffffffz").cast("double").cast("timestamp")
val date_df = dateraw_df.withColumn("TIMESTAMP_CONV", ts).show(false)
The output is:
+---+--------------------------------+--------------+
|ID |TIMESTAMP_VALUE                 |TIMESTAMP_CONV|
+---+--------------------------------+--------------+
|1  |12-12-2015 14:09:36.992415+01:00|null          |
+---+--------------------------------+--------------+
I was able to convert a timestamp with milliseconds using the format MM-dd-yyyy HH:mm:ss.SSS. The trouble is with the nanosecond and timezone parts.
Answer
unix_timestamp won't do here. Even if you could parse the string (AFAIK SimpleDateFormat doesn't provide the required format), unix_timestamp has only second precision (emphasis mine):
def unix_timestamp(s: Column, p: String): Column
Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail.
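To see that limitation concretely, here is a minimal sketch (assuming a local SparkSession with spark.implicits._ in scope): even when the pattern parses the fractional part, it is discarded.

```scala
import org.apache.spark.sql.functions.unix_timestamp

// unix_timestamp returns whole epoch seconds as a long, so the .992
// parsed by the SSS field is silently dropped; casting back to a
// timestamp yields a value ending in .000, never .992.
Seq("12-12-2015 14:09:36.992").toDF("t")
  .select(unix_timestamp($"t", "MM-dd-yyyy HH:mm:ss.SSS").cast("timestamp"))
  .show(false)
```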
You have to create your own function to parse this data. A rough idea:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

def to_nano(c: Column) = {
  val r = "([0-9]{2}-[0-9]{2}-[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2})(\\.[0-9]*)(.*)$"
  // seconds part: drop the fraction, keep the zone offset,
  // and parse to epoch seconds (note yyyy, not the week-year YYYY)
  (unix_timestamp(
    concat(
      regexp_extract(c, r, 1),
      regexp_extract(c, r, 3)
    ), "MM-dd-yyyy HH:mm:ssXXX"
  ).cast("decimal(38, 9)") +
  // subsecond part: ".992415" cast to a decimal fraction
  regexp_extract(c, r, 2).cast("decimal(38, 9)")).alias("value")
}

Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(to_nano($"TIMESTAMP_VALUE").cast("timestamp"))
  .show(false)

// +--------------------------+
// |value                     |
// +--------------------------+
// |2015-12-12 14:09:36.992415|
// +--------------------------+
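An alternative sketch (my addition, not part of the original answer) is to parse the string with java.time inside a UDF; the pattern and UDF name below are assumptions based on the sample input. Note that Spark's TimestampType stores microseconds, so the six fractional digits here survive, but true nanosecond precision beyond six digits would still be truncated.

```scala
import java.sql.Timestamp
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf

// Hypothetical UDF: the formatter is built inside the closure so the
// non-serializable DateTimeFormatter never has to be shipped to executors.
val parse_offset_ts = udf { s: String =>
  val fmt = DateTimeFormatter.ofPattern("MM-dd-yyyy HH:mm:ss.SSSSSSXXX")
  Timestamp.from(OffsetDateTime.parse(s, fmt).toInstant)
}

Seq("12-12-2015 14:09:36.992415+01:00").toDF("TIMESTAMP_VALUE")
  .select(parse_offset_ts($"TIMESTAMP_VALUE").alias("value"))
  .show(false)
```

Because OffsetDateTime carries the +01:00 offset, the conversion to an Instant (and hence the stored timestamp) is unambiguous regardless of the session timezone.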