Spark timestamp difference
Problem description
I am trying to do a timestamp difference in Spark and it is not working as expected.
This is how I am trying to do it:
import static org.apache.spark.sql.functions.*;

df = df.withColumn("TimeStampDiff",
    from_unixtime(
        unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss")
            .minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),
        "HH:mm:ss"));
Values:
TimeStampHigh - 15:57:01
TimeStampLow - 00:11:57
It returns me a result of 10:45:04
Expected output - 15:45:04
My other alternative is to go to a UDF with a Java implementation.
Any pointers will help.
Recommended answer
That's because of how from_unixtime is documented (emphasis mine):

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
Clearly your system or JVM is not configured to use UTC time.
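To see where the shift comes from: the subtraction itself yields the right number of seconds (56704, i.e. 15:45:04); it is the final from_unixtime call that renders that instant in the session/system time zone. A minimal spark-shell sketch of just that last step (the GMT-5 zone and the diffSeconds column name are assumptions chosen to reproduce the observed output):

scala> import org.apache.spark.sql.functions.from_unixtime
scala> spark.conf.set("spark.sql.session.timeZone", "GMT-5")
scala> Seq(56704L).toDF("diffSeconds").select(from_unixtime($"diffSeconds", "HH:mm:ss").as("formatted")).show
+---------+
|formatted|
+---------+
| 10:45:04|
+---------+

Under UTC the same 56704 seconds would format as 15:45:04, which is the expected result.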
You should do one of the following:
- Configure the JVM to use an appropriate time zone (-Duser.timezone=UTC for both spark.executor.extraJavaOptions and spark.driver.extraJavaOptions); see the sketch after this list.
- Set spark.sql.session.timeZone to an appropriate time zone.
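For the first option, the JVM time zone has to be in place before the driver and executor JVMs start, so it is usually supplied at launch time rather than from application code. A hedged sketch of the programmatic side (the application name is hypothetical; the builder call reliably covers only the executors, since the driver JVM is already running when this code executes, so the driver flag normally goes through spark-submit --conf or spark-defaults.conf):

import org.apache.spark.sql.SparkSession

// Sketch only: executor JVM options can be set when building the session.
// spark.driver.extraJavaOptions generally must be passed at launch instead,
// e.g. --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" on spark-submit,
// because the driver JVM has already started by the time this line runs.
val spark = SparkSession.builder()
  .appName("timestamp-diff")  // hypothetical application name
  .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
  .getOrCreate()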
Example of the second approach:
scala> val df = Seq(("15:57:01", "00:11:57")).toDF("TimeStampHigh", "TimeStampLow")
df: org.apache.spark.sql.DataFrame = [TimeStampHigh: string, TimeStampLow: string]
scala> spark.conf.set("spark.sql.session.timeZone", "GMT-5") // Equivalent to your current settings
scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
| 15:57:01| 00:11:57| 10:45:04|
+-------------+------------+-------------+
scala> spark.conf.set("spark.sql.session.timeZone", "UTC") // With UTC
scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
| 15:57:01| 00:11:57| 15:45:04|
+-------------+------------+-------------+