Spark timestamp difference


Problem description

I am trying to do a timestamp difference in Spark and it is not working as expected.

Here is what I am trying:

import static org.apache.spark.sql.functions.*;

df = df.withColumn("TimeStampDiff",
    from_unixtime(
        unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss")
            .minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),
        "HH:mm:ss"));

TimeStampHigh - 15:57:01
TimeStampLow - 00:11:57

It returns a result of 10:45:04; the expected output is 15:45:04.

My other alternative is to go to a UDF with a Java implementation.

Any pointers will help.

Recommended answer

That's because from_unixtime (emphasis mine):

    Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the *current system time zone*, in the given format.

Clearly your system or JVM is not configured to use UTC time.
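Concretely: parsed with pattern HH:mm:ss, 15:57:01 is 57,421 seconds and 00:11:57 is 717 seconds, so the difference is 56,704 seconds, i.e. 15:45:04. from_unixtime then treats 56,704 as an instant 56,704 seconds past the epoch (1970-01-01 15:45:04 UTC) and renders it in the current time zone; in GMT-5 that prints as 10:45:04.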

You should do one of the following:

  • Configure the JVM to use an appropriate time zone (-Duser.timezone=UTC in both spark.executor.extraJavaOptions and spark.driver.extraJavaOptions).
  • Set spark.sql.session.timeZone to an appropriate time zone (as in the sketch below).
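
As a sketch, both options can be set when the session is built (assuming Spark 2.2+, where spark.sql.session.timeZone is available; UTC here is a placeholder for whatever zone your data expects):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // Session-level zone: respected by unix_timestamp/from_unixtime.
  .config("spark.sql.session.timeZone", "UTC")
  // Executor JVM default zone. Driver JVM options generally have to be
  // passed before the driver starts, e.g. via spark-submit
  // --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC".
  .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
  .getOrCreate()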

Example:

scala> val df = Seq(("15:57:01", "00:11:57")).toDF("TimeStampHigh", "TimeStampLow")
df: org.apache.spark.sql.DataFrame = [TimeStampHigh: string, TimeStampLow: string]

scala> spark.conf.set("spark.sql.session.timeZone", "GMT-5")  // Equivalent to your current settings

scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
|     15:57:01|    00:11:57|     10:45:04|
+-------------+------------+-------------+


scala> spark.conf.set("spark.sql.session.timeZone", "UTC")  // With UTC

scala> df.withColumn("TimeStampDiff", from_unixtime(unix_timestamp(df.col("TimeStampHigh"), "HH:mm:ss").minus(unix_timestamp(df.col("TimeStampLow"), "HH:mm:ss")),"HH:mm:ss")).show
+-------------+------------+-------------+
|TimeStampHigh|TimeStampLow|TimeStampDiff|
+-------------+------------+-------------+
|     15:57:01|    00:11:57|     15:45:04|
+-------------+------------+-------------+
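
If changing time-zone settings is not an option, a time-zone-independent route (instead of the Java UDF mentioned in the question) is to keep the difference as plain seconds and format it arithmetically, since from_unixtime is the only zone-sensitive step. A minimal sketch, reusing the columns from the example above:

import org.apache.spark.sql.functions._

// The two parse-time zone offsets cancel when subtracting, so the
// difference in seconds is zone-independent.
// Assumes TimeStampHigh >= TimeStampLow.
val diffSecs = unix_timestamp(col("TimeStampHigh"), "HH:mm:ss") -
  unix_timestamp(col("TimeStampLow"), "HH:mm:ss")

// Format HH:mm:ss by hand instead of via from_unixtime.
df.withColumn("TimeStampDiff",
  format_string("%02d:%02d:%02d",
    (diffSecs / 3600).cast("int"),
    (diffSecs % 3600 / 60).cast("int"),
    (diffSecs % 60).cast("int"))).show

With the sample values this should yield 15:45:04 regardless of the session zone (assuming a fixed-offset zone).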
