Impala timestamps don't match Hive - a timezone issue?

Problem description

I have some eventlog data in HDFS that, in its raw format, looks like this:

2015-11-05 19:36:25.764 INFO    [...etc...]

An external table points to this HDFS location:

CREATE EXTERNAL TABLE `log_stage`(
  `event_time` timestamp, 
  [...])
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

For performance, we'd like to query this in Impala. The log_stage data is inserted into a Hive/Impala Parquet-backed table by executing a Hive query: INSERT INTO TABLE log SELECT * FROM log_stage. Here's the DDL for the Parquet table:

CREATE TABLE `log`(
  `event_time` timestamp,
  [...])
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

The problem: when queried in Impala, the timestamps are 7 hours ahead:

Hive time:   2015-11-05 19:36:25.764
Impala time: 2015-11-06 02:36:25.764

> as.POSIXct("2015-11-06 02:36:25") - as.POSIXct("2015-11-05 19:36:25")
Time difference of 7 hours

Note: The timezone on all of the servers (from /etc/sysconfig/clock) is set to "America/Denver", which is currently 7 hours behind UTC.

It seems that Impala is taking events that are already in UTC, incorrectly assuming they're in America/Denver time, and adding another 7 hours.
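
The offset arithmetic can be checked outside the cluster. The following is a minimal sketch in Python (not part of the original setup), assuming Python 3.9+ with the standard zoneinfo module and a system timezone database. The variable names are illustrative only. It interprets the Hive value as America/Denver local time and expresses it in UTC, which lands exactly on the value Impala returns:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Server timezone from /etc/sysconfig/clock, as stated in the question.
denver = ZoneInfo("America/Denver")

# The two values reported in the question (naive, no timezone attached).
hive_time = datetime(2015, 11, 5, 19, 36, 25, 764000)
impala_time = datetime(2015, 11, 6, 2, 36, 25, 764000)

# UTC offset of America/Denver at the event time.
offset = hive_time.replace(tzinfo=denver).utcoffset()

# Treat the Hive value as Denver local time and convert it to UTC,
# then drop the tzinfo so it can be compared with the naive Impala value.
as_utc = (
    hive_time.replace(tzinfo=denver)
    .astimezone(timezone.utc)
    .replace(tzinfo=None)
)

print("offset in hours:", offset.total_seconds() / 3600)
print("Hive value in UTC equals Impala value:", as_utc == impala_time)
```

If the comparison holds, the Impala value is simply the UTC representation of the local time that Hive displays, rather than a value that has been shifted twice.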

Do you know how to sync the times so that the Impala table matches the Hive table?

Solution

Hive writes timestamps to Parquet differently. You can use the impalad flag -convert_legacy_hive_parquet_utc_timestamps to tell Impala to do the conversion on read. See the TIMESTAMP documentation for more details.

This blog post has a brief description of the issue:

When Hive stores a timestamp value in Parquet format, it converts local time into UTC, and when it reads the data back out, it converts to local time again. Impala, on the other hand, does no conversion when it reads the timestamp field, so UTC time is returned instead of local time.
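
The behaviour described above can be modelled in a few lines. This is a toy sketch in Python, not Hive or Impala code: the function names hive_write, hive_read and impala_read are hypothetical, and the convert_legacy parameter merely stands in for the impalad flag -convert_legacy_hive_parquet_utc_timestamps. It assumes Python 3.9+ with the zoneinfo module and the America/Denver server timezone from the question.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Server-local timezone, taken from the question (an assumption of this model).
LOCAL_TZ = ZoneInfo("America/Denver")


def hive_write(local_value: datetime) -> datetime:
    """Model of Hive's Parquet write: local time in, UTC stored."""
    aware = local_value.replace(tzinfo=LOCAL_TZ)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


def hive_read(stored: datetime) -> datetime:
    """Model of Hive's Parquet read: stored UTC converted back to local time."""
    aware = stored.replace(tzinfo=timezone.utc)
    return aware.astimezone(LOCAL_TZ).replace(tzinfo=None)


def impala_read(stored: datetime, convert_legacy: bool = False) -> datetime:
    """Model of Impala's read: raw stored value unless conversion is enabled."""
    return hive_read(stored) if convert_legacy else stored


event = datetime(2015, 11, 5, 19, 36, 25, 764000)  # value from the raw log
stored = hive_write(event)

print("Hive read:            ", hive_read(stored))
print("Impala read (default):", impala_read(stored))
print("Impala read (flag on):", impala_read(stored, convert_legacy=True))
```

In this model the default Impala read returns the stored UTC value, which is 7 hours ahead of what Hive shows, and enabling the conversion brings the two back into agreement.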

The impalad flag tells Impala to do the conversion when reading timestamps in Parquet produced by Hive. It does incur some small cost, so you should consider writing your timestamps with Impala if that is an issue for you (though it likely is minimal).
