TIMESTAMP not behaving as intended with Parquet in Hive


Problem description

I have Parquet data which, when read (including the TIMESTAMP column) using Spark, works perfectly fine. Below are the sample records:

scala> data.select($"BATCH_LOAD_DATE").limit(5).show()

+-------------------+
|    BATCH_LOAD_DATE|
+-------------------+
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
+-------------------+   

I checked the schema and it shows TIMESTAMP as the data type for that column, which is correct. So there is no issue with the schema either. However, when I try to read the same TIMESTAMP column via Hive, it throws the exception below:

Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2 (state=,code=0)   

I can see from this link that it is an open issue in Hive. I am not sure whether it has been resolved yet. Is there any workaround for this? Something that can be done while loading the data, or some transformation once it is loaded?

Recommended answer

I figured out an alternative to my own problem. I changed the column type of the TIMESTAMP column to STRING, and while fetching the data I used the from_unixtime method to cast that particular column to the intended date format, which let me fetch it.
The problem here, however, was that if my date value was 2020-02-27 15:40:22, fetching this column via Hive returned the epoch value instead, i.e. 15340232000000.
So, I solved this problem in Hive with the query below:

select *, from_unixtime(cast(SOURCE_LOAD_DATE as BIGINT) DIV 1000000) as SOURCE_LOAD_DATE from table_name;   

Using the above query, I was able to get the proper date with the timestamp value.
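As a sanity check of the arithmetic (a sketch outside Hive, assuming the raw values are epoch microseconds, which is what the DIV 1000000 in the query implies), the conversion can be reproduced in plain Python with a hypothetical raw value:

```python
# Sketch only: reproduces the unit conversion the Hive query performs.
# Assumption: the Parquet INT64 values are epoch MICROseconds, so integer
# division by 1,000,000 yields epoch seconds (same as Hive's DIV 1000000).
from datetime import datetime, timezone

epoch_micros = 1582818022000000  # hypothetical raw value stored in the column
epoch_seconds = epoch_micros // 1_000_000  # equivalent of DIV 1000000

ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # → 2020-02-27 15:40:22
```

If the stored values were epoch milliseconds rather than microseconds, the divisor would be 1000 instead; checking one known row as above is a quick way to confirm the unit.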

Note: You will need to cast every column which has timestamp data.
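For instance, with two such columns (BATCH_LOAD_DATE from the sample above and SOURCE_LOAD_DATE from the query), the same cast chain applies to each one. A sketch, assuming both STRING columns hold epoch microseconds:

```sql
-- Sketch only: apply the cast + DIV 1000000 + from_unixtime chain
-- to every STRING column that actually holds timestamp data.
SELECT
  from_unixtime(CAST(BATCH_LOAD_DATE  AS BIGINT) DIV 1000000) AS BATCH_LOAD_DATE,
  from_unixtime(CAST(SOURCE_LOAD_DATE AS BIGINT) DIV 1000000) AS SOURCE_LOAD_DATE
FROM table_name;
```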

This is the only trick I could think of. I hope this helps you or others!
