TIMESTAMP not behaving as intended with Parquet in Hive


Problem description

I have Parquet data which, when read (including the TIMESTAMP column) using Spark, works perfectly fine. Below are sample records:

scala> data.select($"BATCH_LOAD_DATE").limit(5).show()

+-------------------+
|    BATCH_LOAD_DATE|
+-------------------+
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
+-------------------+   

I checked the schema and it shows TIMESTAMP as the data type for the column in question, which is correct, so there is no issue with the schema either. However, when I try to read the same TIMESTAMP column via Hive, it throws the exception below:

Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2 (state=,code=0)   

I can see from this link that it is an open issue in Hive, and I am not sure whether it has been resolved yet. Is there any workaround? Something that can be done while loading the data, or some transformation once it is loaded?

Answer

I figured out an alternative to my own problem. I changed the column type of the TIMESTAMP column to STRING, and while fetching the data I used the from_unixtime method to cast that particular column to the intended date format, and was able to fetch it.
However, the problem here was that if my date value was 2020-02-27 15:40:22, fetching this column via Hive returned the epoch value instead, i.e. 15340232000000.
So I solved this problem in Hive with the query below:

select *, from_unixtime(cast(SOURCE_LOAD_DATE as BIGINT) DIV 1000000) as SOURCE_LOAD_DATE from table_name;   

Using the above query I was able to get the proper date with the timestamp value.
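The arithmetic in the query simply truncates an epoch value stored in microseconds down to seconds before converting it to a date string. A minimal sketch of the same conversion in Python (the epoch value here is my own hypothetical example corresponding to 2020-02-27 15:40:22 UTC, since the value quoted in the answer does not parse cleanly):

```python
from datetime import datetime, timezone

# Hypothetical raw value as Hive might return it for the STRING column:
# epoch time in microseconds.
epoch_micros = 1582818022000000

# Mirror the Hive expression: from_unixtime(cast(col as BIGINT) DIV 1000000)
epoch_seconds = epoch_micros // 1_000_000
ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-02-27 15:40:22
```

If your values turn out to be stored in milliseconds rather than microseconds, the divisor would be 1000 instead of 1000000; check the magnitude of the raw numbers first.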

Note: you will need to cast every column which has timestamp data this way.

This is the only trick I could think of. I hope this might help you or others!
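If the Parquet files can be rewritten, another option (my own assumption, not part of the original answer) is to fix this on the write side: the cast error typically arises because Spark writes timestamps as Parquet INT64, while the Hive reader expects the legacy INT96 encoding. Spark 2.3+ exposes the `spark.sql.parquet.outputTimestampType` setting to control this. A sketch, with hypothetical paths:

```python
# Sketch only: requires a Spark installation; paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("int96-timestamps")
    # Write timestamps as INT96, the encoding older Hive readers expect.
    .config("spark.sql.parquet.outputTimestampType", "INT96")
    .getOrCreate()
)

df = spark.read.parquet("/path/to/source")
df.write.mode("overwrite").parquet("/path/to/hive/table/location")
```

This avoids the STRING workaround entirely, but it is only practical if regenerating the data is an option.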
