Parquet 支持的 Hive 表:数组列在 Impala 中不可查询 [英] Parquet-backed Hive table: array column not queryable in Impala

查看:16
本文介绍了Parquet 支持的 Hive 表:数组列在 Impala 中不可查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

尽管 Impala 比 Hive 快得多,但我们使用 Hive 是因为它支持复杂(嵌套)数据类型,例如数组和映射.

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.

我注意到 Impala,从 CDH5.5,现在支持复杂数据类型.由于也可以在 Impala 中运行 Hive UDF,我们可能可以在 Impala 中做我们想做的一切,但要快得多.这是个好消息!

I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDF's in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!

当我浏览文档时,我看到 Impala 期望数据以 Parquet 格式存储.我的原始数据恰好是一个两列的 CSV,其中第一列是 ID,第二列是管道分隔的字符串数组,例如:

As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:

123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU

创建了一个 Hive 表:

A Hive table was created:

CREATE TABLE `id_member_of`(
  `id` INT, 
  `member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '|' 
  LINES TERMINATED BY '
' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

原始数据已加载到 Hive 表中:

The raw data was loaded into the Hive table:

LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;

该表的 Parquet 版本已创建:

A Parquet version of the table was created:

CREATE TABLE `id_member_of_parquet` (
 `id` STRING, 
 `member_of` ARRAY<STRING>) 
STORED AS PARQUET;

CSV 支持的表中的数据被插入到 Parquet 表中:

The data from the CSV-backed table was inserted into the Parquet table:

INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;

现在可以在 Hive 中查询 Parquet 表:

And the Parquet table is now queryable in Hive:

hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]

奇怪的是,当我在 Impala 中查询同一个 Parquet-backed 表时,它没有返回数组列:

Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:

[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id  |
+-----+
| 123 |
| 234 |
+-----+

问题:数组列发生了什么变化?你能看出我做错了什么吗?

Question: What happened to the array column? Can you see what I'm doing wrong?

推荐答案

结果很简单:我们可以通过将数组添加到 FROM 中并用一个点来访问数组,例如

It turned out to be really simple: we can access the array by adding it to the FROM with a dot, e.g.

Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id  | item  |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+

这篇关于Parquet 支持的 Hive 表:数组列在 Impala 中不可查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆