Parquet支持的Hive表:数组列在Impala中不可查询 [英] Parquet-backed Hive table: array column not queryable in Impala
问题描述
在我浏览文档时,我发现Impala预计数据将以Parquet格式存储。我的数据原始形式恰好是一个两列CSV,第一列是ID,第二列是管道分隔的字符串数组,例如:
123,ASDFG | SDFGH | DFGHJ | FGHJK
234,QWERT | WERTY | ERTYU
创建了一个Hive表:
CREATE TABLE`id_member_of`(
`id` INT,
`member_of` ARRAY< STRING>)
行格式限定
字段终止'''
收款项'|'
'\\\
'
作为INPUTFORMAT存储的行终止
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql .io.HiveIgnoreKeyTextOutputFormat';
原始数据已加载到Hive表中:
LOAD DATA LOCAL INPATH'raw_data.csv'INTO TABLE id_member_of;
创建表的Parquet版本:
CREATE TABLE`id_member_of_parquet`(
`id` STRING,
`member_of` ARRAY STRING>)
STORED AS PARQUET;
CSV支持表中的数据已插入到Parquet表中:
INSERT INTO id_member_of_parquet SELECT id,member_of FROM id_member_of;
现在可以在Hive中查询Parquet表:
hive> select * from id_member_of_parquet;
123 [ASDFG,SDFGH,DFGHJ,FGHJK]
234 [QWERT,WERTY,ERTYU]
奇怪的是,当我在Impala中查询相同的Parquet支持的表时,它不返回数组列:
[hadoop01:21000]>元数据无效;
[hadoop01:21000]> select * from id_member_of_parquet;
+ ----- +
| id |
+ ----- +
| 123 |
| 234 |
+ ----- +
问题:数组列发生了什么?你能看到我做错了什么吗?
结果非常简单:我们可以通过将数组添加到 FROM 带点,例如
查询:select * from id_member_of_parquet,id_member_of_parquet.member_of
+ ----- + ------- +
| id | item |
+ ----- + ------- +
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+ ----- + ------- +
Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDF's in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM
with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id | item |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
这篇关于Parquet支持的Hive表:数组列在Impala中不可查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!