如何使用Hive使用单个hdfs路径创建n个外部表 [英] How to create n number of external tables with a single hdfs path using Hive

查看:225
本文介绍了如何使用Hive使用单个hdfs路径创建n个外部表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否可以使用Hive创建n个指向单个hdfs路径的外部表.如果可以,优势和局限性是什么?

解决方案

可以在HDFS中同一位置的顶部创建许多表(同时管理表和外部表).

在相同数据之上创建具有完全相同模式的表根本没有用,但是您可以使用例如RegexSerDe创建具有不同列数或具有不同分析列的不同表,因此您可以拥有不同的表这些表中的模式.您可以在Hive中对这些表具有不同的权限.还可以在其他一些表文件夹的子文件夹的顶部创建表,在这种情况下,它将包含数据的子集.最好在单个表中使用相同的分区.

缺点是它令人困惑,因为您可以使用多个表重写相同的数据,并且可能会意外删除该数据,以为该数据属于唯一的表,并且可以删除数据,因为您不需要该数据桌子了.

这是一些测试:

使用INT列创建表:

create table T(id int);
OK
Time taken: 1.033 seconds

检查位置和其他属性:

hive> describe formatted T;
OK
# col_name              data_type               comment

id                      int

# Detailed Table Information
Database:               my
Owner:                  myuser
CreateTime:             Fri Jan 04 04:45:03 PST 2019
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://myhdp/user/hive/warehouse/my.db/t
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1546605903

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.134 seconds, Fetched: 26 row(s)
                                                                                                  sts)

在同一位置的顶部创建第二个表,但带有STRING列:

hive> create table T2(id string) location 'hdfs://myhdp/user/hive/warehouse/my.db/t';
OK
Time taken: 0.029 seconds

插入数据:

hive> insert into table T values(1);
OK
Time taken: 33.266 seconds

检查数据:

hive> select * from T;
OK
1
Time taken: 3.314 seconds, Fetched: 1 row(s)

插入第二张表:

hive> insert into table T2 values( 'A');
OK
Time taken: 23.959 seconds

检查数据:

hive> select * from T2;
OK
1
A
Time taken: 0.073 seconds, Fetched: 2 row(s)

从第一个表中选择:

hive> select * from T;
OK
1
NULL
Time taken: 0.079 seconds, Fetched: 2 row(s)

字符串被选择为NULL,因为此表被定义为具有INT列.

现在将STRING插入第一个表(INT列):

insert into table T values( 'A');
OK
Time taken: 84.336 seconds

惊喜,它没有​​失败!

插入了什么?

hive> select * from T2;
OK
1
A
NULL
Time taken: 0.067 seconds, Fetched: 3 row(s)

插入了NULL,因为在之前的插入字符串期间将其转换为int并导致NULL

现在让我们尝试删除一个表并从另一个表中进行选择:

hive> drop table T;
OK
Time taken: 4.996 seconds
hive> select * from T2;
OK
Time taken: 6.978 seconds

返回了0行,因为第一个表已被管理,而放置表也删除了公共位置.

结束,

数据已删除,我们需要没有数据的T2表吗?

drop table T2;
OK

第二张表已删除,您看到的只是元数据.该表也得到了管理,drop table也应删除数据所在的位置,但是在HDFS中已经没有要删除的内容,仅删除了元数据.

Is it possible to create n number of external tables are pointing to a single hdfs path using Hive. If yes what are the advantages and its limitations.

解决方案

It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.

Creating tables with exactly the same schema on top of the same data is not useful at all, but you can create different tables with different number of columns for example or with differently parsed columns using RegexSerDe for example, so you can have different schemas in these tables. And you can have different permissions on these tables in Hive. Also table can be created on top of the sub-folder of some other tables folder, in this case it will contain a sub-set of data. Better use partitions in single table for the same.

And the drawback is that it is confusing because you can rewrite the same data using more than one table and also you may drop it accidentally, thinking this data belongs to the only table and you can drop data because you do not need that table any more.

And this is few tests:

Create table with INT column:

create table T(id int);
OK
Time taken: 1.033 seconds

Check location and other properties:

hive> describe formatted T;
OK
# col_name              data_type               comment

id                      int

# Detailed Table Information
Database:               my
Owner:                  myuser
CreateTime:             Fri Jan 04 04:45:03 PST 2019
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://myhdp/user/hive/warehouse/my.db/t
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1546605903

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.134 seconds, Fetched: 26 row(s)
                                                                                                  sts)

Create second table on top of the same location but with STRING column:

hive> create table T2(id string) location 'hdfs://myhdp/user/hive/warehouse/my.db/t';
OK
Time taken: 0.029 seconds

Insert data:

hive> insert into table T values(1);
OK
Time taken: 33.266 seconds

Check data:

hive> select * from T;
OK
1
Time taken: 3.314 seconds, Fetched: 1 row(s)

Insert into second table:

hive> insert into table T2 values( 'A');
OK
Time taken: 23.959 seconds

Check data:

hive> select * from T2;
OK
1
A
Time taken: 0.073 seconds, Fetched: 2 row(s)

Select from first table:

hive> select * from T;
OK
1
NULL
Time taken: 0.079 seconds, Fetched: 2 row(s)

String was selected as NULL because this table is defined as having INT column.

And now insert STRING into first table (INT column):

insert into table T values( 'A');
OK
Time taken: 84.336 seconds

Surprise, it is not failing!

What was inserted?

hive> select * from T2;
OK
1
A
NULL
Time taken: 0.067 seconds, Fetched: 3 row(s)

NULL was inserted, because during previous insert string was converted to int and this resulted in NULL

Now let's try to drop one table and select from another one:

hive> drop table T;
OK
Time taken: 4.996 seconds
hive> select * from T2;
OK
Time taken: 6.978 seconds

Returned 0 rows because first table was MANAGED and drop table also removed common location.

THE END,

data is removed, do We need T2 table without data in it?

drop table T2;
OK

Second table is removed, you see, it was metadata only. The table was also managed and drop table should remove the location with data also, but it's already nothing to remove in HDFS, only metadata was removed.

这篇关于如何使用Hive使用单个hdfs路径创建n个外部表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆