Hive: Does Hive support partitioning and bucketing while using external tables?


Question


When using the PARTITIONED BY or CLUSTERED BY keywords while creating Hive tables, Hive creates separate directories or files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that the data files corresponding to an external table are not managed by Hive. So does Hive still create additional files corresponding to each partition or bucket and move the corresponding data into them?
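(For context, a bucketed table is declared roughly like the following; the table and column names here are illustrative, not taken from the question:)

CREATE TABLE users_bucketed (id BIGINT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;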


Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide" - "Chapter 17: Hive":
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);


When we load data into a partitioned table, the partition values are specified explicitly:

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');


At the filesystem level, partitions are simply nested subdirectories of the table directory. After loading a few more files into the logs table, the directory structure might look like this:
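(A sketch of that layout, assuming the default warehouse directory /user/hive/warehouse; the file names are illustrative:)

/user/hive/warehouse/logs/
    dt=2001-01-01/
        country=GB/
            file1
            file2
        country=US/
            file3
    dt=2001-01-02/
        country=GB/
            file4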


The table above was obviously a managed table, so Hive had ownership of the data and created a directory structure for each partition, as in the tree structure above.

In case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
followed by the same set of load operations:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');


How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data file and fetches any query result by parsing that file. But when loading data into a partitioned external table, where are the partitions created?


Hopefully in the Hive warehouse? Can someone confirm or clarify this?
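(As an aside, one way to check where Hive actually registered the partitions is to query the metastore from the Hive shell; the commands below are standard HiveQL and only reuse the logs table from the extracts above. The Location field in the output of the second command shows the directory backing that partition.)

SHOW PARTITIONS logs;
DESCRIBE FORMATTED logs PARTITION (dt='2001-01-01', country='GB');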

Answer


There is an easy way to do this. Create your External Hive table first.

CREATE EXTERNAL TABLE database.table (
    id INT,
    name STRING
)
PARTITIONED BY (country STRING)
LOCATION 'xxxx';


Next you have to run an MSCK command (metastore consistency check):

MSCK REPAIR TABLE database.table;


This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.
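(One caveat worth noting: MSCK REPAIR TABLE only discovers subdirectories under the table's LOCATION that follow the key=value naming convention, e.g. country=GB. If the data sits in directories named differently, each partition can instead be mapped onto its directory manually; the path below is a hypothetical example:)

ALTER TABLE database.table ADD IF NOT EXISTS PARTITION (country='GB')
LOCATION '/data/mytable/country_gb';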

