How can I partition a table with HIVE?


Question

I've been playing with Hive for a few days now, but I still have a hard time with partitioning.

I've been recording Apache logs (Combined format) in Hadoop for a few months. They are stored in raw text format, partitioned by date (via Flume): /logs/yyyy/mm/dd/hh/*

Example:

/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)

The date in the combined log file follows this format: [10/Feb/2012:00:00:00 -0800]

How can I create an external table with partitions in Hive that uses my physical partitioning? I can't find any good documentation on Hive partitioning. I found related questions such as:

If I load my logs into an external table with Hive, I cannot partition by time, since the date is not in the right format (Feb <=> 02). Even if it were in the right format, how would I transform the string "10/02/2012:00:00:00 -0800" into the directory path "/2012/02/10/00"?
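
(As a reference point, Hive's built-in date functions can express this kind of conversion. A minimal sketch, assuming the logs were already loaded into a table with a string column holding the timestamp; the table name raw_logs and column name request_time are hypothetical, not from the original post:)

-- Hypothetical sketch: build a yyyy/MM/dd/HH path component from a
-- Combined-format timestamp such as '10/Feb/2012:00:00:00 -0800'.
-- 'raw_logs' and 'request_time' are assumed names for illustration only.
SELECT from_unixtime(
         unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss Z'),
         'yyyy/MM/dd/HH') AS partition_path
FROM raw_logs;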

I could eventually use a Pig script to convert my raw logs into Hive tables, but at that point I should just be using Pig instead of Hive to do my reporting.

Solution

If I understand correctly, you have files in folders 4 levels deep under the logs directory. In that case, you define your table as external with the path 'logs', partitioned by 4 virtual fields: year, month, day_of_month, hour_of_day.

The partitioning is essentially done for you by Flume.

EDIT 3/9: A lot of the details depend on how exactly Flume writes files. But in general terms, your DDL should look something like this:

CREATE EXTERNAL TABLE table_name (fields ...)
PARTITIONED BY (log_year STRING, log_month STRING,
    log_day_of_month STRING, log_hour_of_day STRING)
-- row format description goes here
STORED AS TEXTFILE
LOCATION '/your user path/logs';
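
Once partitions are registered (see the note below), queries can prune data by the virtual partition columns instead of scanning every file. A minimal usage sketch against the table above:

-- Scan only the 2012/02/10 00:00 hour instead of the whole log history.
SELECT COUNT(*)
FROM table_name
WHERE log_year = '2012' AND log_month = '02'
  AND log_day_of_month = '10' AND log_hour_of_day = '00';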

EDIT 3/15: Per zzarbi's request, I'm adding a note that after the table is created, Hive needs to be informed about the partitions that were created. This needs to be repeated whenever Flume or another process creates new partitions. See my answer to the Create external with Partition question.
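
Because the Flume directories are not named in Hive's key=value convention, MSCK REPAIR TABLE will not discover them; each directory has to be registered explicitly. A minimal sketch, reusing the table and partition column names from the DDL above and one path from the question:

-- Register one Flume-written hour directory as a partition.
-- Repeat (or script) this for every new directory Flume creates.
ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (log_year = '2012', log_month = '02',
             log_day_of_month = '10', log_hour_of_day = '00')
  LOCATION '/logs/2012/02/10/00';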
