How can I partition a table with HIVE?


Question

I've been playing with Hive for a few days now, but I still have a hard time with partitions.

I've been recording Apache logs (combined format) in Hadoop for a few months. They are stored in raw text format, partitioned by date (via Flume): /logs/yyyy/mm/dd/hh/*

Example:

/logs/2012/02/10/00/Part01xx (02/10/2012 12:00 am)
/logs/2012/02/10/00/Part02xx
/logs/2012/02/10/13/Part0xxx (02/10/2012 01:00 pm)

The date in the combined log file follows this format: [10/Feb/2012:00:00:00 -0800]

How can I create an external table with partitions in Hive that uses my physical partitioning? I can't find any good documentation on Hive partitioning. I found related questions such as:

If I load my logs into an external table with Hive, I cannot partition by time, since the dates are not in a good format (Feb <=> 02). And even if they were in a good format, how would I transform a string like "10/02/2012:00:00:00 -0800" into nested directories like "/2012/02/10/00"?

I could eventually use a Pig script to convert my raw logs into Hive tables, but at that point I should just be using Pig instead of Hive for my reporting.

Answer

If I understand correctly, you have files in folders four levels deep under the logs directory. In that case, you define your table as external with the path 'logs' and partition it by 4 virtual fields: year, month, day_of_month, hour_of_day.

The partitioning is essentially done for you by Flume.

EDIT 3/9: A lot of the details depend on how exactly Flume writes the files, but in general terms your DDL should look something like this:

CREATE EXTERNAL TABLE table_name (fields...)
PARTITIONED BY (log_year STRING, log_month STRING,
    log_day_of_month STRING, log_hour_of_day STRING)
-- format description (ROW FORMAT / SerDe clause) goes here
STORED AS TEXTFILE
LOCATION '/your user path/logs';
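
If the logs are in the standard Apache combined format, one way to fill in the field list and format description above (a sketch of my own, not part of the original answer; the table name, column names, and regex are assumptions) is Hive's RegexSerDe:

-- RegexSerDe maps each log line to columns via regex capture groups;
-- all columns must be STRING. On older Hive releases the class is
-- org.apache.hadoop.hive.contrib.serde2.RegexSerDe (hive-contrib jar).
CREATE EXTERNAL TABLE apache_logs (
  host STRING,
  identity STRING,
  remote_user STRING,
  request_time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
PARTITIONED BY (log_year STRING, log_month STRING,
    log_day_of_month STRING, log_hour_of_day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION '/logs';

With this layout the request_time column still holds the raw [10/Feb/2012:00:00:00 -0800] string; for reporting it can be parsed in place, e.g. with unix_timestamp(regexp_replace(request_time, '\\[|\\]', ''), 'dd/MMM/yyyy:HH:mm:ss Z'), instead of rewriting the files.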

EDIT 3/15: Per zzarbi's request, I'm adding a note that after the table is created, Hive needs to be informed about the partitions that were created. This needs to be done repeatedly as long as Flume or another process creates new partitions. See my answer to the Create external with Partition question.
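
Concretely, since the Flume directories (/logs/yyyy/mm/dd/hh) do not follow Hive's key=value partition naming convention, MSCK REPAIR TABLE will not discover them; each new directory has to be registered with an explicit LOCATION. A minimal sketch, using the hypothetical apache_logs table above and the example paths from the question:

-- run once per new hour directory, e.g. from a cron job or a Flume post-roll step
ALTER TABLE apache_logs ADD IF NOT EXISTS
  PARTITION (log_year = '2012', log_month = '02',
             log_day_of_month = '10', log_hour_of_day = '00')
  LOCATION '/logs/2012/02/10/00';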
