Adding partitions to Hive from a MapReduce Job


Question




I am new to Hive and MapReduce and would really appreciate your answer; please also suggest the right approach.

I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS at /data/logs/. I have a MapReduce job which fetches these log files, splits them, and stores them under the folders mentioned above, like:

/data/logs/dt=2012-10-01/server01/
/data/logs/dt=2012-10-01/server02/
...
...
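As a side note, only the dt level of the layout above follows Hive's key=value partition-directory convention; the serverNN level does not. A minimal sketch of how the job might derive each record's output directory (the base path is from the question; the method and field names are illustrative, not from the original post):

```java
// Sketch: deriving a record's output directory under the external table's
// location. Base path taken from the question; names are illustrative.
public class PartitionPath {
    static final String BASE = "/data/logs";

    // e.g. forRecord("2012-10-01", "server01") -> "/data/logs/dt=2012-10-01/server01/"
    static String forRecord(String dt, String server) {
        return BASE + "/dt=" + dt + "/" + server + "/";
    }

    public static void main(String[] args) {
        System.out.println(forRecord("2012-10-01", "server01"));
    }
}
```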

From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:

  1. The ALTER TABLE command -- too many ALTER TABLE commands
  2. Adding dynamic partitions

For approach two I see only examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?

Solution

In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of now), and it takes a few days of frustration to figure out.

I've found that I need to do the following to get HCatalog MapReduce jobs to write to dynamic partitions:

In the record-writing phase of my job (usually the reducer), I have to manually add my dynamic partition fields (HCatFieldSchema) to my HCatSchema object.

The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partitioned fields. They need to be added manually:

HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);
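If the HCatalog route is not available, approach one from the question can also be made manageable: Hive accepts multiple PARTITION clauses in a single ALTER TABLE ... ADD PARTITION statement, so the job driver can collect the partitions the job created and emit one batched statement at the end. A sketch, assuming the table and partition columns are logs, dt, and server as in the question; explicit LOCATION clauses are used because the serverNN directories do not follow Hive's key=value format:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: building one batched ALTER TABLE ... ADD PARTITION statement in
// the driver after the job finishes. Table/column names (logs, dt, server)
// are assumed from the question, not confirmed by the original answer.
public class AddPartitionsBatch {

    // specs maps "dt,server" -> HDFS location of that partition's data
    static String buildStatement(Map<String, String> specs) {
        StringBuilder sb = new StringBuilder("ALTER TABLE logs ADD IF NOT EXISTS");
        for (Map.Entry<String, String> e : specs.entrySet()) {
            String[] kv = e.getKey().split(",");
            sb.append(" PARTITION (dt='").append(kv[0])
              .append("', server='").append(kv[1])
              .append("') LOCATION '").append(e.getValue()).append("'");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> specs = new LinkedHashMap<>();
        specs.put("2012-10-01,server01", "/data/logs/dt=2012-10-01/server01/");
        specs.put("2012-10-01,server02", "/data/logs/dt=2012-10-01/server02/");
        System.out.println(buildStatement(specs));
    }
}
```

The generated statement would then be run once, via the Hive CLI or JDBC, rather than issuing one ALTER TABLE per partition.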
