Adding partitions to Hive from a MapReduce job
Problem description
I am new to Hive and MapReduce, so I would really appreciate an answer that also points out the right approach.
I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS at /data/logs/. I have a MapReduce job which fetches these log files, splits them, and stores them under the folders mentioned above, like:
/data/logs/dt=2012-10-01/server01/
/data/logs/dt=2012-10-01/server02/
...
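The dt=2012-10-01 component follows Hive's key=value partition-path convention, while the server directories are plain names. A hypothetical helper (not from the original post) that mirrors this layout:

```java
// Hypothetical helper mirroring the directory layout the MapReduce job
// writes: <base>/dt=<date>/<server>/
public class PartitionPaths {

    // Hive can derive the dt partition value from the "dt=" path component;
    // the server component here is a plain directory name, as in the post.
    public static String logDir(String base, String date, String server) {
        return base + "/dt=" + date + "/" + server + "/";
    }

    public static void main(String[] args) {
        System.out.println(logDir("/data/logs", "2012-10-01", "server01"));
        // prints /data/logs/dt=2012-10-01/server01/
    }
}
```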
From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:
- alter table command -- Too many alter table commands
- adding dynamic partitions
For approach two, I only see examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?
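For context, the first approach means issuing one statement like the following per new directory (the partition column names dt and server are assumptions based on the table definition above; because the server directories are not in key=value form, an explicit LOCATION is needed):

```sql
-- IF NOT EXISTS makes the statement safe to re-run after each job.
ALTER TABLE logs ADD IF NOT EXISTS
  PARTITION (dt = '2012-10-01', server = 'server01')
  LOCATION '/data/logs/dt=2012-10-01/server01/';
```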
Solution

In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of this writing), and it takes a few days of frustration to figure out.
I've found that I need to do the following to get HCatalog MapReduce jobs to write to dynamic partitions:
In my record writing phase of my job (usually the reducer), I have to manually add my dynamic partitions (HCatFieldSchema) to my HCatSchema objects.
The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partition fields. They need to be added manually:
// Field names must match the table's partition column names
HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);
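The schema manipulation above fits into the reducer roughly like this. This is a sketch, not a drop-in implementation: the value column name "message" and the hard-coded partition values are placeholders, and the package names are assumptions (older releases use org.apache.hcatalog, newer ones org.apache.hive.hcatalog):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatFieldSchema;
import org.apache.hcatalog.data.schema.HCatFieldSchema.Type;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;

public class LogReducer
        extends Reducer<Text, Text, WritableComparable<?>, HCatRecord> {

    private HCatSchema schema;

    @Override
    protected void setup(Context context) throws IOException {
        // getTableSchema() omits the partition columns, so append them by hand
        schema = HCatOutputFormat.getTableSchema(context.getConfiguration());
        schema.append(new HCatFieldSchema("date", Type.STRING, null));
        schema.append(new HCatFieldSchema("some_partition", Type.STRING, null));
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            HCatRecord record = new DefaultHCatRecord(schema.size());
            record.set("message", schema, value.toString()); // regular column (assumed name)
            record.set("date", schema, "2012-10-01");        // dynamic partition values
            record.set("some_partition", schema, key.toString());
            context.write(null, record);
        }
    }
}
```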