Custom Map Reduce Program on Hive, what's the Rule? How about input and output?
I have been stuck for a few days: I want to create a custom MapReduce program based on my Hive query, but I found few examples after googling and I'm still confused about the rules.
What are the rules for creating a custom MapReduce program, and what about the mapper and reducer classes?
Can anyone provide a solution?
I want to develop this program in Java, but I'm still stuck. Also, when formatting output in the collector, how do I format the result in the mapper and reducer classes?
Could anybody give me some examples and an explanation of this kind of thing?
There are basically 2 ways to add custom mappers/reducers to Hive queries.
- using TRANSFORM
SELECT TRANSFORM(stuff1, stuff2) FROM table1 USING 'script' AS thing1, thing2
where stuff1 and stuff2 are fields in table1, and script is any executable that accepts the format I describe later. thing1 and thing2 are the outputs of script.
- using MAP and REDUCE
FROM (
  FROM table
  MAP table.f1, table.f2
  USING 'map_script'
  AS mp1, mp2
  CLUSTER BY mp1
) map_output
INSERT OVERWRITE TABLE someothertable
REDUCE map_output.mp1, map_output.mp2
USING 'reduce_script'
AS reducef1, reducef2;
This is slightly more complicated but gives more control. There are 2 parts to this. In the first part, the mapper script receives data from table and maps it to the fields mp1 and mp2. These are then passed on to reduce_script, which receives its input sorted on the key we specified in CLUSTER BY mp1. Mind you, one reducer will handle more than one key. The output of the reduce script goes to the table someothertable.
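To make the "more than one key per reducer" point concrete, here is a minimal Python sketch of what reduce_script might look like. It is only an illustration under assumptions that are not in the query above: that rows arrive as tab-separated mp1, mp2 pairs (Hive's default for streamed scripts), that mp2 holds an integer, and that the reducer should sum mp2 per key. Because the input is sorted on mp1, rows with the same key are adjacent, so the script only has to notice where one key ends and the next begins.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines):
    """Yield (mp1, mp2) pairs from tab-separated input lines."""
    for line in lines:
        mp1, mp2 = line.rstrip("\n").split("\t")
        yield mp1, mp2


def reduce_rows(lines):
    """Yield one 'key<TAB>sum' output row per distinct key.

    Relies on rows with the same key arriving consecutively,
    which is what CLUSTER BY mp1 provides. Summing mp2 is a
    hypothetical aggregation chosen for this sketch.
    """
    for key, group in groupby(parse(lines), key=lambda row: row[0]):
        total = sum(int(mp2) for _, mp2 in group)
        yield key + "\t" + str(total)


if __name__ == "__main__":
    # One reducer process sees many keys on a single stdin stream.
    for row in reduce_rows(sys.stdin):
        print(row)
```

The two output columns correspond to reducef1 and reducef2 in the query.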
Now all these scripts follow a simple pattern. They read line by line from stdin, where the fields are tab-separated, and they write back to stdout in the same manner (fields separated by '\t').
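As a rough sketch of that pattern, a map or transform script in Python could look like the following. The transformation itself (upper-casing the first field, taking the length of the second) is made up for illustration, and the handling of \N assumes Hive's default behaviour of passing NULL values to scripts as the literal string \N.

```python
#!/usr/bin/env python
import sys

HIVE_NULL = "\\N"  # assumption: Hive's default text form of NULL (backslash + N)


def decode_row(line):
    """Split one stdin line into fields, mapping the NULL marker to None."""
    fields = line.rstrip("\n").split("\t")
    return [None if f == HIVE_NULL else f for f in fields]


def encode_row(values):
    """Join values into one tab-separated stdout line, mapping None back to the NULL marker."""
    return "\t".join(HIVE_NULL if v is None else str(v) for v in values)


def transform(row):
    """Hypothetical mapping: (stuff1, stuff2) -> (thing1, thing2)."""
    stuff1, stuff2 = row
    thing1 = stuff1.upper() if stuff1 is not None else None
    thing2 = len(stuff2) if stuff2 is not None else None
    return [thing1, thing2]


if __name__ == "__main__":
    # One input row per line in, one output row per line out.
    for line in sys.stdin:
        print(encode_row(transform(decode_row(line))))
```

Saved under a hypothetical name such as my_script.py, it would typically be shipped to the cluster with ADD FILE and then referenced as USING 'python my_script.py'. A Java program works the same way, as long as it reads rows from stdin and writes rows to stdout.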
Check out this blog; it has some nice examples:
http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html
http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html