Custom Map Reduce Program on Hive: What Are the Rules? What About Input and Output?


Problem description

I have been stuck for a few days. I want to create a custom map reduce program based on my Hive query, but I found very few examples when googling and I'm still confused about the rules.

What are the rules for creating a custom MapReduce program, and what about the mapper and reducer classes?

Can anyone provide a solution?

I want to develop this program in Java, but I'm still stuck. Also, when formatting output in the collector, how do I format the result in the mapper and reducer classes?

Could anybody give me some examples and an explanation of this kind of thing?

Solution

There are basically two ways to add custom mappers/reducers to Hive queries.

  1. Using TRANSFORM

SELECT TRANSFORM(stuff1, stuff2) USING 'script' AS thing1, thing2 FROM table1;

Here stuff1 and stuff2 are fields in table1, and script is any executable that accepts the input format described later. thing1 and thing2 are the output columns produced by script.

  2. Using MAP and REDUCE

FROM (
    FROM table
    MAP table.f1, table.f2
    USING 'map_script'
    AS mp1, mp2
    CLUSTER BY mp1) map_output
  INSERT OVERWRITE TABLE someothertable
    REDUCE map_output.mp1, map_output.mp2
    USING 'reduce_script'
    AS reducef1, reducef2;

This is slightly more complicated but gives you more control. There are two parts to it. In the first part, the mapper script receives rows from table and maps them to the fields mp1 and mp2. These are then passed on to reduce_script, which receives its input sorted on the key we specified in CLUSTER BY mp1. Mind you, one reducer can handle more than one key, so the reduce script has to detect for itself where one key ends and the next begins. The output of the reduce script goes to the table someothertable.
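As a rough illustration of the reduce side, here is a minimal sketch of what a reduce_script could look like in Python. It assumes the input arrives as tab-separated lines (key first) already sorted by key, as CLUSTER BY provides, and it groups consecutive lines with the same key because one reducer may see several keys. The function names and the "count rows per key" logic are hypothetical choices for the example, not something the query above prescribes.

```python
#!/usr/bin/env python3
"""Hypothetical reduce script: counts how many rows arrive for each key."""
import sys
from itertools import groupby


def parse(line):
    """Split one tab-separated input line into (key, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def reduce_lines(lines):
    """Yield one 'key<TAB>count' line per run of consecutive identical keys.

    Relies on the input being sorted by key, so all rows for a key
    are adjacent even when this reducer handles several keys.
    """
    rows = (parse(line) for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        count = sum(1 for _ in group)
        yield f"{key}\t{count}"


if __name__ == "__main__":
    # Hive feeds rows on stdin and reads the result rows from stdout.
    for out in reduce_lines(sys.stdin):
        print(out)
```

The two output columns would correspond to reducef1 and reducef2 in the query. Since Hive just runs an executable, the same structure could be written in Java or any other language that reads stdin and writes stdout.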

All of these scripts follow the same simple pattern. They read their input line by line from stdin, with the fields of each row separated by a tab ('\t'), and they write their results back to stdout in the same manner (one row per line, fields separated by '\t').
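The stdin/stdout pattern can be sketched with a small map script. This is only an example under assumptions: it expects exactly two tab-separated input fields (matching f1 and f2 in the query), and the transformation itself (lower-casing the first field, taking the length of the second) is made up purely for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical map script: reads 'f1<TAB>f2' rows, writes 'mp1<TAB>mp2' rows."""
import sys


def map_line(line):
    """Turn one tab-separated input line into one tab-separated output line."""
    f1, f2 = line.rstrip("\n").split("\t")
    # Illustrative transformation only: lower-case f1, length of f2.
    return f"{f1.lower()}\t{len(f2)}"


if __name__ == "__main__":
    # One input row per line on stdin, one output row per line on stdout.
    for line in sys.stdin:
        print(map_line(line))
```

To use such a script, you would typically make it available to the cluster with ADD FILE and then reference it in the USING clause, for example USING 'python3 map_script.py' (the file name here is an assumption).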

Check out this blog; it has some nice examples.

http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html

http://dev.bizo.com/2009/10/reduce-scripts-in-hive.html
