Passing values from Mapper to Reducer

Question

There is a small amount of meta-data that I get by looking up the current file the mapper is working on (and a few other things). I need to send this meta-data over to the reducer. Sure, I could have the mapper emit it in the <Key, Value> pair it generates, as <Key, Value + Meta-Data>, but I want to avoid that.

Also, constraining myself a little more, I do not want to use the DistributedCache. So, do I still have some options left? More precisely, my question is twofold:

(1) I tried setting some parameters by doing a job.set(Prop, Value) in my mapper's configure(JobConf) and a job.get(Prop) in my reducer's configure(JobConf). Sadly, I found it does not work. As an aside, I am interested in knowing why it behaves this way. My main question is:
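
Roughly, the attempt in (1) looks like the sketch below (the class and property names are hypothetical). As to why it does not work: each task runs in its own JVM with its own deserialized copy of the JobConf, so a set(...) made inside a map task is never shipped to the reduce tasks.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of the attempted pattern (class and property names are made up).
public class MetaMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void configure(JobConf job) {
        // "map.input.file" holds the file this map task is reading (old API).
        String inputFile = job.get("map.input.file");
        // This mutation happens in the map task's own JVM, on its own copy
        // of the JobConf, so it never reaches the reducers' configure().
        job.set("meta.data.prop", inputFile);
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        output.collect(key, value); // plain pass-through, no meta-data emitted
    }
}
```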

(2) How can I send the value from the mapper to the reducer in a "clean" way (if possible, within the constraints I want)?

EDIT (in view of the response by Praveen Sripati)

To make it more concrete, here is what I want. Based on the type of data emitted, we want it stored under different files (say data d1 ends up in D1 and data d2 ends up in D2).

The values D1 and D2 can be read from a config file, and figuring out what goes where depends on the value of map.input.file. That is, the pair <k1, d1> should, after some processing, go to D1, and <k2, d2> should go to D2. I do not want to emit things like <k1, d1 + D1>. Can I somehow figure out the association without emitting D1 or D2, maybe by cleverly using the config file? The input source (i.e., input directory) for k1, d1 and k2, d2 is the same, which again can only be seen through map.input.file.
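
A minimal driver-side sketch of the config-file angle (class and property names are hypothetical): values set on the JobConf before the job is submitted become part of the configuration that every map and reduce task receives, so D1 and D2 read from a config file could be distributed this way without being emitted.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Driver-side sketch; class and property names are hypothetical.
public class RoutingDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RoutingDriver.class);
        // Anything set here, before submission, is serialized into the job
        // configuration and visible to every map and reduce task through
        // configure(JobConf) / job.get(...). Setting it inside a task is not.
        conf.set("routing.dir.d1", "D1"); // in practice, read from the config file
        conf.set("routing.dir.d2", "D2");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```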

Please let me know when you get time.

Regards
-Akash

Solution

Based on the type of data emitted we want it stored under different directories (say data d1 ends up in D1 and data d2 ends up in D2).

Usually the output of an MR job goes to a single output folder, and each mapper/reducer writes to a separate file. I am not sure how to write an MR job's output to different directories without changes to the Hadoop framework.

But, based on the output key/value types from the mapper/reducer, the output file can be chosen. Use one of the subclasses of MultipleOutputFormat: the MultipleOutputFormat#generateFileNameForKeyValue method has to be implemented to return a string based on the input key.
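
A minimal sketch of such a subclass (the class name and routing rule are made up), using the old-API MultipleTextOutputFormat:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route records to leaf files under D1/ or D2/ inside the job's output
// directory, based on the key (hypothetical routing rule).
public class TypeRoutingOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default leaf file name, e.g. "part-00000".
        if (key.toString().startsWith("k1")) {
            return "D1/" + name;
        }
        return "D2/" + name;
    }
}
```

The returned string is interpreted as a path relative to the job's output directory, so records land under D1/ and D2/; wire it in with conf.setOutputFormat(TypeRoutingOutputFormat.class) in the driver.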

See how PartitionByStationUsingMultipleOutputFormat is implemented in the sample code for the book Hadoop: The Definitive Guide.

Once the job has completed, the output can easily be moved to a different directory using hadoop commands.
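
For example (paths hypothetical), `hadoop fs -mv out/D1 /data/D1` moves one branch of the job's output within HDFS.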
