Hadoop mapper reading from 2 different source input files


Problem description



I have a tool which chains a lot of Mappers & Reducers, and at some point I need to merge the results from previous map-reduce steps. For example, as input I have two files with data:

/input/a.txt
apple,10
orange,20

/input/b.txt
apple;5
orange;40

The result should be c.txt, where c.value = a.value * b.value:

/output/c.txt
apple,50   // 10 * 5
orange,800 // 40 * 20

How could this be done? I've resolved it by introducing a simple Key => MyMapWritable (type=1,2, value) and merging (actually, multiplying) the data in the reducer. It works, but:

  1. I have a feeling it could be done more simply (it smells bad).
  2. Is it possible to know inside the Mapper which file the current record came from (a.txt or b.txt)? For now, I just used different separators: a comma & a semicolon :(
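The tagged-value approach described above can be simulated outside Hadoop. Below is a minimal plain-Python sketch, not actual MapReduce code; the source tags "a"/"b" and the in-memory dictionary standing in for the shuffle step are illustrative assumptions:

```python
from collections import defaultdict

def map_phase(lines, source, sep):
    """Emit (key, (source_tag, value)) pairs, tagging each record with its file."""
    for line in lines:
        key, value = line.split(sep)
        yield key, (source, int(value))

def reduce_phase(grouped):
    """Multiply together the values that arrived for the same key."""
    for key, tagged in grouped.items():
        product = 1
        for _source, value in tagged:
            product *= value
        yield key, product

a_lines = ["apple,10", "orange,20"]   # /input/a.txt
b_lines = ["apple;5", "orange;40"]    # /input/b.txt

groups = defaultdict(list)            # stands in for the shuffle/sort step
pairs = list(map_phase(a_lines, "a", ",")) + list(map_phase(b_lines, "b", ";"))
for key, tagged in pairs:
    groups[key].append(tagged)

result = dict(reduce_phase(groups))   # {'apple': 50, 'orange': 800}
```

This is exactly the reduce-side join the question implements with MyMapWritable: the map side tags each value with its origin, and the reducer sees both values for a key together.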

Solution

Assuming both inputs have been partitioned and sorted in the same way, you can use CompositeInputFormat to perform a map-side join. There's an article on using it here. I don't think it has been ported to the new mapreduce API, though.
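Conceptually, a map-side join over two inputs that are sorted and partitioned identically is just a streaming merge. A minimal sketch of that idea in plain Python (this illustrates the concept, not the CompositeInputFormat API itself):

```python
def merge_join(left, right):
    """Merge two key-sorted (key, value) streams, multiplying values on matching keys."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l[0], l[1] * r[1]       # matching keys: emit the product
            l, r = next(left, None), next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)          # left key is behind: advance left
        else:
            r = next(right, None)         # right key is behind: advance right

a = [("apple", 10), ("orange", 20)]       # already sorted by key
b = [("apple", 5), ("orange", 40)]
print(dict(merge_join(a, b)))             # {'apple': 50, 'orange': 800}
```

Because each stream is consumed in order exactly once, no reducer (and no shuffle) is needed; that is why the inputs must already be partitioned and sorted identically.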

Secondly, you can get the input file in the mapper by calling context.getInputSplit(). This returns an InputSplit which, if you're using TextInputFormat, you can cast to a FileSplit and then call getPath() to get the file name. I don't think you can use this method with CompositeInputFormat, though, as you won't know which input the Writables in the TupleWritable came from.
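If you're on Hadoop Streaming rather than the Java API, the same information is exposed to the mapper process as an environment variable (job configuration keys have their dots replaced by underscores, so mapreduce.map.input.file becomes mapreduce_map_input_file). A hedged sketch of picking the separator from the file name; the helper names here are illustrative:

```python
def detect_separator(input_file):
    """Pick the record separator based on which source file is being read."""
    return "," if input_file.endswith("a.txt") else ";"

def map_record(line, input_file):
    """Parse a line using the separator implied by its source file, and tag it."""
    sep = detect_separator(input_file)
    key, value = line.split(sep)
    tag = "a" if sep == "," else "b"
    return key, (tag, int(value))

# In a real streaming job the path would come from the environment:
#   import os
#   input_file = os.environ["mapreduce_map_input_file"]
input_file = "/input/b.txt"                  # illustrative stand-in
print(map_record("orange;40", input_file))   # ('orange', ('b', 40))
```

This removes the need to encode the source file in the separator at all: the mapper can tag each record explicitly with its origin.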
