Hadoop mapper reading from 2 different source input files
Question
I have a tool which chains a lot of Mappers & Reducers, and at some point I need to merge the results of previous map-reduce steps. For example, as input I have two files with data:
/input/a.txt
apple,10
orange,20
/input/b.txt
apple;5
orange;40
The result should be c.txt, where c.value = a.value * b.value:
/output/c.txt
apple,50 // 10 * 5
orange,800 // 40 * 20
How could this be done? I've resolved it by introducing a simple Key => MyMapWritable (type=1|2, value) and merging (actually, multiplying) the data in the reducer. It works, but:
- I have a feeling it could be done more easily (it smells bad)
- Is it possible to somehow know inside the Mapper exactly which file was used as the record provider (a.txt or b.txt)? For now, I just used different separators: comma & semicolon :(
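For reference, the tagged-value approach described above can be simulated outside Hadoop with plain collections. The class and method names here are illustrative, not from the question; this just shows the merge logic the reducer performs:

```java
import java.util.HashMap;
import java.util.Map;

// Simulates the asker's approach: each source file uses its own separator,
// records are grouped by key, and the two values are multiplied in the
// "reducer" step.
public class TaggedMerge {

    // Parse lines like "apple,10" or "apple;5" using the separator that
    // identifies the source file.
    static Map<String, Integer> parse(String[] lines, String sep) {
        Map<String, Integer> out = new HashMap<>();
        for (String line : lines) {
            String[] kv = line.split(sep);
            out.put(kv[0], Integer.parseInt(kv[1]));
        }
        return out;
    }

    // The reduce step: for each key present in both inputs, emit a.value * b.value.
    static Map<String, Integer> multiply(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> c = new HashMap<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                c.put(e.getKey(), e.getValue() * other);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = parse(new String[] {"apple,10", "orange,20"}, ",");
        Map<String, Integer> b = parse(new String[] {"apple;5", "orange;40"}, ";");
        System.out.println(multiply(a, b)); // apple=50, orange=800
    }
}
```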
Answer

Assuming they have been partitioned and sorted in the same way, you can use CompositeInputFormat to perform a map-side join. There's an article on using it here. I don't think it's been ported to the new mapreduce API, though.
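The map-side join setup might look roughly like this with the old "mapred" API. The `buildJoinExpr` helper below is an illustrative stand-in that mimics the join-expression syntax normally produced by `CompositeInputFormat.compose(...)`; the real Hadoop wiring is shown in comments:

```java
// Sketch of the map-side join setup (old "mapred" API). In a real job the
// expression would be produced by CompositeInputFormat.compose(...) and set via:
//
//   conf.setInputFormat(CompositeInputFormat.class);
//   conf.set("mapred.join.expr", CompositeInputFormat.compose(
//       "inner", KeyValueTextInputFormat.class, "/input/a.txt", "/input/b.txt"));
//
// buildJoinExpr is a hypothetical helper that builds the same expression shape.
public class JoinExpr {

    // Build an expression like: inner(tbl(SomeInputFormat,"/input/a.txt"),tbl(SomeInputFormat,"/input/b.txt"))
    static String buildJoinExpr(String op, String inputFormat, String... paths) {
        StringBuilder sb = new StringBuilder(op).append('(');
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) sb.append(',');
            sb.append("tbl(").append(inputFormat).append(",\"").append(paths[i]).append("\")");
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(buildJoinExpr("inner",
                "org.apache.hadoop.mapred.KeyValueTextInputFormat",
                "/input/a.txt", "/input/b.txt"));
    }
}
```

Both inputs must have the same number of partitions and identical key sorting within them, which is why the answer's caveat about partitioning matters.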
Secondly, you can get the input file in the mapper by calling context.getInputSplit(). This returns the InputSplit which, if you're using TextInputFormat, you can cast to a FileSplit and then call getPath() on to get the file name. I don't think you can use this method with CompositeInputFormat, though, as you won't know which file the Writables in the TupleWritable came from.