MapReduce: One-to-one processing of multiple input files


Problem description


Please clarify

I have a set of input files (say 10) with specific names. I run a word count job on all files at once (the input path is a folder). I expect 10 output files with the same names as the input files. That is, the words in File1 should be counted and the result stored in a separate output file named "file1", and likewise for every other file.

Solution

  1. Set the number of reduce tasks equal to the number of input files. This also produces that same number of output files.

  2. Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in a file named "file0.txt", you can emit the key "0_cat", or "file0_cat", or anything else that is unique to "file0.txt". Use the context to look up the name of the file currently being read.

  3. Override the default Partitioner to make sure that all map output keys with the prefix "0_" or "file0_" go to the first partition, all keys with the prefix "1_" or "file1_" go to the second, and so on.

  4. In the reducer, strip the "x_" or "filex_" prefix from the key so that only the word is written out, and use the prefix as the name of the output file (using MultipleOutputs). Alternatively, if you don't want to use MultipleOutputs, you can work out the mapping between output files and input files by checking your Partitioner code (e.g., part-00000 will be partition 0's output).
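
The key handling in steps 2 to 4 can be sketched in plain Java, without any Hadoop dependency, to show how the prefix drives the routing. This is only an illustration under stated assumptions: the class and method names (`PrefixPartitionSketch`, `prefixedKey`, `partitionFor`, `stripPrefix`, `countPerFile`) are hypothetical, the prefix is assumed to be a numeric file index followed by "_", and the in-memory loop merely stands in for the shuffle. In a real job, the body of `partitionFor` would go into a subclass of `org.apache.hadoop.mapreduce.Partitioner` that overrides `getPartition(key, value, numPartitions)`, and with file-based input formats the mapper can typically obtain the file name via `((FileSplit) context.getInputSplit()).getPath().getName()`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the prefix/partition idea; all names are hypothetical.
class PrefixPartitionSketch {

    // Step 2: build the map output key, e.g. file index 0 and "cat" -> "0_cat".
    static String prefixedKey(int fileIndex, String word) {
        return fileIndex + "_" + word;
    }

    // Step 3: route a key by its numeric prefix (assumed format "<index>_<word>").
    // Using the first '_' keeps words that themselves contain '_' intact.
    static int partitionFor(String key, int numPartitions) {
        int sep = key.indexOf('_');
        int fileIndex = Integer.parseInt(key.substring(0, sep));
        return fileIndex % numPartitions;
    }

    // Step 4: drop the prefix so only the word is written to the output.
    static String stripPrefix(String key) {
        return key.substring(key.indexOf('_') + 1);
    }

    // Stand-in for map + shuffle + reduce: one partition (output file) per input file.
    // files.get(i) holds the words of input file i.
    static Map<Integer, Map<String, Integer>> countPerFile(List<List<String>> files) {
        int numPartitions = files.size(); // step 1: reducers == input files
        Map<Integer, Map<String, Integer>> partitions = new TreeMap<>();
        for (int i = 0; i < files.size(); i++) {
            for (String word : files.get(i)) {
                String key = prefixedKey(i, word);
                int p = partitionFor(key, numPartitions);
                partitions.computeIfAbsent(p, k -> new TreeMap<>())
                          .merge(stripPrefix(key), 1, Integer::sum);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> files = List.of(
                List.of("cat", "dog", "cat"),   // stands in for file0.txt
                List.of("cat", "bird"));        // stands in for file1.txt
        System.out.println(countPerFile(files));
    }
}
```

Because each partition only ever receives keys from one input file, the counts for the same word in different files stay separate, which is the property the question asks for. Mapping real file names such as "file0.txt" to indices is left out here and would need a scheme of your own.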
