One mapper or one reducer to process one file or directory


Problem description

I am new to Hadoop and MapReduce. I have some directories with files within them (each file is 10 MB and N could be around 100; files may be compressed or uncompressed), such as:

MyDir1/file1
MyDir1/file2
...
MyDir1/fileN

MyDir2/file1
MyDir2/file2
...
MyDir2/fileN



I want to design a MapReduce application where a single mapper or reducer would process all of MyDir1, i.e. I don't want MyDir1 to be split across multiple mappers. Similarly, I want MyDir2 to be processed completely by another mapper/reducer, without being split.



Any idea on how to go about this? Do I need to write my own InputFormat and read the input files myself?

Solution

Override FileInputFormat#isSplitable() so that it returns false. The input files are then not split, and each one is processed by a single map. Note that even though the mappers execute in parallel, the time to complete the job depends on the time to process the largest input file. Also, this might not be efficient, since there will be a lot of data shuffling across nodes.

import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

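For completeness, here is a minimal sketch of a driver wiring the format above into an old-API (mapred) job. This is a configuration sketch rather than a tested program: `WholeFileDriver`, the job name, and the input/output paths are illustrative, not from the answer.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WholeFileDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WholeFileDriver.class);
        conf.setJobName("whole-file"); // illustrative job name
        // Plug in the non-splitting format: each input file goes to one map.
        conf.setInputFormat(NonSplittableTextInputFormat.class);
        // Illustrative paths; in practice these usually come from args.
        FileInputFormat.setInputPaths(conf, new Path("MyDir1"), new Path("MyDir2"));
        FileOutputFormat.setOutputPath(conf, new Path("out"));
        JobClient.runJob(conf);
    }
}
```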
The current API doesn't allow a whole directory to be processed by a single mapper. You might have to write your own InputFormat. Or else, create a list of the directories to be processed and pass a single directory to each mapper; again, this is not efficient because of data shuffling between nodes.
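The directory-list alternative can be sketched with Hadoop's stock NLineInputFormat: generate a control file with one directory path per line, submit that file as the job input with N = 1 so each mapper receives exactly one line (one directory), and have the mapper read every file under its directory itself. Below is a minimal, Hadoop-free sketch of building the control file; the `DirListBuilder` class and the `dirs.txt` name are illustrative, not from the answer.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DirListBuilder {
    // Collect the immediate subdirectories of root, one per future map task.
    public static List<String> listDirs(Path root) throws IOException {
        List<String> dirs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path p : stream) {
                if (Files.isDirectory(p)) {
                    dirs.add(p.toString());
                }
            }
        }
        Collections.sort(dirs); // deterministic order for the control file
        return dirs;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // One directory path per line; NLineInputFormat with N = 1 then hands
        // each line (i.e. each directory) to its own mapper.
        Files.write(Paths.get("dirs.txt"), listDirs(root));
    }
}
```

Inside the job, each mapper would open the directory it is handed (e.g. via FileSystem#listStatus) and process every file under it, which is what keeps one whole directory on one mapper.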



Coming back to reducers: they operate on the output key/value pairs from the mappers, not on the input files/directories.


