How does MapReduce read from multiple input files?


Question


I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple input files, I don't understand how they are processed. The input path to the mapper is the name of the directory, as is evident from the output of:

String filename = conf1.get("map.input.file");

So how does it process the files in the directory?

Solution

In order to get the input file path you can use the context object, like this:

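// Inside map(): the split assigned to this mapper (FileSplit is in org.apache.hadoop.mapreduce.lib.input)
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();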

As for how multiple files are processed:

Several instances of the mapper are created on different machines in the cluster. Each instance receives a different input split; by default, every file in the input directory yields at least one split, so separate files go to separate mappers. If a file is larger than the default HDFS block size (128 MB), it is further divided into smaller splits, which are then distributed to mappers.
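The per-file splitting described above can be sketched in plain Java, with no Hadoop dependency. The class name `SplitCountSketch`, its method names, and the file names and sizes are hypothetical, chosen only for illustration. The logic is a simplification: it assumes the split size equals the block size, models non-empty files only, and ignores the small slack Hadoop allows on the last split of a file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitCountSketch {
    static final long MB = 1024L * 1024L;

    // Number of splits for one non-empty file: ceiling of fileBytes / splitBytes.
    static long splitsForFile(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    // Total splits (and therefore map tasks) for all files in an input directory.
    static long totalSplits(Map<String, Long> fileSizes, long splitBytes) {
        long total = 0;
        for (long size : fileSizes.values()) {
            total += splitsForFile(size, splitBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing: file name -> size in bytes.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("part-a.txt", 10 * MB);   // smaller than one block: 1 split
        files.put("part-b.txt", 128 * MB);  // exactly one block: 1 split
        files.put("part-c.txt", 300 * MB);  // spans three blocks: 3 splits

        System.out.println(totalSplits(files, 128 * MB)); // prints 5
    }
}
```

Under these assumptions, a directory of three files produces five map tasks, and each task can report its own file through `fileSplit.getPath()`.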

So you can configure the amount of input each mapper receives in two ways:

  • change the HDFS block size (e.g. dfs.block.size=1048576, i.e. 1 MB)
  • set the parameter mapred.min.split.size (this only has an effect when it is set larger than the HDFS block size)

Note: These parameters are only effective if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting, so for such files these settings are ignored and each file goes to a single mapper.
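To see why the minimum split size only matters when it exceeds the block size, here is a minimal sketch in plain Java. It assumes the split size is the block size clamped between the minimum and maximum split size, which to my knowledge is the rule Hadoop's FileInputFormat applies. The class `SplitSizeSketch`, the helper `mapTasksFor`, and the file sizes are hypothetical.

```java
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Effective split size: the block size, clamped between min and max split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Map tasks for one file; a non-splittable file (e.g. gzip) always gets one.
    static long mapTasksFor(long fileBytes, long splitSize, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        return Math.max(1, (fileBytes + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        long file = 1024 * MB; // hypothetical 1 GB input file

        long defaultSplit = computeSplitSize(block, 1, Long.MAX_VALUE);
        long lowMinSplit = computeSplitSize(block, 64 * MB, Long.MAX_VALUE);
        long raisedSplit = computeSplitSize(block, 256 * MB, Long.MAX_VALUE);

        System.out.println(defaultSplit / MB);                     // prints 128
        System.out.println(lowMinSplit / MB);                      // prints 128 (min below block size: no effect)
        System.out.println(raisedSplit / MB);                      // prints 256
        System.out.println(mapTasksFor(file, defaultSplit, true)); // prints 8
        System.out.println(mapTasksFor(file, raisedSplit, true));  // prints 4
        System.out.println(mapTasksFor(file, raisedSplit, false)); // prints 1
    }
}
```

In this sketch, raising the minimum split size above the block size halves the number of map tasks, while a gzip-style non-splittable file gets a single mapper regardless of either setting.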
