What is the default size that each Hadoop mapper will read?

Problem description

Is it the block size of 64 MB for HDFS? Is there any configuration parameter that I can use to change it?

For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?

Solution

This is dependent on your:

  • Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything that extends FileInputFormat will use the block boundaries as a guide
  • File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS - if not explicitly set, the default block size (at the time of upload) is applied. Any changes to the default / system block size after the file has been uploaded have no effect on files already in HDFS.
  • The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE respectively, but if these are overridden in your system configuration, or in your job, then this will change the amount of data processed by each mapper, and the number of mapper tasks spawned (see the sketch after this list)
  • Non-splittable compression - formats such as gzip cannot be processed by more than a single mapper, so you'll get 1 mapper per gzip file (unless you're using something like CombineFileInputFormat or CompositeInputFormat)
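To make the interplay of these settings concrete, here is a minimal sketch of the clamping logic that FileInputFormat applies when sizing splits (it mirrors its computeSplitSize method); the class name, the 64 MB block size and the 128 MB override are illustrative values, not from the original question:

```java
// Minimal sketch of FileInputFormat-style split sizing: the effective split
// size is the file's block size clamped between the min and max split settings.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Mirrors FileInputFormat.computeSplitSize(...)
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB file block size (example)
        long minSize   = 1L;                  // mapred.min.split.size default
        long maxSize   = Long.MAX_VALUE;      // mapred.max.split.size default

        // With the defaults, splits track the block size: one 64 MB split per block.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));

        // Raising the minimum split size to 128 MB roughly halves the mapper count;
        // each split now spans two blocks, which may live on different data nodes.
        System.out.println(computeSplitSize(blockSize, 128L * 1024 * 1024, maxSize));
    }
}
```

With the defaults the formula simply returns the block size, which is why a mapper normally reads one block's worth of data.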

So if you have a file with a block size of 64 MB, but want to process either more or less than this per map task, then you should just be able to set the following job configuration properties (a short configuration sketch follows this list):

  • mapred.min.split.size - larger than the default if you want to use fewer mappers, at the expense of (potentially) losing data locality (all data processed by a single map task may now be on 2 or more data nodes)
  • mapred.max.split.size - smaller than the default if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
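For example, a hedged sketch of overriding these two properties in a driver's Configuration (the class name and sizes are illustrative; on MR2 these old property names are deprecated but still honoured):

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: override the MR1 split-size properties before building the job.
public class SplitOverrideSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Fewer mappers: never build a split smaller than 128 MB, even though the
        // file's blocks are 64 MB (trades some data locality for fewer tasks).
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

        // More mappers: cap splits at 32 MB so two map tasks share each 64 MB block
        // (useful when the mapper is CPU bound). Don't combine with the line above.
        // conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);

        // ... create the Job / JobConf from this Configuration and submit as usual.
        System.out.println("min split = " + conf.getLong("mapred.min.split.size", 1L));
    }
}
```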

If you're using MR2 / YARN then the above properties are deprecated and replaced by the following (a brief new-API sketch follows the list):

  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.input.fileinputformat.split.maxsize
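As a sketch, with the new (mapreduce) API you can also set these through the FileInputFormat helper methods instead of hard-coding the property names; the job name and sizes below are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative MR2/YARN sketch: these helpers write the
// mapreduce.input.fileinputformat.split.minsize / .maxsize properties for the job.
public class Mr2SplitSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Lower bound on split size (fewer, larger splits).
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

        // Upper bound on split size (more, smaller splits).
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        // ... set mapper, reducer and input/output paths, then job.waitForCompletion(true)
    }
}
```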
