What is the default size that each Hadoop mapper will read?

Question

Is it the block size of 64 MB for HDFS? Is there any configuration parameter that I can use to change it?

For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?

Answer

It depends on your:

  • Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything extending from FileInputFormat will use the block boundaries as guides.
  • File block size - individual files don't need to have the same block size as the default block size. It is set when the file is uploaded into HDFS - if not explicitly set, the default block size (at the time of upload) is applied. Any change to the default / system block size after the file has been uploaded has no effect on the already uploaded file (the sketch after this list shows how to check a file's actual block size).
  • The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE respectively, but if these are overridden in your system configuration or in your job, they will change the amount of data processed by each mapper and the number of mapper tasks spawned.
  • Non-splittable compression - files compressed with, for example, gzip cannot be processed by more than a single mapper, so you'll get 1 mapper per gzip file (unless you use something like CombineFileInputFormat or CompositeInputFormat).
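
Because the block size is a per-file attribute fixed at upload time, a quick way to confirm what a given file actually uses is to ask the NameNode through the FileSystem API. A minimal sketch, assuming the Hadoop config files are on the classpath and a hypothetical HDFS path /data/input/part-00000.gz:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block size recorded for this particular file at upload time;
        // it can differ from the cluster's current default.
        Path file = new Path("/data/input/part-00000.gz"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("File block size:    " + status.getBlockSize());

        // Default block size the cluster would use for newly written files
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
    }
}
```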

So if you have a file with a block size of 64 MB, but want to process more or less than this per map task, then you should just be able to set the following job configuration properties (see the sketch after the list):

  • mapred.min.split.size - larger than the default if you want to use fewer mappers, at the expense of (potentially) losing data locality (all the data processed by a single map task may now be on 2 or more data nodes)
  • mapred.max.split.size - smaller than the default if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
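
For example, a minimal sketch of overriding the old property names on a job's Configuration (the class name, job name, and the 128 MB / 32 MB values are arbitrary illustrations, not values from the question):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fewer (larger) splits: require each split to be at least 128 MB
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

        // More (smaller) splits: cap each split at 32 MB
        // Use one direction or the other - a min larger than the max makes no sense.
        // conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-size demo");
        // ... configure mapper, reducer, input/output paths as usual ...
    }
}
```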

If you're using MR2 / YARN, then the above properties are deprecated and replaced by the following (sketched below):

  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.input.fileinputformat.split.maxsize
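
With the new API you don't have to hard-code the property names at all: FileInputFormat has static helpers that write them for you. A minimal sketch (class name and the 256 MB / 32 MB values are illustrative only):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class YarnSplitSizeJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // Sets mapreduce.input.fileinputformat.split.minsize: fewer, larger splits
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // Sets mapreduce.input.fileinputformat.split.maxsize: more, smaller splits
        // Again, pick one direction; setting both like this would conflict.
        // FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
    }
}
```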
