Expected consumption of open file descriptors in Hadoop 0.21.0


Problem description



Given Hadoop 0.21.0, what assumptions does the framework make regarding the number of open file descriptors relative to each individual map and reduce operation? Specifically, what suboperations cause Hadoop to open a new file descriptor during job execution or spill to disk?

(This is deliberately ignoring use of MultipleOutputs, as it very clearly screws with the guarantees provided by the system.)

My rationale here is simple: I'd like to ensure each job I write for Hadoop guarantees a finite number of required file descriptors for each mapper or reducer. Hadoop cheerfully abstracts this away from the programmer, which would normally be A Good Thing, if not for the other shoe dropping during server management.

I'd originally asked this question on Server Fault from the cluster management side of things. Since I'm also responsible for programming, this question is equally pertinent here.

Solution

Here's a post that offers some insight into the problem:

This happens because more small files are created when you use MultipleOutputs class. Say you have 50 mappers then assuming that you don't have skewed data, Test1 will always generate exactly 50 files but Test2 will generate somewhere between 50 to 1000 files (50Mappers x 20TotalPartitionsPossible) and this causes a performance hit in I/O. In my benchmark, 199 output files were generated for Test1 and 4569 output files were generated for Test2.

This implies that, for normal behavior, the number of mappers is exactly equivalent to the number of open file descriptors. MultipleOutputs obviously skews this, multiplying it out to the number of mappers times the number of available partitions. Reducers then proceed as normal, generating one file (and thus, one file descriptor) per reduce operation.
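To make that multiplier concrete, here is a minimal, hypothetical sketch using the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output) shipped with 0.21.0. The class name and the 20-bucket partitioning scheme are illustrative assumptions, not code from the quoted benchmark; the point is that each distinct base output path lazily opens its own writer, and therefore its own file descriptor, which stays open until cleanup.

// Hypothetical sketch: a map task that can end up holding one open output
// file per distinct "partition" value it emits (up to 20 here).
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PartitionedOutputMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String record = value.toString();
    // Illustrative partitioning: hash each record into one of 20 buckets.
    int partition = Math.abs(record.hashCode() % 20);

    // Each distinct base output path gets its own RecordWriter, i.e. its own
    // output file and file descriptor, held until cleanup() runs.
    mos.write(new Text(record), new Text(""), "part" + partition);
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Closing MultipleOutputs releases all of the per-partition writers.
    mos.close();
  }
}

With 50 such map tasks and 20 possible partition values, the worst case is the 50 x 20 = 1000 files the quoted benchmark warns about.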

The problem then becomes: during a spill operation, most of these files are being held open by each mapper as output is cheerfully marshalled by split. Hence the available file descriptors problem.
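The spill side of this can at least be bounded in configuration. Below is a hedged sketch using the long-standing io.sort.* keys; treat the exact key names as an assumption to verify against your cluster's mapred-default.xml, since 0.21 also carries renamed mapreduce.task.io.sort.* equivalents, and the values shown are purely illustrative.

// Hedged sketch: limiting spill-time descriptor pressure via job configuration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
  public static Job configure(Configuration conf) throws Exception {
    // Larger in-memory sort buffer => fewer spill files per map task.
    conf.setInt("io.sort.mb", 200);     // illustrative value, in MB
    // Merge width: at most this many spill files (and their descriptors)
    // are open at once while a task merges its spills.
    conf.setInt("io.sort.factor", 10);
    return new Job(conf, "fd-bounded-job");  // job name is illustrative
  }
}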

Thus, the currently assumed maximum file descriptor limit should be:

Map phase: number of mappers * total partitions possible

Reduce phase: number of reduce operations * total partitions possible

And that, as we say, is that.
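As a quick sanity check, the two bounds can be worked through with the figures from the quoted benchmark (50 mappers, 20 possible partitions). The reduce count and the 1024-descriptor limit below are illustrative assumptions, and whether the pressure lands on a single task JVM or a whole node depends on how tasks are scheduled, so treat this as back-of-the-envelope arithmetic only.

// Back-of-the-envelope check of the two bounds above.
public class FdEstimate {
  public static void main(String[] args) {
    int mapTasks = 50;                 // from the quoted benchmark
    int reduceTasks = 10;              // illustrative assumption
    int totalPartitionsPossible = 20;  // from the quoted benchmark
    int descriptorLimit = 1024;        // a common default ulimit -n; illustrative

    int mapPhaseWorstCase = mapTasks * totalPartitionsPossible;       // 1000
    int reducePhaseWorstCase = reduceTasks * totalPartitionsPossible; // 200

    System.out.println("map phase worst case:    " + mapPhaseWorstCase);
    System.out.println("reduce phase worst case: " + reducePhaseWorstCase);
    System.out.println(mapPhaseWorstCase <= descriptorLimit
        ? "fits under the illustrative 1024-descriptor limit"
        : "exceeds the illustrative 1024-descriptor limit");
  }
}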
