Hadoop: Does using CombineFileInputFormat for small files give a performance improvement?

Question

I am new to Hadoop and am performing some tests on my local machine.

There are many solutions for dealing with lots of small files. I am using CombinedInputFormat, which extends CombineFileInputFormat.

I see that the number of mappers has changed from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain now that the number of mappers has been reduced?

I ran the map-reduce job on many small files without CombinedInputFormat: 100 mappers took 10 minutes.

But when the map-reduce job was executed with CombinedInputFormat, 25 mappers took 33 minutes.

Any help will be appreciated.

Solution

Hadoop performs better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "huge number" means in the thousands.)

That means if you have 1000 files of 1 MB each, a map-reduce job based on the normal TextInputFormat will create 1000 map tasks, and each of these map tasks needs a certain amount of time to start and end. This latency in task creation can reduce the performance of the job.
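To make the task-count difference concrete, here is a minimal, self-contained sketch. It is not Hadoop code: the class `SplitEstimate` and its methods are hypothetical, and the 128 MB split cap is an assumed value. It models one split per small file versus a simple greedy packing of files into splits, and it ignores the node and rack locality that a real combining input format would also consider.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the Hadoop API.
class SplitEstimate {

    // One split (and therefore one map task) per file, which is what happens
    // when every file is smaller than a block and files are not combined.
    static int splitsPerFile(long[] fileSizes) {
        return fileSizes.length;
    }

    // Greedy packing: keep adding files to the current split and start a new
    // split when the next file would push it past maxSplitBytes.
    // Simplification: locality of the underlying blocks is ignored.
    static int combinedSplits(long[] fileSizes, long maxSplitBytes) {
        int splits = 0;
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxSplitBytes) {
                splits++;      // close the current split
                current = 0;
            }
            current += size;
        }
        return current > 0 ? splits + 1 : splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[1000];
        Arrays.fill(files, mb);        // 1000 files of 1 MB each
        long maxSplit = 128 * mb;      // assumed split cap, not from the answer

        System.out.println("per-file splits: " + splitsPerFile(files));
        System.out.println("combined splits: " + combinedSplits(files, maxSplit));
    }
}
```

Under these assumptions the 1000 files yield 1000 splits when handled one per file and 8 splits when packed, so the per-task start and end cost is paid far fewer times.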

In a multi-tenant cluster with resource limits, getting a large number of map slots will also be difficult.
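The effect of limited slots can be sketched with simple arithmetic. The class `MapWaves` below is hypothetical, and the slot count (20) and per-task overhead (2 seconds) are made-up illustration values, not measurements or figures from the answer. It only shows how the number of scheduling rounds and the summed startup cost scale with the task count.

```java
// Hypothetical helper for illustration only; not part of the Hadoop API.
class MapWaves {

    // Number of scheduling rounds ("waves") needed when at most `slots`
    // map tasks can run at the same time.
    static int waves(int tasks, int slots) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        return (tasks + slots - 1) / slots;   // ceiling division
    }

    // Startup and teardown time summed over all tasks, in seconds.
    static long totalOverheadSeconds(int tasks, long overheadPerTaskSeconds) {
        return tasks * overheadPerTaskSeconds;
    }

    public static void main(String[] args) {
        int slots = 20;          // assumed number of available map slots
        long overhead = 2;       // assumed per-task overhead in seconds

        for (int tasks : new int[] {1000, 8}) {
            System.out.println(tasks + " tasks: "
                    + waves(tasks, slots) + " wave(s), "
                    + totalOverheadSeconds(tasks, overhead) + " s total task overhead");
        }
    }
}
```

With these assumed values, 1000 tasks need 50 waves and 8 tasks fit in a single wave. Real job time also depends on how much work each task does and on how many tasks can actually run in parallel, which this sketch does not model.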

Please refer to this link for more details and benchmark results.
