Hadoop: Does using CombineFileInputFormat for small files give a performance improvement?
Question
I am new to Hadoop and am performing some tests on my local machine.

There have been many solutions to deal with many small files. I am using a CombinedInputFormat which extends CombineFileInputFormat. I see that the number of mappers has changed from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain, since the number of mappers has been reduced?

I have performed the map-reduce job on many small files without CombinedInputFormat: 100 mappers took 10 minutes.

But when the map-reduce job was executed with CombinedInputFormat: 25 mappers took 33 minutes.

Any help will be appreciated.

Answer

Hadoop performs better with a small number of large files, as opposed to a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "huge number" means ranging into the thousands.)

That means that if you have 1000 files of 1 MB each, a map-reduce job based on the normal TextInputFormat will create 1000 map tasks, each of which requires a certain amount of time to start and end. This latency in task creation can reduce the performance of the job.

In a multi-tenant cluster with resource limitations, getting a large number of map slots will also be difficult.

Please refer to this link for more details and benchmark results.
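The mapper-count arithmetic behind the question can be sketched without a cluster. The snippet below is a hedged illustration, not real Hadoop code: it assumes each small file becomes one split under TextInputFormat (true when files are smaller than a block), and that a CombineFileInputFormat-style packing groups whole files into splits up to a maximum size. The 4 MB max split size is an assumption chosen to reproduce the 100-to-25 drop described in the question; on a real job the limit would come from the combined input format's max split size setting (e.g. the `mapreduce.input.fileinputformat.split.maxsize` property).

```java
// Back-of-envelope sketch of mapper counts (assumptions noted above).
public class SplitCountSketch {
    // Number of combined splits when whole files are packed into splits
    // of at most maxSplitSizeBytes (ceiling division over files per split).
    static long combinedSplits(long numFiles, long fileSizeBytes, long maxSplitSizeBytes) {
        long filesPerSplit = Math.max(1, maxSplitSizeBytes / fileSizeBytes);
        return (numFiles + filesPerSplit - 1) / filesPerSplit;
    }

    public static void main(String[] args) {
        long numFiles = 100;
        long fileSize = 1L * 1024 * 1024;   // 1 MB per small file
        long maxSplit = 4L * 1024 * 1024;   // hypothetical 4 MB max split size

        // TextInputFormat: one map task per file (each file is under a block)
        System.out.println("TextInputFormat mappers: " + numFiles);
        // Combined packing: 4 files per split -> 25 mappers
        System.out.println("Combined mappers: "
                + combinedSplits(numFiles, fileSize, maxSplit));
    }
}
```

Fewer mappers means less per-task startup latency, but as the timings in the question show, it is not automatically faster: each combined mapper now processes more data, so the right split size depends on the cluster's available slots and per-task overhead.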