How can I work with a large number of small files in Hadoop?


Problem description

I am new to Hadoop and I'm working with a large number of small files in the WordCount example. It takes a lot of map tasks and slows down my execution.



How can I reduce the number of map tasks?



If the best solution to my problem is concatenating the small files into a larger file, how can I concatenate them?

Solution

If you're using something like TextInputFormat, the problem is that each file has at least one split, so the upper bound on the number of maps is the number of files; in your case, where you have many very small files, you will end up with many mappers, each processing very little data.

To remedy that, you should use CombineFileInputFormat, which will pack multiple files into the same split (I think up to the block size limit), so with that format the number of mappers will be independent of the number of files; it will simply depend on the amount of data.
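If you also want to bound how much data goes into each combined split, you can cap the maximum split size in the job driver. A small sketch, assuming the Hadoop 2.x mapreduce API (the 128 MB value is only an example):

// org.apache.hadoop.mapreduce.lib.input.FileInputFormat, in the job driver:
// caps each combined split at roughly 128 MB
// (sets mapreduce.input.fileinputformat.split.maxsize under the hood)
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);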



You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined (let's call it CombinedInputFormat, as in the link), you can tell your job to use it by doing:

  job.setInputFormatClass(CombinedInputFormat.class); 
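For reference, here is a minimal sketch of what such a CombinedInputFormat could look like, assuming the Hadoop 2.x mapreduce API and plain text input; the nested CombinedLineRecordReader is an illustrative name (the implementation behind the link may differ in details). It simply delegates each file inside the combined split to the standard LineRecordReader:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader walks over the files packed into the combined
        // split and creates one CombinedLineRecordReader per file.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, CombinedLineRecordReader.class);
    }

    // Reads the idx-th file of a CombineFileSplit by delegating to LineRecordReader.
    public static class CombinedLineRecordReader extends RecordReader<LongWritable, Text> {
        private final CombineFileSplit combineSplit;
        private final int index;
        private final LineRecordReader delegate = new LineRecordReader();

        // The (CombineFileSplit, TaskAttemptContext, Integer) constructor signature
        // is required by CombineFileRecordReader, which instantiates it via reflection.
        public CombinedLineRecordReader(CombineFileSplit split, TaskAttemptContext context,
                Integer index) {
            this.combineSplit = split;
            this.index = index;
        }

        @Override
        public void initialize(InputSplit ignored, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Build a plain FileSplit for our file inside the combined split and
            // let the standard line reader handle it.
            FileSplit fileSplit = new FileSplit(
                    combineSplit.getPath(index),
                    combineSplit.getOffset(index),
                    combineSplit.getLength(index),
                    combineSplit.getLocations());
            delegate.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

With that in place, each mapper reads several of the small files, so the WordCount job should launch far fewer map tasks.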


