Processing large set of small files with Hadoop


Problem Description

I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess it is because the cost of setting up and tearing down the job is far greater than the job itself. Such small files also deplete the namespace available for file names.

I have read that in this case I should use a Hadoop Archive (HAR), but I am not sure how to modify the WordCount program to read from such an archive. Can the program continue to work without modification, or is some modification necessary?
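
For reference, the stock WordCount driver does not need new mapper or reducer code to read from a HAR: a Hadoop Archive is exposed through the har:// filesystem scheme and can be passed as an ordinary input path. The sketch below shows the idea; the archive and output paths are hypothetical, and it assumes the archive was already created with the hadoop archive command. Note that each archived file still becomes its own input split, so this alone does not reduce the number of mappers.

```java
// Minimal sketch: pointing the stock WordCount mapper/reducer at a HAR.
// Requires the hadoop-mapreduce-examples jar on the classpath; the
// har:// and output paths below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountOnHar {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-on-har");
        job.setJarByClass(WordCountOnHar.class);
        // Reuse the example's mapper/combiner/reducer unchanged.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The archive is addressed via the har:// scheme; each archived file
        // still produces at least one input split.
        FileInputFormat.addInputPath(job, new Path("har:///user/hadoop/pages.har"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/wordcount-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```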

Even if I pack many files into archives, the question remains whether this will improve performance. I have read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.

If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.

Solution

Using HDFS won't change the fact that you are causing Hadoop to handle a large quantity of small files. The best option in this case is probably to cat the files into a single (or a few large) file(s). This will reduce the number of mappers you have, which will reduce the number of things that need to be processed.
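
The cat step can be done on the local filesystem before uploading, or directly against HDFS with the FileSystem API. Below is a minimal sketch of the HDFS variant, assuming the small pages already sit under a single (hypothetical) directory; it simply streams every file into one large output file. Hadoop's hadoop fs -getmerge command achieves much the same thing via a local intermediate file; either way, the WordCount code itself stays untouched.

```java
// Minimal sketch of the "cat into one big file" suggestion, done on HDFS.
// All paths are hypothetical; adjust to your own layout.
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path srcDir = new Path("/user/hadoop/pages");                  // many 2-3 kB files
        Path merged = new Path("/user/hadoop/pages-merged/pages.txt"); // one big file
        try (OutputStream out = fs.create(merged)) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                if (!status.isFile()) {
                    continue;                                // skip subdirectories
                }
                try (InputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false); // append this page
                }
                out.write('\n');                             // keep pages line-separated
            }
        }
    }
}
```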



Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance; the limitation is the machine.



When you operate on a large number of small files, you need a large number of mappers and reducers. The setup/teardown can be comparable to the processing time of the files themselves, causing a large overhead. cat-ing the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.



The benefit you could see from using HDFS to store the files would come in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default) across machines, and each machine would be able to process the blocks of data that reside on it. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
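
If you want to see this block layout for yourself, the FileSystem API can report where each block of a file lives. A small sketch, with a hypothetical path:

```java
// Print the offset, length and hosting machines of each block of a file.
// The path is hypothetical; block size defaults to 64 MB on older Hadoop
// releases and 128 MB on newer ones.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/pages-merged/pages.txt"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```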



Archiving the files, if Hadoop is going to unarchive them, will just result in Hadoop still having a large number of small files.



Hope this helps your understanding.


