Hadoop put performance - large file (20GB)
Problem Description
I'm using hdfs -put to load a large 20GB file into HDFS. Currently the process runs in about 4 minutes. I'm trying to improve the write time of loading data into HDFS. I tried different block sizes to improve write speed, but got the results below:
512M blocksize = 4mins;
256M blocksize = 4mins;
128M blocksize = 4mins;
64M blocksize = 4mins;
Does anyone know what the bottleneck could be, and what other options I could explore to improve the performance of the -put command?
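For reference, a per-command block-size override is one way runs like those above could have been driven. This is a sketch using the standard Hadoop -D generic option; the source and target paths are hypothetical, and the hdfs invocation is shown commented out so the snippet is self-contained:

```shell
# Per-put block-size override via Hadoop's -D generic option.
# dfs.blocksize is given in bytes: 256 * 1024 * 1024 = 268435456.
BLOCK=$((256 * 1024 * 1024))
echo "dfs.blocksize=$BLOCK"

# Hypothetical paths; run on a machine with the hadoop client installed:
# hdfs dfs -D dfs.blocksize=$BLOCK -put /data/bigfile.dat /user/hadoop/
```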
Recommended Answer
The core problem is that 20GB is a decent amount of data, and that data is getting pushed into HDFS as a single stream. You are limited by disk I/O, which is pretty lame given that you have a large number of disks in a Hadoop cluster. You've got a while to go before you saturate a 10GigE network (and probably a 1GigE, too).
Changing block size shouldn't change this behavior, as you saw. It's still the same amount of data moving off disk into HDFS.
I suggest you split the file up into 1GB files and spread them over multiple disks, then push them up with -put in parallel. You might even want to consider splitting these files over multiple nodes if the network becomes a bottleneck. Can you change the way you receive your data to make this faster? Obviously, splitting the file and moving it around takes time, too.
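A minimal sketch of the split-and-parallel-put idea. The paths, piece size, and parallelism are illustrative assumptions, and the demo uses a small stand-in file so it runs anywhere; for the real 20GB source you would use `split -b 1G` on the actual file:

```shell
# Split a file into fixed-size pieces, then upload the pieces in parallel.
WORK=$(mktemp -d)
SRC="$WORK/bigfile.dat"

# 5MB stand-in; in the real case this is the 20GB source file.
dd if=/dev/zero of="$SRC" bs=1M count=5 2>/dev/null

# -b 1M: 1MB pieces for the demo (use -b 1G for real 1GB pieces);
# -d: numeric suffixes (part00, part01, ...).
split -b 1M -d "$SRC" "$WORK/part"
ls "$WORK"/part*

# Push up to 4 pieces at a time (hypothetical HDFS target directory):
# ls "$WORK"/part* | xargs -n 1 -P 4 -I{} hdfs dfs -put {} /user/hadoop/incoming/
```

Many Hadoop jobs will happily take the resulting directory of part files as input, so reassembling them on the HDFS side is often unnecessary.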