Optimal Block Size for a hadoop Cluster
Problem Description
I am working on a four-node Hadoop cluster. I have run a series of experiments with the following block sizes and measured the run times.
All of them were performed on a 20GB input file:
64MB - 32 min
128MB - 19 min
256MB - 15 min
1GB - 12.5 min
Should I proceed further with a 2GB block size? Also, kindly explain what an optimal block size would be if similar operations are performed on a 90GB file. Thanks!
You should test with 2GB and compare the results.
Just consider the following trade-off: a larger block size minimizes the overhead of creating map tasks, but for non-local tasks Hadoop has to transfer the whole block to the remote node (network bandwidth is the limiting factor here), so a smaller block size performs better in that case.
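To put rough numbers on that scheduling overhead, here is a quick back-of-the-envelope sketch, assuming the common default of one map task per HDFS block (the helper name `num_map_tasks` is illustrative, not a Hadoop API):

```python
import math

GB = 1024  # MB per GB

def num_map_tasks(file_size_mb, block_size_mb):
    """One map task per block, rounding up for the final partial block."""
    return math.ceil(file_size_mb / block_size_mb)

# Compare the task counts you are implicitly choosing between
for block_mb in (64, 128, 256, 1024, 2048):
    print(f"{block_mb:>5} MB block -> "
          f"{num_map_tasks(20 * GB, block_mb):>4} tasks (20GB file), "
          f"{num_map_tasks(90 * GB, block_mb):>5} tasks (90GB file)")
```

Going from 64MB to 1GB blocks cuts the 20GB job from 320 map tasks to 20, which is where most of your observed speed-up comes from; for the 90GB file, even 1GB blocks still yield 90 tasks, enough to keep 4 nodes busy.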
In your case, with 4 nodes (which I assume are connected by a switch or router on a local LAN), 2GB isn't a problem. But the same answer doesn't hold in other environments with higher error rates.
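If you do experiment further, the cluster-wide default block size can be set in `hdfs-site.xml`; the fragment below is a sketch using the Hadoop 2.x property name (older 1.x releases used `dfs.block.size` instead), with 256MB shown as an example value:

```xml
<!-- hdfs-site.xml: default block size for newly written files (256 MB = 268435456 bytes) -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```

Note this only affects files written after the change; existing files keep their block size, so to test you would re-upload the input, or override per file at write time with something like `hdfs dfs -D dfs.blocksize=268435456 -put ...`.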