Map Job Performance on a Cluster
Question
Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second has a replication factor of 3. If I run my map job, should I expect any change in the performance or execution time of the map job?
In other words, how does replication affect the performance of the mapper on a cluster?
When the JobTracker assigns map tasks to TaskTrackers, it picks a node for each task based on the locality of the data in HDFS (the preference is the same node first, then the same network switch/rack). A lower replication factor limits the JobTracker's ability to pick a node that is local to the data (it will still assign the tasks to nodes, but without the benefit of locality). The effect is to restrict the number of TaskTracker nodes that are local to the data (either on the task node itself or on the same switch/rack), which hurts performance by reducing how much of your work can run in parallel on local data.
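To make the effect concrete, here is a toy simulation of the question's setup (15 blocks, 5 nodes). It is only a sketch: the uniform random replica placement, the one-slot-per-node default, and the greedy "prefer a local block" rule are simplifying assumptions, not Hadoop's actual placement policy or scheduler, and the function names `place_blocks` and `schedule` are made up for illustration.

```python
import random

def place_blocks(num_blocks, num_nodes, replication, rng):
    """Put each block's replicas on distinct nodes chosen uniformly at random."""
    return [set(rng.sample(range(num_nodes), replication))
            for _ in range(num_blocks)]

def schedule(placement, num_nodes, slots_per_node=1):
    """Assign one map task per block, wave by wave.

    Each free slot takes a pending block stored on its own node if one
    exists, otherwise any pending block (a remote read).
    Returns (local_tasks, remote_tasks).
    """
    pending = set(range(len(placement)))
    local = remote = 0
    while pending:
        for node in range(num_nodes):
            for _ in range(slots_per_node):
                if not pending:
                    break
                on_node = [b for b in pending if node in placement[b]]
                block = min(on_node) if on_node else min(pending)
                pending.remove(block)
                if on_node:
                    local += 1
                else:
                    remote += 1
    return local, remote

if __name__ == "__main__":
    for replication in (1, 3):
        placement = place_blocks(15, 5, replication, random.Random(42))
        local, remote = schedule(placement, num_nodes=5)
        print(f"replication={replication}: {local} local, {remote} remote")
```

Running it with a few different seeds should give a feel for how often a slot has to fall back to a remote block at each replication factor; the exact counts depend on the random placement.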
Your smaller cluster likely has a single switch, so all data is local to the switch/rack. The only bottleneck you might experience is data transfer from one TaskTracker node to another, since the JobTracker is likely to assign tasks to all available TaskTrackers.
But on a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes that are local to the data and therefore able to operate on it efficiently.
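A rough back-of-the-envelope version of this argument: if a block's replicas sit on `replication` distinct nodes out of `num_nodes`, and `idle_nodes` of the nodes currently have a free slot, the chance that at least one idle node holds a replica follows from a hypergeometric count. The sketch below assumes replicas and idle nodes are both spread uniformly at random, which real HDFS placement (rack-aware) does not guarantee; `prob_local_slot` is a hypothetical helper name.

```python
from math import comb

def prob_local_slot(num_nodes, replication, idle_nodes):
    """Probability that at least one of the idle nodes stores a replica.

    Assumes the replicas occupy `replication` distinct nodes and the idle
    nodes are a uniformly random subset of the cluster.
    """
    # Complement: every idle node is drawn from the nodes WITHOUT a replica.
    none_local = comb(num_nodes - replication, idle_nodes) / comb(num_nodes, idle_nodes)
    return 1.0 - none_local

if __name__ == "__main__":
    for num_nodes in (5, 100):
        for replication in (1, 3):
            p = prob_local_slot(num_nodes, replication, idle_nodes=1)
            print(f"nodes={num_nodes} replication={replication}: p(local)={p:.3f}")
```

Under these assumptions, with a single idle node the probability is simply `replication / num_nodes`, so the gap between replication 1 and 3 matters more as the cluster grows and each block becomes local to a smaller share of the nodes.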
Several papers support the importance of data locality: http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf; the paper you cited, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf, which also supports data locality; and http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5-node cluster, the same size as the OP's).
This paper reports significant benefits from data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that increasing the replication factor gives better locality.
Note that this paper claims little difference between network throughput and local disk access (8%), http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders of magnitude difference in performance between local memory access and either disk or network access. Furthermore, the paper states that a large fraction of jobs (64%) find their data cached in memory "in large part due to the heavy-tailed nature of the workload", as most jobs "access only a small fraction of the blocks".