Map Job Performance on cluster


Problem Description



Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second one has a replication factor of 3. If I run my map job, should I expect any change in the performance or the execution time of the map job?

In other words, how does replication affect the performance of the mapper on a cluster?

Solution

When the JobTracker assigns a job to a TaskTracker on HDFS, tasks are assigned to particular nodes based upon data locality (the preference is the same node first, then a node on the same network switch/frame). A lower replication factor limits the JobTracker's ability to assign a task to a node that is local to its data (the JobTracker will still assign task nodes, but without the benefit of locality). The effect is to restrict the number of TaskTracker nodes that are local to the data (either the data is on the task node itself, or on the same switch/frame), thus affecting the performance of your job (reducing parallelization).
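To make the locality point concrete, here is a minimal sketch using the Hadoop Java filesystem API that lists which hosts hold each block of an input file; the class name and input path are hypothetical. With a replication factor of 1 each block reports a single host, so only one node can run that map task data-locally; with a factor of 3 the scheduler has three candidates per block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.Arrays;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input file; substitute one of your own job's input files.
        Path input = new Path("/user/me/job-input/part-00000");

        FileStatus status = fs.getFileStatus(input);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // With replication = 1 each block reports one host, so only one
            // TaskTracker can run that map task node-locally; with replication = 3
            // the scheduler has three candidate hosts per block.
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```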

Your smaller cluster likely has a single switch, so data is local to the network/frame, so the only bottleneck you might experience would be data transfer from one TaskTracker to another, as the JobTracker is likely to assign jobs to all available TaskTrackers.
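One way to check how much locality you actually got is to read the job's locality counters after it finishes. This is only a sketch, assuming the newer org.apache.hadoop.mapreduce API and a job object you already configured; under classic MRv1 (JobTracker/TaskTracker) the same counters are visible in the JobTracker web UI.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
    // Call this after job.waitForCompletion(true) on your already-configured Job.
    public static void printLocality(Job job) throws Exception {
        Counters counters = job.getCounters();
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long launched  = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();

        // Maps that were neither data-local nor rack-local had to pull their
        // input block across the network, which is where the extra time goes.
        System.out.println("data-local maps: " + dataLocal
                + ", rack-local maps: " + rackLocal
                + ", total launched maps: " + launched);
    }
}
```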

But on a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes that are local to the data and thus able to operate on your data efficiently.
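If you control the cluster, one option is to raise the replication factor of the job's input before running it, so more TaskTrackers hold a local copy of each block. The sketch below assumes a hypothetical input directory, that HDFS has spare capacity for the extra copies, and that you can wait for re-replication to finish before launching the job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical input directory for the job.
        Path inputDir = new Path("/user/me/job-input");

        // setReplication applies per file, so walk the directory's children.
        for (FileStatus file : fs.listStatus(inputDir)) {
            if (file.isFile()) {
                // Ask HDFS to keep three copies of every block, giving the
                // scheduler more node-local TaskTrackers to choose from
                // (at the cost of extra storage and re-replication traffic).
                fs.setReplication(file.getPath(), (short) 3);
            }
        }
        fs.close();
    }
}
```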

There are several papers which support data locality: http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf; the paper you cited, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf, which also supports data locality; and http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5-node cluster, the same as the OP's).

This paper quotes significant benefits to data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that an increase in replication factor gives better locality.

Note that this paper claims little difference (8%) between network throughput and local disk access, http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders-of-magnitude differences in performance between local memory access and either disk or network access. Furthermore, the paper notes that a large fraction of jobs (64%) find their data cached in memory "in large part due to the heavy-tailed nature of the workload", as most jobs "access only a small fraction of the blocks".

