What does CPU Time for a Hadoop Job signify?


Question

I am afraid I do not understand the timing results of a Map-Reduce job. For example, a job I am running gives me the following results from the job tracker.

Finished in: 1mins, 39sec

CPU time spent (ms) 150,460 152,030 302,490

The entries in CPU time spent (ms) are for Map, Reduce, and Total respectively. But how is "CPU time spent" measured, and what does it signify? Is it the total cumulative time spent in all of the mappers and reducers assigned to the job? Is it possible to measure other times from the framework, such as the time for shuffle, sort, partition, etc.? If so, how?

A second question bothers me. I have seen some posts here (Link1, Link2) that suggest using getTime() in the driver class:

import java.util.Date;

// Wall-clock time around the whole job, measured in the driver.
long start = new Date().getTime();
boolean status = job.waitForCompletion(true);
long end = new Date().getTime();
System.out.println("Job took " + (end - start) + " milliseconds");

Is this not what the first entry in the Job Tracker output already provides? Is it necessary? What is the best way to time a Hadoop job, especially when I want to measure I/O time and compute time per node or per stage?

Solution

The map phase consists of: record reader, map, combiner, and partitioner.

The reduce phase consists of: shuffle, sort, reduce, output.

The CPU time you are seeing is for the entire map phase and the entire reduce phase, not just the map and reduce functions themselves. The terminology is confusing because the map function and the reduce function are only a portion of the map phase and the reduce phase. It is the total CPU time across all of the nodes in the cluster.
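To make the "total across the cluster" idea concrete, here is a minimal sketch in plain Java (no Hadoop dependency). It assumes the phase figure is simply the sum of the CPU milliseconds of every task in that phase. The class name, the helper phaseCpuMillis, and the per-task values are all made up for illustration; only the last check uses the numbers from the question.

```java
import java.util.Arrays;

class CpuTimeTotals {
    // Sum the CPU milliseconds reported by every task of one phase.
    static long phaseCpuMillis(long[] perTaskCpuMillis) {
        return Arrays.stream(perTaskCpuMillis).sum();
    }

    public static void main(String[] args) {
        // Hypothetical per-task values, invented for illustration only.
        long[] mapTasks = {40_000L, 35_000L, 25_000L};
        long[] reduceTasks = {30_000L, 20_000L};

        long map = phaseCpuMillis(mapTasks);
        long reduce = phaseCpuMillis(reduceTasks);
        System.out.println("Map=" + map + " Reduce=" + reduce + " Total=" + (map + reduce));

        // The row from the question: the Total column is Map + Reduce.
        System.out.println(150_460L + 152_030L == 302_490L);
    }
}
```

The last line prints true: in the question's output, 150,460 + 152,030 is exactly 302,490.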

CPU time is hugely different from real time. CPU time is how much time something spent on the CPUs, while real time is what you and I experience as humans. Think about this: assume you run the same job over the same data, first on a 20-node cluster and then on a 200-node cluster. Overall, roughly the same amount of CPU time will be used on both clusters, but the 200-node cluster can finish up to about 10x faster in real time, provided the job parallelizes well. CPU time is a useful metric when you have a shared system with lots of jobs running on it at the same time.
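The arithmetic behind that comparison can be sketched as follows. This is an idealized model, not something Hadoop computes: it assumes the CPU work splits evenly over single-core nodes with no scheduling, I/O, or shuffle overhead. The class and method names are hypothetical; the inputs are the Total CPU time and the "Finished in" time from the question.

```java
class WallClockEstimate {
    // Average number of cores kept busy: CPU time divided by elapsed time.
    static double averageBusyCores(long cpuMillis, long wallMillis) {
        return (double) cpuMillis / wallMillis;
    }

    // Idealized elapsed time if the CPU work split evenly over `nodes`
    // single-core nodes with no overhead at all (an assumption).
    static double idealWallMillis(long cpuMillis, int nodes) {
        return (double) cpuMillis / nodes;
    }

    public static void main(String[] args) {
        long cpuMillis = 302_490L;  // Total CPU time from the question
        long wallMillis = 99_000L;  // "Finished in: 1mins, 39sec"

        System.out.printf("average busy cores: %.2f%n",
                averageBusyCores(cpuMillis, wallMillis));

        double on20 = idealWallMillis(cpuMillis, 20);
        double on200 = idealWallMillis(cpuMillis, 200);
        System.out.printf("20 nodes: %.1f ms, 200 nodes: %.1f ms, speedup: %.1fx%n",
                on20, on200, on20 / on200);
    }
}
```

For the job in the question, about 302 seconds of CPU time were used in 99 seconds of real time, so on average roughly three cores were busy. A real cluster will not reach the ideal 10x figure, since startup, shuffle, and I/O costs do not shrink with the node count.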

I don't know how you would dive deeper to get CPU time in each phase. Using a date timer is probably not what you are looking for though.
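To see why a wall-clock timer in the driver does not tell you CPU time, here is a small JVM-only sketch using the standard java.lang.management.ThreadMXBean. It only illustrates the difference between the two clocks on a single thread; it is not how Hadoop collects its counters, and the class and method names are made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class CpuVsWallClock {
    // Returns {wallMillis, cpuMillis} for running `work` on the current thread.
    // If the JVM cannot report thread CPU time, cpuMillis is reported as 0.
    static long[] measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();

        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        work.run();
        long cpuEnd = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        long wallEnd = System.nanoTime();

        return new long[] {
            (wallEnd - wallStart) / 1_000_000L,
            (cpuEnd - cpuStart) / 1_000_000L
        };
    }

    public static void main(String[] args) {
        // Sleeping consumes wall-clock time but almost no CPU time.
        long[] sleeping = measure(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("sleep: wall=" + sleeping[0] + " ms, cpu=" + sleeping[1] + " ms");
    }
}
```

A timer around job.waitForCompletion(true) behaves like the wall value here: it includes waiting, scheduling, and I/O, which is why it lines up with "Finished in" rather than with "CPU time spent".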
