What does CPU Time for a Hadoop Job signify?

Question

I am afraid I do not understand the timing results of a MapReduce job. For example, a job I am running gives me the following results from the job tracker.

Finished in: 1mins, 39sec

CPU time spent (ms)    150,460    152,030    302,490

The entries in CPU time spent (ms) are for Map, Reduce and Total respectively. But how is "CPU time spent" measured, and what does it signify? Is it the total cumulative time spent in each of the mappers and reducers assigned to the job? Is it possible to measure other times from the framework, such as the time for shuffle, sort, partition, etc.? If so, how?

A second question that bothers me: I have seen some posts here (Link1, Link2) which suggest using getTime() in the driver class:

long start = new Date().getTime();
boolean status = job.waitForCompletion(true);
long end = new Date().getTime();
System.out.println("Job took " + (end - start) + " milliseconds");

Isn't this doing what the first entry in the job tracker output already provides? Is it necessary? What is the best way to time a Hadoop job, especially when I want to measure I/O time and compute time per node or per stage?

Recommended answer

The map phase consists of: record reader, map, combiner, and partitioner.

The reduce phase consists of: shuffle, sort, reduce, output.

The CPU time you are seeing there covers the entire map phase and the entire reduce phase, not just the functions themselves. The terminology is confusing because the map function and the reduce function are only a portion of the map phase and the reduce phase. It is the total CPU time across all of the nodes in the cluster.
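As a rough illustration of "total CPU time across all nodes", the figures quoted in the question can be checked with simple arithmetic. The sketch below is only an assumption-laden illustration: the class and method names (CpuTimeSummary, totalCpuMs, averageBusyCores) are made up for this example and are not part of Hadoop, and the "average busy cores" figure is just total CPU time divided by elapsed time, which ignores scheduling gaps and job setup.

```java
// Hypothetical helper, not a Hadoop API: re-derives the job tracker totals
// from the numbers quoted in the question.
final class CpuTimeSummary {
    static final long MAP_CPU_MS = 150_460L;      // "CPU time spent (ms)", Map column
    static final long REDUCE_CPU_MS = 152_030L;   // Reduce column
    static final long WALL_CLOCK_MS = (1 * 60 + 39) * 1000L; // "Finished in: 1mins, 39sec"

    // The Total column is the sum of the map-phase and reduce-phase CPU time.
    static long totalCpuMs(long mapCpuMs, long reduceCpuMs) {
        return mapCpuMs + reduceCpuMs;
    }

    // CPU-milliseconds consumed per elapsed millisecond, i.e. how many cores
    // were busy on average while the job ran (a coarse estimate).
    static double averageBusyCores(long totalCpuMs, long wallClockMs) {
        return (double) totalCpuMs / wallClockMs;
    }

    public static void main(String[] args) {
        long total = totalCpuMs(MAP_CPU_MS, REDUCE_CPU_MS);
        System.out.println("Total CPU time (ms): " + total);
        System.out.println("Average busy cores: " + averageBusyCores(total, WALL_CLOCK_MS));
    }
}
```

With these numbers the total comes to 302,490 ms, matching the third column, and the job used about three CPU-seconds for every second of elapsed time. That is only possible because the work was spread over several tasks running at once.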

CPU time is hugely different from real time. CPU time is how much time something spent on the CPUs, while real (wall-clock) time is what you and I experience as humans. Think about this: assume the same job runs over the same data, first on a 20-node cluster, then on a 200-node cluster. Overall, roughly the same amount of CPU time will be used on both clusters, but the 200-node cluster will ideally finish about 10x faster in real time. CPU time is a useful metric when you have a shared system with lots of jobs running on it at the same time.
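The 20-node versus 200-node comparison can be written down as a tiny model. This is a deliberately idealized sketch: it assumes the work splits perfectly evenly, that each node contributes one fully used core, and that there is no scheduling, shuffle, or startup overhead, none of which holds exactly on a real cluster. The name IdealScaling is invented for the example.

```java
// Idealized model only: elapsed time = total CPU time / number of nodes.
final class IdealScaling {
    // Best-case elapsed time if totalCpuMs of work is spread evenly over
    // `nodes` nodes, each contributing one fully used core (an assumption).
    static double idealWallClockMs(long totalCpuMs, int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        return (double) totalCpuMs / nodes;
    }

    public static void main(String[] args) {
        long totalCpuMs = 302_490L; // total CPU time quoted in the question
        double on20 = idealWallClockMs(totalCpuMs, 20);
        double on200 = idealWallClockMs(totalCpuMs, 200);
        // CPU time is identical in both cases; only the elapsed time changes.
        System.out.println("20 nodes, ideal elapsed ms:  " + on20);
        System.out.println("200 nodes, ideal elapsed ms: " + on200);
        System.out.println("Speed-up: " + (on20 / on200));
    }
}
```

In this model the CPU time is an input that never changes, which is why it is useful for accounting on a shared cluster, while the elapsed time depends entirely on how many nodes happen to be available.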

I don't know how you would dive deeper to get CPU time for each phase. Using a date timer is probably not what you are looking for, though.
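To see why a timer in the driver measures something different from CPU time, here is a small stand-alone sketch using only the JDK. It times a block of work both by wall clock (System.nanoTime) and by the CPU time of the current thread (ThreadMXBean). The class WallVsCpuTimer is hypothetical, and the Thread.sleep call is merely a stand-in for a driver that blocks while the cluster does the work; it does not measure anything on the cluster itself.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical demo: wall-clock time vs. CPU time of the calling thread.
final class WallVsCpuTimer {
    // Returns {wallClockNanos, cpuNanos}; cpuNanos is -1 when the JVM
    // cannot report per-thread CPU time.
    static long[] measure(Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuAvailable = threads.isCurrentThreadCpuTimeSupported()
                && threads.isThreadCpuTimeEnabled();
        long cpuStart = cpuAvailable ? threads.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();
        work.run();
        long wallNanos = System.nanoTime() - wallStart;
        long cpuNanos = cpuAvailable ? threads.getCurrentThreadCpuTime() - cpuStart : -1L;
        return new long[] { wallNanos, cpuNanos };
    }

    // Stand-in for a blocking call such as waiting on a remote job.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] result = measure(() -> sleepQuietly(200));
        System.out.println("Wall-clock ms: " + result[0] / 1_000_000);
        System.out.println("Thread CPU ms: " + result[1] / 1_000_000);
    }
}
```

While the thread is blocked, the wall clock keeps running but the thread consumes almost no CPU. A driver-side timer therefore reports elapsed time only, including any waiting, and says nothing about how much CPU the map and reduce tasks used.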
