Application processing time depending on the number of computing nodes

This article looks at how an application's processing time depends on the number of computing nodes; the question and answer below may be a useful reference for anyone facing a similar problem.

Problem description

Maybe this question is a little bit strange... but I'll try to ask it.

I have a Spark application and I test it on different numbers of computing nodes (the count varies from one to four nodes).

All nodes are identical: they have the same CPUs and the same amount of RAM.

All application settings (such as the parallelism level and the number of partitions) are held constant.

Here are the processing times depending on the number of computing nodes:

1 node -- 127 minutes
2 nodes -- 71 minutes
3 nodes -- 51 minutes
4 nodes -- 38 minutes

Approximating the results and extrapolating them suggests that the processing time decreases roughly exponentially as the number of nodes increases linearly. So, in the limit, adding more nodes will no longer significantly reduce the application's processing time...
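As a rough check on that observation, the reported timings can be compared against Amdahl's law, where the runtime is modelled as T(n) = T1 * (s + (1 - s)/n) for a serial fraction s. The Python sketch below is only a sanity check under that assumed model; the model itself is not something stated in the question or the answer.

# Compute speedup and parallel efficiency from the reported timings, then fit
# Amdahl's law T(n) = T1 * (s + (1 - s)/n) to estimate the serial fraction s.
# The Amdahl model is an assumption about the workload, not a given fact.
import numpy as np

nodes   = np.array([1, 2, 3, 4])
minutes = np.array([127.0, 71.0, 51.0, 38.0])

t1 = minutes[0]
for n, t in zip(nodes, minutes):
    speedup = t1 / t
    print(f"{n} node(s): speedup {speedup:.2f}x, efficiency {speedup / n:.0%}")

# T(n)/T1 = s + (1 - s)/n  rearranges to  (T(n)/T1 - 1/n) = s * (1 - 1/n);
# solve for s with a one-parameter least-squares fit.
x = 1.0 - 1.0 / nodes
y = minutes / t1 - 1.0 / nodes
s = float(np.dot(x, y) / np.dot(x, x))
print(f"estimated serial fraction s ~ {s:.2f}, runtime floor ~ {s * t1:.0f} minutes")

With these numbers the fitted serial fraction comes out somewhere around 10%, which would put a floor of roughly ten minutes on the runtime no matter how many nodes are added; that is consistent with the diminishing returns described in the answer below.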

Could anyone explain this fact?

Thanks!

Answer

First off, this heavily depends on the type of your job. Is it I/O-bound? Then adding more CPUs won't help much; adding more nodes will help, but the disks still limit the job's performance.

Secondly, every node you add brings overhead: launching executors and tasks, scheduling, and so on. There are also network transfers between the nodes, especially if your job has multiple shuffles.

You can also try to increase the parallelism so that more nodes and more CPUs can actually be taken advantage of. But in general it's difficult to achieve 100% parallelization, especially in a young project like Spark.
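For concreteness, here is a minimal sketch of raising the parallelism with the RDD API; the cluster size (4 nodes x 8 cores), the "a few partitions per core" rule of thumb, and the input path are assumptions for illustration only, not details from the question.

# Minimal PySpark sketch: make sure there are enough partitions to keep every
# core in the cluster busy. Cluster size and input path are hypothetical.
from pyspark import SparkConf, SparkContext

cores_total = 4 * 8                    # assumed: 4 nodes x 8 cores each
target_partitions = cores_total * 3    # a few partitions per core smooths out stragglers

conf = (
    SparkConf()
    .setAppName("parallelism-example")
    # Default number of partitions for shuffles and parallelized collections.
    .set("spark.default.parallelism", str(target_partitions))
)
sc = SparkContext(conf=conf)

# Ask for enough input partitions up front...
lines = sc.textFile("hdfs:///path/to/input", minPartitions=target_partitions)

# ...or repartition later if an upstream step produced too few partitions.
lines = lines.repartition(target_partitions)

The idea is simply that there should be at least as many partitions as cores in the cluster, otherwise some executors sit idle and adding nodes cannot pay off.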

This concludes the discussion of application processing time depending on the number of computing nodes; we hope the answer above is helpful.
