Application performance vs peak performance

Question

I have questions about the performance of a real application running on a cluster versus the cluster's peak performance.

Say an HPC cluster reports a peak performance of 1 petaflops. How is this calculated? It seems to me that there are two metrics: one is calculated from the hardware specifications, and the other comes from running HPL. Is my understanding correct? I also read about a real application running on the system at full scale, whose developer mentions that it achieves 10% of the peak performance. How is this measured, and why can't it achieve peak performance?

Thanks

Recommended answer

Peak performance is what the system is theoretically able to deliver. It is the product of the total number of CPU cores, the core clock frequency, and the number of FLOPs one core performs per clock cycle. That performance can never be reached in practice because no real application consists entirely of fully vectorised tight loops that operate only on data held in the L1 data cache. In many cases the data doesn't even fit in the last-level cache, and the memory interface is usually not fast enough to deliver data at the rate at which the CPU can process it. One ubiquitous example from HPC is the multiplication of a sparse matrix with a vector. It is so memory intensive (i.e. many loads and stores per arithmetic operation) that on many platforms it achieves only a small fraction of the peak performance.
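As a rough sketch of the arithmetic above, the following computes a theoretical peak, the fraction of peak an application reaches (achieved FLOP rate divided by peak), and a simple bandwidth bound for a sparse matrix-vector product. All hardware numbers (node count, cores, clock, FLOPs per cycle, memory bandwidth) are made-up illustrative values, the function names are hypothetical, and the 2 FLOPs per 12 bytes estimate assumes a CSR-like format with 8-byte values and 4-byte column indices while ignoring vector traffic.

```python
def peak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak: total cores x clock frequency x FLOPs per core per cycle."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle


def fraction_of_peak(flop_count, runtime_s, peak):
    """Achieved FLOP rate (operations / wall time) divided by theoretical peak."""
    return (flop_count / runtime_s) / peak


def bandwidth_bound_flops(peak, mem_bandwidth_bytes_s, flops_per_byte):
    """Attainable rate is capped by whichever is lower: the compute peak
    or memory bandwidth times arithmetic intensity (a roofline-style bound)."""
    return min(peak, mem_bandwidth_bytes_s * flops_per_byte)


# Hypothetical cluster: 1000 nodes, 32 cores each, 2 GHz, 16 FLOPs/cycle/core.
cluster_peak = peak_flops(1000, 32, 2.0e9, 16)

# Hypothetical application: 1.024e17 floating-point operations in 1000 s.
app_fraction = fraction_of_peak(1.024e17, 1000.0, cluster_peak)

# One node of the same cluster, with an assumed 100 GB/s memory bandwidth.
node_peak = peak_flops(1, 32, 2.0e9, 16)
# Sparse matrix-vector product: about 2 FLOPs (multiply + add) per 12 bytes loaded.
spmv_rate = bandwidth_bound_flops(node_peak, 100e9, 2 / 12)
spmv_fraction = spmv_rate / node_peak

print(f"cluster peak: {cluster_peak / 1e15:.3f} PFLOP/s")
print(f"application fraction of peak: {app_fraction:.1%}")
print(f"bandwidth-bound SpMV fraction of node peak: {spmv_fraction:.1%}")
```

In practice the operation count would come from hardware counters or from an analytic count of the algorithm's floating-point operations, and the runtime from wall-clock timing of the full-scale run.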

Things get even worse when multiple nodes are networked together on a massive scale, as data transfers can introduce huge additional delays. Performance in those cases is determined mainly by the ratio of local data processing to data transfer. HPL is particularly good in that respect: it does a lot of vectorised local processing and does not move much data across the CPUs/nodes. That's not the case with many real-world parallel programs, which is why many now question the applicability of HPL in assessing cluster performance. Alternative benchmarks are already emerging, for example the HPCG benchmark (from the people who brought you HPL).
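The effect of the processing-to-transfer ratio can be illustrated with a deliberately simple model. It assumes computation and communication do not overlap, which real codes often avoid at least in part, and both the function name and the timings are hypothetical.

```python
def parallel_efficiency(compute_s, transfer_s):
    """Fraction of wall time spent on local computation, assuming
    computation and data transfer never overlap (a simplification)."""
    return compute_s / (compute_s + transfer_s)


# Mostly local work with little data movement (the regime HPL operates in).
compute_heavy = parallel_efficiency(compute_s=10.0, transfer_s=0.5)

# Work dominated by moving data between nodes.
transfer_heavy = parallel_efficiency(compute_s=1.0, transfer_s=4.0)

print(f"compute-heavy efficiency:  {compute_heavy:.1%}")
print(f"transfer-heavy efficiency: {transfer_heavy:.1%}")
```

Under this model, a code that spends four times as long transferring data as computing cannot exceed 20% of whatever rate its local kernels reach, regardless of how well those kernels are tuned.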
