Understanding perf detail when comparing two different implementations of a BFS algorithm


Problem description



The results below were measured using perf on a compute server with 32 cores. I know my implementation is unoptimized, but purposely so, as I want to make comparisons. I understand that graph algorithms tend to have low locality, which researchers try to address.

I'm unclear about the results, though. The elapsed time is misleading: my implementation runs through a graph with about 4 million nodes in about 10 seconds, and the rest of the time is pre-processing. The optimized version uses the same input and traverses it about 10 times, each traversal taking less than a second, so the elapsed time is really just pre-processing time. I'm not trying to achieve the same performance, just to understand why it differs, based on perf.

I see that my page faults are substantially higher. I'm not 100% sure why this is the case, as the annotations (from what I can tell) do not point to any specific piece of my code...

__gnu_cxx::new_allocator<std::_List_node<int> >::construct<int, int const&>

This seems to come from when I process the graph itself, since I create linked lists for the adjacency lists. I figured this might actually cause issues and wanted to investigate anyway. Should I be able to improve the page faults (and hopefully performance) by switching to jagged arrays?
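For concreteness, here is a minimal sketch (not the poster's code; the type names, builder functions and edge-list input are illustrative assumptions) contrasting the std::list-per-vertex layout, which is what produces the _List_node<int> allocations seen in the annotation above, with a CSR-style "jagged array" layout built from the same edges:

// Minimal sketch of the two adjacency-list layouts discussed above.
// Types, function names and the edge-list input are illustrative assumptions.
#include <cstdint>
#include <list>
#include <utility>
#include <vector>

// Layout A: one std::list<int> per vertex. Every push_back allocates a separate
// _List_node<int> on the heap (the new_allocator<_List_node<int>>::construct symbol),
// so nodes end up scattered across pages: many allocations, many page faults,
// poor cache locality when traversing.
using ListGraph = std::vector<std::list<int>>;

ListGraph build_list_graph(int n, const std::vector<std::pair<int, int>>& edges) {
    ListGraph g(n);
    for (const auto& e : edges) g[e.first].push_back(e.second);
    return g;
}

// Layout B: CSR (the "jagged array" flattened into two arrays). All neighbours sit in
// one contiguous buffer; offsets[v]..offsets[v+1] delimit v's neighbours. Construction
// does a handful of large allocations instead of one per edge, and traversal is
// sequential in memory.
struct CsrGraph {
    std::vector<std::int64_t> offsets;  // size n + 1
    std::vector<int> neighbors;         // size = number of edges
};

CsrGraph build_csr_graph(int n, const std::vector<std::pair<int, int>>& edges) {
    CsrGraph g;
    g.offsets.assign(n + 1, 0);
    for (const auto& e : edges) ++g.offsets[e.first + 1];          // per-vertex degree counts
    for (int v = 0; v < n; ++v) g.offsets[v + 1] += g.offsets[v];  // prefix sum -> offsets
    g.neighbors.resize(edges.size());
    std::vector<std::int64_t> next(g.offsets.begin(), g.offsets.end() - 1);
    for (const auto& e : edges) g.neighbors[next[e.first]++] = e.second;
    return g;
}

Beyond removing the one-allocation-per-edge cost at build time, the CSR layout also makes each vertex's neighbour range contiguous in memory, which is usually what improves cache and paging behaviour during the BFS itself.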

The optimized algorithm has a much higher last-level cache miss count, which I thought would explain the primary issue with BFS / graph algorithms (low locality), but its performance seems unaffected by this, while my unoptimized version's count is significantly lower.

Then there are the frontend / backend cycles, which seem to be the opposite in terms of performance issues when comparing the two - mine is worse on the frontend and the optimized version is worse on the backend.

Am I missing or not understanding something obvious? I thought something related to low locality would stand out as the issue when looking at perf, but the optimized version's numbers confuse me.


This is my implementation of unoptimized parallel BFS (running once)...
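The original listing is not reproduced here. Purely as an illustration of the kind of frontier-based, level-synchronous parallel BFS being described - this is not the poster's code, and it reuses the hypothetical CsrGraph type from the sketch above (compile with -fopenmp):

#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative level-synchronous parallel BFS; returns the BFS level of every vertex
// (-1 for unreachable vertices). Frontier handling and OpenMP usage are assumptions,
// not a reconstruction of the poster's implementation.
std::vector<int> parallel_bfs(const CsrGraph& g, int source) {
    const int n = static_cast<int>(g.offsets.size()) - 1;
    std::vector<std::atomic<int>> level(n);
    for (auto& l : level) l.store(-1, std::memory_order_relaxed);
    level[source].store(0, std::memory_order_relaxed);

    std::vector<int> frontier{source};
    int depth = 0;
    while (!frontier.empty()) {
        std::vector<int> next_frontier;
        #pragma omp parallel
        {
            std::vector<int> local;  // per-thread buffer to limit contention on next_frontier
            #pragma omp for nowait
            for (std::size_t i = 0; i < frontier.size(); ++i) {
                const int u = frontier[i];
                for (auto e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
                    const int v = g.neighbors[e];
                    int expected = -1;
                    // Atomically claim v so each vertex joins the next frontier exactly once.
                    if (level[v].compare_exchange_strong(expected, depth + 1))
                        local.push_back(v);
                }
            }
            #pragma omp critical
            next_frontier.insert(next_frontier.end(), local.begin(), local.end());
        }
        frontier.swap(next_frontier);
        ++depth;
    }

    std::vector<int> result(n);
    for (int v = 0; v < n; ++v) result[v] = level[v].load(std::memory_order_relaxed);
    return result;
}

The point of the sketch is only that each level walks contiguous neighbour ranges rather than chasing list nodes; whether a given implementation then matches the benchmark-suite version is a separate question.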

This is using an optimized parallel BFS from a benchmark suite (running 10 times)...

Both take about 40 seconds to pre-process the data once, before doing parallel searching.

Solution

Unfortunately, perf stat often doesn't give enough information to really determine where the bottleneck in your application is. It is possible for two applications to have wildly different underlying bottlenecks yet very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect while the other is almost not impacted at all, depending on the amount and nature of overlapping work.

So if you try to analyze in depth from these high-level counters, you are often just taking stabs in the dark. Still, we can make a few observations. You mention:

The optimized algorithm has a much higher last-level cache miss count, which I thought would explain the primary issue with BFS / graph algorithms (low locality), but its performance seems unaffected by this, while my unoptimized version's count is significantly lower.

First, LLC misses are ~620 million for the optimized algorithm and ~380 million for yours, but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses per run, and your algorithm has about six times that number of LLC misses. Yes, your algorithm has a lower LLC miss rate - but the absolute number of LLC misses is what counts for performance. The lower miss rate just means that you are making even more total accesses than the 6x figure suggests: basically, you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but more total misses.

All of this points to your unoptimized algorithm accessing more total memory, or perhaps accessing it in a much more cache-unfriendly fashion. That would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC); given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory access problems, stalls while waiting for memory are very likely.

Luckily, there is a better way than just trying to reverse-engineer the bottleneck from perf stat output. Intel has developed a whole top-down analysis methodology which tries to identify the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology using Andi Kleen's toplev. I highly recommend you start there.
