Large performance drop with gcc, maybe related to inline

Question

I'm currently experiencing some weird effect with gcc (tested version: 4.8.4).

I've got a performance oriented code, which runs pretty fast. Its speed depends for a large part on inlining many small functions.

Since inlining across multiple .c files is difficult (-flto is not yet widely available), I've kept a lot of small functions (typically 1 to 5 lines of code each) in a common C file, in which I'm developing a codec and its associated decoder. It's "relatively" large by my standards (about ~2000 lines, although a lot of them are just comments and blank lines), but breaking it into smaller parts opens new problems, so I would prefer to avoid that, if possible.

Encoder and Decoder are related, since they are inverse operations. But from a programming perspective, they are completely separated, sharing nothing in common, except a few typedefs and very low-level functions (such as reading from an unaligned memory position).
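
To make the setup concrete, here is a minimal sketch of that layout; every name below is hypothetical and the bodies are trivial stand-ins, not the actual codec:

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;   /* the small shared typedefs */
typedef uint16_t u16;

/* tiny shared helper (1-5 lines), left for gcc to inline at -O2 */
static u32 read32(const void *p)
{
    u32 v;
    memcpy(&v, p, sizeof v);        /* unaligned-safe load */
    return v;
}

/* encoder and decoder entry points live in the same .c file, so the helper
 * above can be inlined into both without needing -flto */
u32 encode_block(const uint8_t *src)
{
    return read32(src) + 1;         /* stand-in for real encoding work */
}

u32 decode_block(const uint8_t *src)
{
    return read32(src) - 1;         /* stand-in for real decoding work */
}
```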

The strange effect is this one:

I recently added a new function fnew to the encoder side. It's a new "entry point". It's not used nor called from anywhere within the .c file.

The simple fact that it exists makes the performance of the decoder function fdec drop substantially, by more than 20%, which is way too much to be ignored.

Now, keep in mind that encoding and decoding operations are completely separated, and share almost nothing, save some minor typedefs (u32, u16 and such) and associated operations (read/write).

When defining the new encoding function fnew as static, performance of the decoder fdec increases back to normal. Since fnew isn't called from the .c, I guess it's the same as if it was not there (dead code elimination).
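
A small sketch of what that looks like (placeholder names and bodies; only the static keyword and the fact that fnew is never referenced matter here):

```c
#include <stdint.h>

/* unused + static: gcc at -O2 can drop it entirely (dead-code elimination),
 * so it no longer takes up space in the object file; e.g. `nm` on the object
 * should then list no fnew symbol at all */
static int fnew(const uint8_t *src, uint8_t *dst, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = (uint8_t)(src[i] + 1);   /* placeholder "encoding" */
    return len;
}

/* the decoder entry point whose speed was regressing */
int fdec(const uint8_t *src, uint8_t *dst, int len)
{
    int i;
    for (i = 0; i < len; i++)
        dst[i] = (uint8_t)(src[i] - 1);   /* placeholder "decoding" */
    return len;
}
```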

If static fnew is now called from the encoder side, performance of fdec remains strong.

But as soon as fnew is modified, fdec performance just drops substantially.

Presuming fnew modifications crossed some threshold, I increased the following gcc parameter: --param max-inline-insns-auto=60 (by default, its value is supposed to be 40). And it worked: performance of fdec is now back to normal.

And I guess this game will continue forever with each little modification of fnew or anything else similar, requiring further tweaks.

This is just plain weird. There is no logical reason for some little modification in function fnew to have a knock-on effect on the completely unrelated function fdec, whose only relation is being in the same file.

The only tentative explanation I could invent so far is that maybe the simple presence of fnew is enough to cross some kind of global file threshold which would impact fdec. fnew can be made "not present" when it's: 1. not there, 2. static but not called from anywhere, or 3. static and small enough to be inlined. But that's just hiding the problem. Does it mean I can't add any new function?

Really, I couldn't find any satisfying explanation anywhere on the net.

I was curious to know if someone already experienced some equivalent side-effect, and found a solution to it.

Let's go for some more crazy tests. Now I'm adding another completely useless function, just to play with. Its content is strictly a copy-paste of fnew, but the name of the function is obviously different, so let's call it wtf.

When wtf exists, it doesn't matter if fnew is static or not, nor what the value of max-inline-insns-auto is: performance of fdec is back to normal. Even though wtf is not used nor called from anywhere... :'(


There is no inline instruction anywhere; all functions are either normal or static. Inlining decisions are solely within the compiler's realm, which has worked fine so far.


As suggested by Peter Cordes, the issue is not related to inlining, but to instruction alignment. On newer Intel CPUs (Sandy Bridge and later), hot loops benefit from being aligned on 32-byte boundaries. The problem is that, by default, gcc aligns them on 16-byte boundaries, which gives only a 50% chance of proper alignment, depending on the length of the preceding code. Hence a difficult-to-understand issue, which "looks random".

Not all loops are sensitive. It only matters for critical loops, and only if their length makes them cross one more 32-byte instruction segment when less ideally aligned.
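
As a toy illustration of that last point: the number of 32-byte chunks a fixed-length loop body touches depends only on where it starts, and with 16-byte alignment the start is either also 32-byte aligned or 16 bytes off, hence the apparent 50/50 behaviour (the 26-byte loop length below is made up):

```c
#include <stdio.h>

/* number of 32-byte chunks touched by a code region [start, start+len) */
static int chunks32(unsigned start, unsigned len)
{
    unsigned end = start + len - 1;
    return (int)(end / 32 - start / 32 + 1);
}

int main(void)
{
    unsigned len = 26;   /* hypothetical hot-loop size in bytes */

    /* 16B-aligned starts: either also 32B-aligned, or 16 bytes past one */
    printf("start %% 32 == 0  -> %d chunk(s)\n", chunks32(64, len)); /* 1 */
    printf("start %% 32 == 16 -> %d chunk(s)\n", chunks32(80, len)); /* 2 */
    return 0;
}
```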

Answer

Turning my comments into an answer, because this was turning into a long discussion. The discussion showed that the performance problem is sensitive to alignment.

There are links to some perf-tuning info at http://stackoverflow.com/tags/x86/info, including Intel's optimization guide and Agner Fog's excellent material. Some of Agner Fog's assembly optimization advice doesn't fully apply to Sandybridge and later CPUs. If you want the low-level details on a specific CPU, though, the microarch guide is very good.

Without at least an external link to code that I can try myself, I can't do more than handwave. If you don't post the code anywhere, you're going to need to use profiling / CPU performance counter tools like Linux perf or Intel VTune to track this down in a reasonable amount of time.

In chat, the OP found someone else having this issue, but with code posted. This is probably the same issue the OP is seeing, and is one of the major ways code alignment matters for Sandybridge-style uop caches.

There's a 32B boundary in the middle of the loop in the slow version. The instructions that start before the boundary decode to 5 uops. So in the first cycle, the uop cache serves up mov/add/movzbl/mov. In the 2nd cycle, there's only a single mov uop left in the current cache line. Then the 3rd cycle issues the last 2 uops of the loop: add and cmp+ja.

The problematic mov starts at 0x..ff. I guess instructions that span a 32B boundary go into (one of) the uop cacheline(s) for their starting address.

In the fast version, an iteration only takes 2 cycles to issue: The same first cycle, then mov / add / cmp+ja in the 2nd.

If one of the first 4 instructions had been one byte longer (e.g. padded with a useless prefix, or a REX prefix), there would be no problem. There wouldn't be an odd-man-out at the end of the first cacheline, because the mov would start after the 32B boundary and be part of the next uop cache line.

AFAIK, assembling & checking the disassembly output is the only way to use longer versions of the same instructions (see Agner Fog's Optimizing Assembly) to get 32B boundaries at multiples of 4 uops. I'm not aware of a GUI that shows the alignment of assembled code as you're editing. (And obviously, doing this only works for hand-written asm, and is brittle. Changing the code at all will break the hand-alignment.)

This is why Intel's optimization guide recommends aligning critical loops to 32B.
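
For reference, a compiler-side sketch (not something verified against the OP's code): gcc exposes flags for exactly this, plus a per-function attribute; the function and its loop below are stand-ins.

```c
/* Building with e.g.
 *     gcc -O2 -falign-loops=32 -c codec.c
 * pads so that loop tops land on 32-byte boundaries (-falign-functions=32
 * does the same for function entries). The attribute below aligns only this
 * function's entry point; it does not, by itself, align the inner loop. */
#include <stddef.h>

__attribute__((aligned(32)))
size_t sum_bytes(const unsigned char *p, size_t n)
{
    size_t i, s = 0;
    for (i = 0; i < n; i++)     /* stand-in for the critical hot loop */
        s += p[i];
    return s;
}
```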

It would be really cool if an assembler had a way to request that preceding instructions be assembled using longer encodings to pad out to a certain length. Maybe a .startencodealign / .endencodealign 32 pair of directives, to apply padding to code between the directives to make it end on a 32B boundary. This could make terrible code if used badly, though.

Changes to the inlining parameter will change the size of functions, and bump other code over by multiples of 16B. This is a similar effect to changing the contents of a function: it gets bigger and changes the alignment of other functions.

"I was expecting the compiler to always make sure a function starts at an ideally aligned position, using noops to fill gaps."

There's a tradeoff. It would hurt performance to align every function to 64B (the start of a cache line). Code density would go down, with more cache lines needed to hold the instructions. 16B is good, because it's the instruction fetch/decode chunk size on most recent CPUs.

Agner Fog has the low-level details for each microarch. He hasn't updated it for Broadwell yet, but the uop cache probably hasn't changed since Sandybridge. I assume there's one fairly small loop that dominates the runtime. I'm not sure exactly what to look for first. Maybe the "slow" version has some branch targets near the end of a 32B block of code (and hence near the end of a uop cacheline), leading to significantly less than 4 uops per clock coming out of the frontend.

Look at performance counters for the "slow" and "fast" versions (e.g. with perf stat ./cmd), and see if any are different. e.g. a lot more cache misses could indicate false sharing of a cache line between threads. Also, profile and see if there's a new hotspot in the "slow" version. (e.g. with perf record ./cmd && perf report on Linux).
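
If there isn't already one, a trivial driver along these lines (placeholder names and sizes, linked against the codec's .c file; fdec's signature is assumed from the sketches above) makes it easy to point those commands at just the decoder:

```c
#include <stdint.h>
#include <stdlib.h>

/* assumed decoder signature, matching the earlier sketches */
int fdec(const uint8_t *src, uint8_t *dst, int len);

int main(void)
{
    enum { N = 1 << 16 };
    uint8_t *src = malloc(N), *dst = malloc(N);
    int i, sink = 0;

    for (i = 0; i < N; i++)
        src[i] = (uint8_t)i;            /* arbitrary input data */
    for (i = 0; i < 20000; i++)         /* run long enough for perf stat/record */
        sink += fdec(src, dst, N);

    free(src);
    free(dst);
    return sink & 1;                    /* keep the work from being optimized out */
}
```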

How many uops/clock is the "fast" version getting? If it's above 3, frontend bottlenecks (maybe in the uop cache) that are sensitive to alignment could be the issue. Either that or L1 / uop-cache misses if different alignment means your code needs more cache lines than are available.

Anyway, this bears repeating: use a profiler / performance counters to find the new bottleneck that the "slow" version has, but the "fast" version doesn't. Then you can spend time looking at the disassembly of that block of code. (Don't look at gcc's asm output. You need to see the alignment in the disassembly of the final binary.) Look at the 16B and 32B boundaries, since presumably they'll be in different places between the two versions, and we think that's the cause of the problem.

Alignment can also make macro-fusion fail, if a compare/jcc splits a 16B boundary exactly. Although that is unlikely in your case, since your functions are always aligned to some multiple of 16B.

re: automated tools for alignment: no, I'm not aware of anything that can look at a binary and tell you anything useful about alignment. I wish there was an editor to show groups of 4 uops and 32B boundaries alongside your code, and update as you edit.

Intel's IACA can sometimes be useful for analyzing a loop, but IIRC it doesn't know about taken branches, and I think doesn't have a sophisticated model of the frontend, which is obviously the issue if misalignment breaks performance for you.
