Strict aliasing and memory alignment
Question
I have performance-critical code with a huge function that allocates about 40 arrays of different sizes on the stack at the beginning of the function. Most of these arrays must have a certain alignment, because they are accessed further down the call chain using CPU instructions that require aligned memory (on both Intel and ARM CPUs).
Since some versions of GCC simply fail to align stack variables properly (notably for ARM code), and sometimes GCC even reports that the maximum alignment for the target architecture is less than what my code actually requests, I have no choice but to allocate these arrays on the stack and align them manually.
So, for each array I need to do something like this to get it aligned properly:
short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));
This way, history is now aligned on a 32-byte boundary. Doing the same for all 40 arrays is tedious, and this part of the code is really CPU intensive, so I simply cannot apply the same alignment technique to each array separately (the alignment mess confuses the optimizer, and the different register allocation slows the function down big time; for details, see the explanation at the end of the question).
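The round-up-and-mask step above can be factored into a small helper. This is only a sketch: the names align_up and is_aligned are hypothetical and do not appear in the original code, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original code): round `p` up to the
 * next multiple of `align`. `align` must be a power of two. */
static inline void *align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)align - 1;
    /* Adding the mask and clearing the low bits rounds up, exactly like
     * the (x + 31) & ~31 expression in the question. */
    return (void *)((addr + mask) & ~mask);
}

/* Returns 1 when `p` sits on an `align`-byte boundary, 0 otherwise. */
static inline int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & ((uintptr_t)align - 1)) == 0;
}
```

With this, the declaration in the question would read short * history = align_up(history_, 32);. Note that the padding in history_ is counted in elements, not bytes: 32 extra shorts reserve 64 bytes, while rounding up to a 32-byte boundary never skips more than 31 bytes.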
So, obviously, I want to do the manual alignment only once and assume that the arrays are laid out one right after another. I also added extra padding to the arrays so that each size is always a multiple of 32 bytes. Then I simply create a jumbo char array on the stack and cast it to a struct that contains all these aligned arrays:
struct tmp
{
short history[HIST_SIZE];
short history2[2*HIST_SIZE];
...
int energy[320];
...
};
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
Something like that. Maybe not the most elegant, but it produced really good results, and manual inspection of the generated assembly showed that the generated code was more or less adequate and acceptable. Then the build system was updated to use a newer GCC, and suddenly we started to see artifacts in the generated data (e.g. the output from the validation test suite is no longer bit-exact, even in a pure C build with the asm code disabled). It took a long time to debug the issue, and it appears to be related to aliasing rules and newer versions of GCC.
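The layout assumption behind this scheme (every array size padded to a multiple of 32 bytes, so each member starts on a 32-byte boundary once the struct base is aligned) can be checked at compile time. The sketch below is illustrative only: HIST_SIZE = 100, the struct name tmp_sketch, and the PAD_BYTES / PADDED macros are assumptions, the elided members of the original struct are left out, and the element sizes are assumed to divide 32. It only computes addresses and never accesses the char buffer through the struct type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGNMENT 32
#define HIST_SIZE 100 /* assumed value; the question does not state it */

/* Round a byte count up to the next multiple of ALIGNMENT. */
#define PAD_BYTES(n) ((((n) + ALIGNMENT - 1) / ALIGNMENT) * ALIGNMENT)
/* Element count that makes an array of `type` fill PAD_BYTES bytes. */
#define PADDED(type, n) (PAD_BYTES((n) * sizeof(type)) / sizeof(type))

/* Stand-in for the struct in the question, with padded array sizes. */
struct tmp_sketch {
    short history[PADDED(short, HIST_SIZE)];
    short history2[PADDED(short, 2 * HIST_SIZE)];
    int   energy[PADDED(int, 320)];
};

/* Each member must start at a multiple of 32 bytes from the struct base. */
_Static_assert(offsetof(struct tmp_sketch, history) % ALIGNMENT == 0, "history offset");
_Static_assert(offsetof(struct tmp_sketch, history2) % ALIGNMENT == 0, "history2 offset");
_Static_assert(offsetof(struct tmp_sketch, energy) % ALIGNMENT == 0, "energy offset");
_Static_assert(sizeof(struct tmp_sketch) % ALIGNMENT == 0, "struct size");

/* Returns 1 if every member would land on a 32-byte boundary when the
 * struct is placed at address `base`; no memory is read or written. */
static inline int members_aligned_at(uintptr_t base)
{
    return (base + offsetof(struct tmp_sketch, history)) % ALIGNMENT == 0
        && (base + offsetof(struct tmp_sketch, history2)) % ALIGNMENT == 0
        && (base + offsetof(struct tmp_sketch, energy)) % ALIGNMENT == 0;
}
```

If one of the static assertions fires, some array size is not a multiple of 32 bytes and aligning the struct base alone is not enough to align the members.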
So, how can I get it done? Please don't waste time trying to explain that it's not standard, not portable, undefined, etc. (I've read many articles about that). Also, there is no way I can change the code (I would perhaps consider modifying GCC to fix the issue, but not refactoring the code). Basically, all I want is to apply some black magic spell so that the newer GCC produces functionally the same code for this kind of code, without disabling optimizations.
Edit:
In short, the point of the question is: how can I allocate an arbitrary amount of stack space (using char arrays or alloca), then align a pointer into that stack space and reinterpret the chunk of memory as a structure with a well-defined layout that guarantees the alignment of certain variables as long as the structure itself is aligned properly? I have tried casting the memory in all kinds of ways, and I moved the big stack allocation to a separate function, but I still get bad output and stack corruption. I'm really starting to think more and more that this huge function hits some kind of bug in GCC. It's quite strange that I can't get this cast to work no matter what I try. By the way, I disabled all optimizations that require any alignment, so it's pure C-style code now, and I still get bad results (non-bit-exact output and occasional stack-corruption crashes). The simple fix that fixes it all: instead of
char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
I write this code:
tmp buf;
tmp * X = &buf;
and then all the bugs disappear! The only problem is that this code doesn't align the arrays properly and will crash with optimizations enabled.
Interesting observation:
As I mentioned, this approach works well and produces the expected output:
tmp buf;
tmp * X = &buf;
In another file I added a standalone noinline function that simply casts a void pointer to struct tmp*:
struct tmp * to_struct_tmp(void * buffer32)
{
return (struct tmp *)buffer32;
}
Initially, I thought that casting the allocated memory with to_struct_tmp would trick GCC into producing the results I expected, yet it still produces invalid output. If I modify the working code this way:
tmp buf;
tmp * X = to_struct_tmp(&buf);
then I get the same bad result! Wow, what else can I say? Perhaps, based on the strict-aliasing rule, GCC assumes that tmp * X isn't related to tmp buf and removes tmp buf as an unused variable right after the return from to_struct_tmp? Or it does something else strange that produces the unexpected result. I also tried to inspect the generated assembly; however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so somehow that aliasing rule affects code generation big time.
Conclusion:
After all kinds of testing, I have an idea why I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack space for it. Then local variables that also live on the stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory with other variables of the function. Only this could explain why it always produces the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.
Recommended answer
Just disable alias-based optimization and call it a day
If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.
Good point by Praetorian (http://stackoverflow.com/users/241631/praetorian). I recall one developer's hysteria prompted by the introduction of alias analysis in GCC. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification, but it seems like -fno-strict-aliasing would solve the problem and not cost much, and they all must have had other fish to fry.)