Strict aliasing and memory alignment


Problem description

I have performance-critical code containing a huge function that allocates about 40 arrays of different sizes on the stack at its beginning. Most of these arrays must have a certain alignment, because they are accessed further down the chain using CPU instructions that require aligned memory (on both Intel and ARM CPUs).

Since some versions of GCC simply fail to align stack variables properly (notably for ARM code), and sometimes even report that the maximum alignment for the target architecture is less than what my code requests, I have no choice but to allocate these arrays on the stack and align them manually.

So, for each array I need to do something like this to get it aligned properly:

short history_[HIST_SIZE + 32];
short * history = (short*)((((uintptr_t)history_) + 31) & (~31));

This way, history is aligned on a 32-byte boundary. Doing the same for all 40 arrays is tedious. On top of that, this part of the code is really CPU-intensive, and I simply cannot apply the same alignment technique to each array: the alignment mess confuses the optimizer, and the resulting register allocation slows the function down considerably (see the explanation at the end of the question).
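The rounding expression used above can be factored into a small helper. This is only a sketch: `align_up` and `align_waste` are hypothetical names that do not appear in the original code, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original code): round addr up to the
 * next multiple of alignment. alignment must be a power of two. */
static inline uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + (alignment - 1)) & ~(alignment - 1);
}

/* Number of bytes skipped at the start of a buffer that begins at addr. */
static inline size_t align_waste(uintptr_t addr, uintptr_t alignment)
{
    return (size_t)(align_up(addr, alignment) - addr);
}
```

With such a helper, the pointer computation would read `short *history = (short *)align_up((uintptr_t)history_, 32);`. At most `alignment - 1` bytes are skipped, so 31 bytes of slack are enough for a 32-byte boundary.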

So... obviously, I want to do the manual alignment only once and assume that the arrays are laid out one right after another. I also added extra padding to the arrays so that each size is a multiple of 32 bytes. Then I simply create a jumbo char array on the stack and cast it to a struct that contains all these aligned arrays:

struct tmp
{
   short history[HIST_SIZE];
   short history2[2*HIST_SIZE];
   ...
   int energy[320];
   ...
};


char buf[sizeof(tmp) + 32];
tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));

Something like that. Maybe not the most elegant, but it produced really good results, and manual inspection of the generated assembly showed that the generated code was more or less adequate and acceptable. Then the build system was updated to use a newer GCC, and suddenly we started to see artifacts in the generated data (e.g. the output of the validation test suite is no longer bit-exact, even in a pure C build with the asm code disabled). It took a long time to debug the issue, and it appears to be related to aliasing rules and newer versions of GCC.
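The layout assumption behind the jumbo buffer (every array size is a multiple of 32 bytes, so aligning the base aligns every array) can be checked at compile time. The following is a sketch under stated assumptions: `struct tmp_sketch`, `members_aligned`, and the value of `HIST_SIZE` are made up for illustration, since the question elides most of the real struct, and `short` and `int` are assumed to be 2 and 4 bytes wide.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

#define HIST_SIZE 160   /* assumed value; the question does not state it */
#define ALIGNMENT 32

/* Hypothetical, shortened stand-in for the real struct tmp. */
struct tmp_sketch {
    short history[HIST_SIZE];
    short history2[2 * HIST_SIZE];
    int   energy[320];
};

/* If every member offset is a multiple of ALIGNMENT, aligning the base
 * pointer aligns every array inside the struct as well. */
static_assert(offsetof(struct tmp_sketch, history)  % ALIGNMENT == 0, "history misplaced");
static_assert(offsetof(struct tmp_sketch, history2) % ALIGNMENT == 0, "history2 misplaced");
static_assert(offsetof(struct tmp_sketch, energy)   % ALIGNMENT == 0, "energy misplaced");

/* Runtime check: returns 1 when all arrays of *x sit on ALIGNMENT boundaries. */
static int members_aligned(const struct tmp_sketch *x)
{
    return (uintptr_t)x->history  % ALIGNMENT == 0
        && (uintptr_t)x->history2 % ALIGNMENT == 0
        && (uintptr_t)x->energy   % ALIGNMENT == 0;
}
```

These checks only verify the layout. They say nothing about whether the compiler is allowed to treat the char buffer and the struct as unrelated objects, which is the aliasing question discussed here.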

So, how can I get it done? Please don't waste time trying to explain that it's not standard, not portable, undefined etc. (I've read many articles about that). Also, there is no way I can change the code (I would perhaps even consider modifying GCC to fix the issue, but not refactoring the code). Basically, all I want is to apply some black magic spell so that newer GCC produces functionally the same code for this type of code without disabling optimizations.

Edit:

 

  • I used this code on multiple OSes/compilers, but started to have issues when I switched to a newer NDK, which is based on GCC 4.6. I get the same bad result with GCC 4.7 (from NDK r8d).
  • I mention 32-byte alignment. If it hurts your eyes, substitute any other number you like, for example 666 if it helps. There is absolutely no point in even mentioning that most architectures don't need that alignment. If I align 8 KB of local arrays on the stack, I lose 15 bytes for 16-byte alignment and 31 bytes for 32-byte alignment. I hope it's clear what I mean.
  • I say that there are about 40 arrays on the stack in performance-critical code. I probably also need to say that it's old third-party code that has been working well, and I don't want to mess with it. There is no need to say whether it's good or bad; there is no point in that.
  • This code/function has well-tested and well-defined behavior. We have exact numbers for the requirements of that code, e.g. it allocates X KB of RAM, uses Y KB of static tables, and consumes up to Z KB of stack space, and that cannot change, since the code won't be changed.
  • By saying that the "alignment mess confuses the optimizer" I mean that if I try to align each array separately, the optimizer allocates extra registers for the alignment code, and the performance-critical parts of the code suddenly don't have enough registers and start spilling to the stack instead, which slows the code down. This behavior was observed on ARM CPUs (I'm not worried about Intel at all, by the way).
  • By artifacts I mean that the output becomes non-bit-exact; some noise is added. This is either because of the type aliasing issue or because of some bug in the compiler that eventually results in wrong output from the function.

    In short, the point of the question: how can I allocate an arbitrary amount of stack space (using char arrays or alloca), then align a pointer into that stack space and reinterpret this chunk of memory as a structure with a well-defined layout that guarantees the alignment of certain variables as long as the structure itself is aligned properly? I have tried casting the memory in all kinds of ways and moved the big stack allocation to a separate function, yet I still get bad output and stack corruption. I'm really starting to think more and more that this huge function hits some kind of bug in GCC. It's quite strange that I can't get this cast to work no matter what I try. By the way, I disabled all optimizations that require any alignment, so it's pure C-style code now, and I still get bad results (non-bit-exact output and occasional stack-corruption crashes). The simple fix that fixes it all is to write, instead of:

    char buf[sizeof(tmp) + 32];
    tmp * X = (tmp*)((((uintptr_t)buf) + 31) & (~31));
    

    this code:

    tmp buf;
    tmp * X = &buf;
    

    Then all the bugs disappear! The only problem is that this code doesn't align the arrays properly and will crash with optimizations enabled.
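    Whether a given toolchain honors an alignment request on a plain local struct can be probed directly. The sketch below uses the C11 `alignas` specifier (older GCC versions spell it `__attribute__((aligned(32)))`). `struct tmp_demo` and `local_struct_is_aligned` are hypothetical names, and the question itself reports that some GCC versions for ARM do not deliver the requested stack alignment, so this shows how to check rather than a guaranteed fix.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical, shortened stand-in for the real struct tmp. */
struct tmp_demo {
    short history[16];
    int   energy[8];
};

/* Requests a 32-byte aligned local and reports whether the address of
 * that local is a multiple of 32 (1) or not (0). */
static int local_struct_is_aligned(void)
{
    alignas(32) struct tmp_demo buf;
    buf.energy[0] = 0;
    return ((uintptr_t)&buf % 32u) == 0 && buf.energy[0] == 0;
}
```

    An optimizer may fold this check based on the declared alignment, so on a suspect toolchain it is worth inspecting the generated assembly as well.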

    Interesting observation:
    I mentioned that this approach works well and produces expected output:

    tmp buf;
    tmp * X = &buf;
    

    In some other file I added a standalone noinline function that simply casts a void pointer to that struct tmp*:

    struct tmp * to_struct_tmp(void * buffer32)
    {
        return (struct tmp *)buffer32;
    }
    

    Initially, I thought that casting the alloc'ed memory through to_struct_tmp would trick GCC into producing the results I expected, yet it still produces invalid output. If I modify the working code this way:

    tmp buf;
    tmp * X = to_struct_tmp(&buf);
    

    then I get the same bad result! Wow, what else can I say? Perhaps, based on the strict-aliasing rule, GCC assumes that tmp * X isn't related to tmp buf and removes tmp buf as an unused variable right after the return from to_struct_tmp? Or it does something strange that produces the unexpected result. I also tried to inspect the generated assembly; however, changing tmp * X = &buf; to tmp * X = to_struct_tmp(&buf); produces extremely different code for the function, so that aliasing rule somehow affects code generation big time.

    Conclusion:
    After all kinds of testing, I have an idea why possibly I can't get it to work no matter what I try. Based on strict type aliasing, GCC thinks that the static array is unused and therefore doesn't allocate stack for it. Then, local variables that also use stack are written to the same location where my tmp struct is stored; in other words, my jumbo struct shares the same stack memory as other variables of the function. Only this could explain why it always results in the same bad result. -fno-strict-aliasing fixes the issue, as expected in this case.

    Recommended answer

    Just disable alias-based optimization and call it a day

    If your problems are in fact caused by optimizations related to strict aliasing, then -fno-strict-aliasing will solve the problem. Additionally, in that case, you don't need to worry about losing optimization because, by definition, those optimizations are unsafe for your code and you can't use them.

    Good point by Praetorian. I recall one developer's hysteria prompted by the introduction of alias analysis in gcc. A certain Linux kernel author wanted to (A) alias things, and (B) still get that optimization. (That's an oversimplification but it seems like -fno-strict-aliasing would solve the problem, not cost much, and they all must have had other fish to fry.)
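    The flag is normally passed on the command line (for example `gcc -O2 -fno-strict-aliasing file.c`) for the whole file. If only one function needs it, GCC also accepts a per-function `optimize` attribute. The following is a sketch, not something taken from the answer: the `NO_STRICT_ALIASING` macro, `struct tmp_demo`, and `sum_energy_demo` are hypothetical. The GCC manual also states that the `optimize` attribute is intended for debugging rather than production code, so the command-line flag remains the more conservative choice.

```c
#include <stdint.h>

#if defined(__GNUC__) && !defined(__clang__)
/* GCC-specific: compile the annotated function as if
 * -fno-strict-aliasing had been passed. */
#define NO_STRICT_ALIASING __attribute__((optimize("no-strict-aliasing")))
#else
#define NO_STRICT_ALIASING  /* other compilers: use the command-line flag */
#endif

/* Hypothetical, shortened stand-in for the real struct tmp. */
struct tmp_demo {
    short history[16];
    int   energy[8];
};

/* Same buffer trick as in the question: a char array reinterpreted as a
 * struct. This relies on type-based alias analysis being disabled. */
NO_STRICT_ALIASING
static int sum_energy_demo(void)
{
    char buf[sizeof(struct tmp_demo) + 32];
    struct tmp_demo *x =
        (struct tmp_demo *)(((uintptr_t)buf + 31) & ~(uintptr_t)31);
    int sum = 0;

    for (int i = 0; i < 8; i++)
        x->energy[i] = i;
    for (int i = 0; i < 8; i++)
        sum += x->energy[i];
    return sum;
}
```

    Whether the attribute is enough for the 40-array function in the question is an open assumption. If the function is inlined or the aliasing accesses happen in callees, applying the flag to the whole translation unit is the safer route.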
