如何在GCC的32字节边界处对齐堆栈? [英] How to align stack at 32 byte boundary in GCC?

查看:284
本文介绍了如何在GCC的32字节边界处对齐堆栈?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用基于GCC 4.6.1的Windows 64位目标的MinGW64版本。我正在玩新的英特尔AVX指令。我的命令行参数是 -march = corei7-avx -mtune = corei7-avx -mavx

但是在堆栈中分配局部变量时,我开始遇到分段错误错误。 GCC使用对齐的移动 VMOVAPS VMOVAPD 移动 __ m256 __ m256d 左右,这些指令需要32字节对齐。但是,Windows 64位的堆栈只有16个字节对齐。



如何将GCC的堆栈对齐方式更改为32个字节?

尝试使用 -mstackrealign 但无济于事,因为它只对齐到16个字节。我无法使 __属性__((force_align_arg_pointer))工作,无论如何它都对齐到16个字节。我一直没能找到解决这个问题的其他编译器选项。

编辑:
我试过使用 -mpreferred-stack-boundary = 5 ,但GCC说5不支持这个目标。我没有想法。

解决方案

我一直在探讨这个问题,提交了一份GCC错误报告,发现这是一个MinGW64相关问题。请参阅 GCC Bug#49001 。显然,GCC在Windows上不支持32字节的堆栈对齐。这有效地防止了使用256位AVX指令。



我研究了几种方法来处理这个问题。最简单和最令人满意的解决方案是通过未对齐的替代VMOVUPS等替换对齐的内存访问VMOVAPS / PD / DQA。所以我昨晚学习了Python(顺便说一句,这是一个非常好的工具),并拿出了下面的脚本来完成这项工作输入汇编文件由GCC生成:

  import re 
import fileinput
import sys

#修复对齐的堆栈访问
#通过未对齐的vmov *替换对齐的vmov *与32字节对齐的操作数
#请参阅英特尔的AVX编程指南,第39页
vmova = re.compile( (%\\ n)*(%ymm。*?\(%r)))
aligndict = {aps:ups,apd:upd,dqa:dqu};
for fileinput.FileInput(sys.argv [1:],inplace = 1) :
m = vmova.match(line)
如果m和m.group(1)在aligndict中:
s = m.group(1)
print line.replace(vmov + s,vmov+ aligndict [s]),
else:
打印行,

这个a pproach是相当安全和万无一失的。尽管我在罕见的场合观察到了表演处罚。当堆栈未对齐时,内存访问跨越缓存线边界。幸运的是,代码的执行速度与大部分时间对齐访问一样快。我的建议是:在关键循环中嵌入函数



我还尝试使用另一个Python脚本修复每个函数prolog中的堆栈分配,始终将其对齐在32字节的边界。这似乎适用于某些代码,但不适用于其他代码。我必须依靠GCC的良好意愿,它将分配对齐的局部变量(关于堆栈指针),通常它会这样做。情况并非总是如此,特别是当由于必须在函数调用之前保存所有ymm寄存器而造成严重的寄存器溢出时。 (所有的ymm寄存器都是被调用者保存的)。如果有兴趣,我可以发布脚本。



最好的解决方案是修复GCC MinGW64版本。不幸的是,我不知道它的内部运作,上周刚开始使用它。


I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.

But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.

How can I change the GCC's stack alignment to 32 bytes?

I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.

EDIT: I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.

解决方案

I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.

I investigated a couple ways how to deal with this issue. The simplest and bluntest solution is to replace of aligned memory accesses VMOVAPS/PD/DQA by unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:

import re
import fileinput
import sys

# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands 
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:],inplace=1):
    m = vmova.match(line)
    if m and m.group(1) in aligndict:
        s = m.group(1)
        print line.replace("vmov"+s, "vmov"+aligndict[s]),
    else:
        print line,

This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!

I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at the 32-byte boundary. This seems to work for some code, but not for other. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is a serious register spilling due to the necessity to save all ymm register before a function call. (All ymm registers are callee-save). I can post the script if there's an interest.

The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.

这篇关于如何在GCC的32字节边界处对齐堆栈?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆