How to write C++ code that the compiler can efficiently compile to SSE or AVX?

Question

Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer at compile time (SSE requires 16-byte alignment, AVX 32-byte alignment)? Or is the memory alignment of the data irrelevant for optimal SIMD code, so that data alignment only affects cache performance?

If alignment is important for the generated code, how can I let the (Visual C++) compiler know that I intend to pass only values with a certain alignment to the function?

Recommended answer

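In theory, alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which it does not matter whether a pointer is aligned or not.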

Since Nehalem, unaligned load/store instructions have had the same performance as their aligned counterparts on Intel processors. However, until AVX arrived with Sandy Bridge, unaligned loads could not be folded into another operation for micro-op fusion.

Additionally, even before AVX, 16-byte-aligned memory could still help by avoiding the penalty of cache-line splits, so it would still be reasonable for a compiler to add code that runs until the pointer is 16-byte aligned.

Since AVX, there is no longer any advantage to using aligned load/store instructions, so a compiler has no reason to add code to make a pointer 16-byte or 32-byte aligned just to be able to use them.

However, there is still a reason to use aligned memory with AVX: avoiding cache-line splits. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.

So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.

I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC and also GCC you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.

For example, with GCC, compiling this simple loop

void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 16);
    //b = (float*)__builtin_assume_aligned (b, 16);
    for(int i=0; i<(n & (-4)); i++) {
        b[i] = 3.14159f*a[i];
    }
}

with gcc -O3 -march=nehalem -S test.c and then running wc test.s gives 160 lines, whereas with the __builtin_assume_aligned lines uncommented wc test.s gives only 45 lines. With clang, I got 110 lines in both cases.

So with clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of assembly is not a sufficient metric to gauge performance, but I'm not going to post all the assembly here. I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.

Of course, the additional overhead that GCC incurs when it cannot assume the arrays are aligned may make no difference in practice. You have to test and see.

In any case, if you want to get the most from SIMD, I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (though maybe not in some special cases), since it is memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help that does not conform to the C++ standard. In these cases you will need intrinsics, built-ins, or assembly, in which you have explicit control of the alignment anyway.

The assembly from GCC contains a lot of extraneous lines which are not part of the text segment. Compiling with gcc -O3 -march=nehalem to an object file (-c rather than -S, since objdump works on object files), then using objdump -d and counting the lines in the text (code) segment, gives 108 lines without __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.

I went ahead and tested the foo function above in MSVC 2013. It produces unaligned loads, and the code is much shorter than GCC's (I only show the main loop here):

$LL3@foo:
    movsxd  rax, r9d
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rax*4]
    vmovups XMMWORD PTR [r11+rax*4], xmm1
    lea eax, DWORD PTR [r9+4]
    add r9d, 8
    movsxd  rcx, eax
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
    vmovups XMMWORD PTR [r11+rcx*4], xmm1
    cmp r9d, edx
    jl  SHORT $LL3@foo

This should be fine on processors since Nehalem (late 2008). But MSVC still emits cleanup code for arrays whose length is not a multiple of four, even though I told the compiler that it was a multiple of four (n & (-4)). At least GCC gets that right.

Since AVX can fold unaligned loads, I checked GCC with AVX to see if the code was the same.

void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 32);
    //b = (float*)__builtin_assume_aligned (b, 32);
    for(int i=0; i<(n & (-8)); i++) {
        b[i] = 3.14159f*a[i];
    }
}

Without __builtin_assume_aligned GCC produces 168 lines of assembly, and with it only 17 lines.
