How do you load/store from/to an array of doubles with GNU C Vector Extensions?


Question

I'm using GNU C Vector Extensions, not Intel's _mm_* intrinsics.

I want to do the same thing as Intel's _mm256_loadu_pd intrinsic. Assigning the values one by one is slow: gcc produces code with 4 separate load instructions, rather than one single vmovupd (which _mm256_loadu_pd generates).

typedef double vector __attribute__((vector_size(4 * sizeof(double))));

int main(int argc, char **argv) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    vector v;

    /* I currently do this */
    v[0] = a[0];
    v[1] = a[1];
    v[2] = a[2];
    v[3] = a[3];
}

I want something like this:

v = (vector)(a);

or

v = *((vector*)(a));

but neither works. The first fails with "can't convert value to a vector", while the second segfaults.

Solution

update: I see you're using GNU C's native vector syntax, not Intel intrinsics. Are you avoiding Intel intrinsics for portability to non-x86? gcc currently does a bad job compiling code that uses GNU C vectors wider than the target machine supports. (You'd hope that it would just use two 128b vectors and operate on each separately, but apparently it's worse than that.)

Anyway, this answer shows how you can use Intel x86 intrinsics to load data into GNU C vector-syntax types.


First of all, looking at compiler output at less than -O2 is a waste of time if you're trying to learn anything about what will compile to good code. Your main() will optimize to just a ret at -O2.

Besides that, it's not totally surprising that you get bad asm from assigning elements of a vector one at a time.


Aside: normal people would call the type v4df (vector of 4 Double Float) or something, not vector, so they don't go insane when using it with C++ std::vector. For single-precision, v8sf. IIRC, gcc uses type names like this internally for __m256d.

On x86, Intel intrinsic types (like __m256d) are implemented on top of GNU C vector syntax (which is why you can do v1 * v2 in GNU C instead of writing _mm256_mul_pd(v1, v2)). You can convert freely from __m256d to v4df, like I've done here.
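As a small illustration of that native operator syntax (a minimal sketch; no intrinsics needed, and gcc lowers the multiply to vmulpd when compiled with -mavx, or to a pair of 128-bit multiplies otherwise):

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

// Element-wise multiply using GNU C vector operators; with AVX enabled
// this is the same operation as _mm256_mul_pd(x, y).
v4df v4df_mul(v4df x, v4df y) {
    return x * y;
}
```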

I've wrapped both sane ways to do this in functions, so we can look at their asm. Notice how we're not loading from an array that we define inside the same function, so the compiler won't optimize it away.

I put them on the Godbolt compiler explorer so you can look at the asm with various compile options and compiler versions.

typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

#include <immintrin.h>

// note the return types.  gcc6.1 compiles with no warnings, even at -Wall -Wextra
v4df load_4_doubles_intel(const double *p) { return _mm256_loadu_pd(p); }
    vmovupd ymm0, YMMWORD PTR [rdi]   # tmp89,* p
    ret

v4df avx_constant() { return _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 ); }
    vmovapd ymm0, YMMWORD PTR .LC0[rip]
    ret

If the args to _mm_set* intrinsics aren't compile-time constants, the compiler will do the best it can to make efficient code to get all the elements into a single vector. It's usually best to do that rather than writing C that stores to a tmp array and loads from it, because that's not always the best strategy. (Store-forwarding failure on multiple narrow stores forwarding to a wide load costs an extra ~10 cycles (IIRC) of latency on top of the usual store-forwarding delay. If your doubles are already in registers, it's usually best to just shuffle them together.)
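When the inputs are runtime values rather than constants, the same idea can be expressed in pure GNU C with a vector initializer (a sketch; the function name is mine). At -O2, gcc builds the vector with insert/shuffle instructions rather than bouncing the values through a temporary array on the stack:

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

// Pack four scalars (possibly already in registers) into one vector.
// With AVX this compiles to a shuffle/insert sequence, the same kind
// of code gcc emits for _mm256_setr_pd with non-constant args.
v4df pack_doubles(double a, double b, double c, double d) {
    return (v4df){ a, b, c, d };
}
```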


See also Is it possible to cast floats directly to __m128 if they are 16 byte alligned? (https://stackoverflow.com/questions/11759791/is-it-possible-to-cast-floats-directly-to-m128-if-they-are-16-byte-alligned/11766098#11766098) for a list of the various intrinsics for getting a single scalar into a vector. The x86 tag wiki has links to Intel's manuals, and their intrinsics finder.


Load/store GNU C vectors without Intel intrinsics:

I'm not sure how you're "supposed" to do that. This Q&A suggests casting a pointer to the memory you want to load, and using a vector type like typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16; (note the aligned(1) attribute).
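Adapting that suggestion to doubles might look like the following (a sketch based on that Q&A, not from the original answer; the v4df_u name is made up here). The aligned(1) attribute tells the compiler the pointed-to data has no alignment guarantee, so dereferencing compiles to an unaligned load:

```c
typedef double v4df   __attribute__((vector_size(4 * sizeof(double))));
typedef double v4df_u __attribute__((vector_size(4 * sizeof(double)),
                                     aligned(1)));

// Dereferencing a v4df_u* is safe even if p isn't 32-byte aligned;
// with AVX enabled, gcc emits vmovupd instead of vmovapd.
v4df load_v4df_unaligned(const double *p) {
    return *(const v4df_u *)p;
}
```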

You get a segfault from *(v4df *)a because presumably a isn't aligned on a 32-byte boundary, but you're using a vector type that does assume natural alignment. (Just like __m256d if you dereference a pointer to it instead of using load/store intrinsics to communicate alignment info to the compiler.)
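Another portable option (my own addition, not from the original answer) is to memcpy between the array and the vector object. A fixed-size memcpy is well-defined for any alignment, and gcc/clang fold it into a single unaligned vector load or store at -O2:

```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

// Unaligned load/store via memcpy: no alignment assumptions, and
// optimizes to vmovupd when AVX code generation is enabled.
v4df load4(const double *p) {
    v4df v;
    __builtin_memcpy(&v, p, sizeof v);
    return v;
}

void store4(double *p, v4df v) {
    __builtin_memcpy(p, &v, sizeof v);
}
```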
