Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?


Problem description




In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

// b is a 64-byte aligned array of double
__m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
__m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
__m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
__m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);

I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU actually issues 4 separate loads from the L1 cache. Although at this point I'm not limited by L1 latency or bandwidth, I'm very interested to know if there is a way to load the 4 doubles with one 256-bit load (or two 128-bit loads) and shuffle them into 4 YMM registers. I looked through the Intel Intrinsics Guide but couldn't find a way to accomplish the shuffling required. Is that possible?

(If the premise that the CPU doesn't combine the 4 consecutive loads is actually wrong, please let me know.)

Solution

In my matrix multiplication code I only have to use the broadcast once per kernel, but if you really want to load four doubles in one instruction and then broadcast them to four registers, you can do it like this:

#include <stdio.h>
#include <immintrin.h>

int main() {
    double in[] = {1,2,3,4};
    double out[4];
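    __m256d x4 = _mm256_loadu_pd(in);                   // one 256-bit load: {1,2,3,4}
    __m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);   // low 128 bits in both lanes: {1,2,1,2}
    __m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);  // high 128 bits in both lanes: {3,4,3,4}
    __m256d broad1 = _mm256_permute_pd(t1,0);           // {1,1,1,1}
    __m256d broad2 = _mm256_permute_pd(t1,0xf);         // {2,2,2,2}
    __m256d broad3 = _mm256_permute_pd(t2,0);           // {3,3,3,3}
    __m256d broad4 = _mm256_permute_pd(t2,0xf);         // {4,4,4,4}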

    _mm256_storeu_pd(out,broad1);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out,broad2);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out,broad3);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out,broad4);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
}

Edit: Here is another solution based on Paul R's suggestion.

__m256d t1 = _mm256_broadcast_pd((const __m128d*)&b[4*k+0]);
__m256d t2 = _mm256_broadcast_pd((const __m128d*)&b[4*k+2]);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);
