How can I optimize my AVX implementation of dot product?


Problem Description


I've tried to implement the dot product of these two arrays using AVX, following https://stackoverflow.com/a/10459028, but my code is very slow.

A and xb are arrays of doubles, and n is an even number. Can you help me?

const int mask = 0x31;
int sum =0;

for (int i = 0; i < n; i++)
{
    int ind = i;
    if (i + 8 > n) // padding
    {
        sum += A[ind] * xb[i].x;
        i++;
        ind = n * j + i;
        sum += A[ind] * xb[i].x;
        continue;
    }

    __declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
    __m256d x = _mm256_loadu_pd(&A[ind]);
    __m256d y = _mm256_load_pd(ar);
    i+=4; ind = n * j + i;
    __declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
    __m256d z = _mm256_loadu_pd(&A[ind]);
    __m256d w = _mm256_load_pd(arr);

    __m256d xy = _mm256_mul_pd(x, y);
    __m256d zw = _mm256_mul_pd(z, w);
    __m256d temp = _mm256_hadd_pd(xy, zw);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d low128 = _mm256_extractf128_pd(temp, 0);
    //__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
    __m128d dotproduct = _mm_add_pd(low128, hi128);

    sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
    i += 3;
}

Solution

There are two big inefficiencies in your loop that are immediately apparent:

(1) these two chunks of scalar code:

__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d y = _mm256_load_pd(ar);

and

__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d w = _mm256_load_pd(arr);

should be implemented using SIMD loads and shuffles (or at the very least use _mm256_set_pd and give the compiler a chance to do a half-reasonable job of generating code for a gathered load).
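For example, the gathered load can go through _mm256_set_pd directly. The sketch below is illustrative only: elem_t and load_xb4 are hypothetical names (the question does not show xb's element type), and it assumes xb is an array of structs with a double member x. Note that _mm256_set_pd lists its arguments from the highest element down to element 0, so xb[i].x lands in lane 0:

#include <immintrin.h>

typedef struct { double x, y; } elem_t;  /* hypothetical element type of xb */

/* Gather four consecutive xb[.].x values into a __m256d without bouncing
   through an aligned stack array; the compiler picks the load/shuffle code. */
static inline __m256d load_xb4(const elem_t *xb, int i)
{
    return _mm256_set_pd(xb[i + 3].x, xb[i + 2].x, xb[i + 1].x, xb[i].x);
}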

(2) the horizontal summation at the end of the loop:

for (int i = 0; i < n; i++)
{
    ...
    __m256d xy = _mm256_mul_pd(x, y);
    __m256d zw = _mm256_mul_pd(z, w);
    __m256d temp = _mm256_hadd_pd(xy, zw);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d low128 = _mm256_extractf128_pd(temp, 0);
    //__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
    __m128d dotproduct = _mm_add_pd(low128, hi128);

    sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
    i += 3;
}

should be moved out of the loop:

__m256d xy = _mm256_setzero_pd();
__m256d zw = _mm256_setzero_pd();
...
for (int i = 0; i < n; i++)
{
    ...
    xy = _mm256_add_pd(xy, _mm256_mul_pd(x, y));
    zw = _mm256_add_pd(zw, _mm256_mul_pd(z, w));
    i += 3;
}
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);

sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
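Putting both changes together, a minimal sketch of the whole computation might look like the following. This is an illustration rather than the exact code from the question: it assumes xb is an array of structs with a double member x, assumes n is a multiple of 8, indexes A linearly (dropping the n * j + offset and the undefined j from the original), and stores the final __m128d with _mm_storeu_pd instead of the MSVC-specific m128d_f64 accessor:

#include <immintrin.h>

typedef struct { double x, y; } elem_t;  /* hypothetical element type of xb */

/* Dot product of A[0..n-1] with xb[0..n-1].x; assumes n is a multiple of 8. */
double dot(const double *A, const elem_t *xb, int n)
{
    __m256d xy = _mm256_setzero_pd();
    __m256d zw = _mm256_setzero_pd();

    for (int i = 0; i < n; i += 8)
    {
        /* Two unaligned loads from A, two gathered loads from xb. */
        __m256d x = _mm256_loadu_pd(&A[i]);
        __m256d y = _mm256_set_pd(xb[i + 3].x, xb[i + 2].x, xb[i + 1].x, xb[i].x);
        __m256d z = _mm256_loadu_pd(&A[i + 4]);
        __m256d w = _mm256_set_pd(xb[i + 7].x, xb[i + 6].x, xb[i + 5].x, xb[i + 4].x);

        /* Accumulate in vector registers; no horizontal work inside the loop. */
        xy = _mm256_add_pd(xy, _mm256_mul_pd(x, y));
        zw = _mm256_add_pd(zw, _mm256_mul_pd(z, w));
    }

    /* Single horizontal sum after the loop. */
    __m256d temp = _mm256_hadd_pd(xy, zw);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d low128 = _mm256_extractf128_pd(temp, 0);
    __m128d dotproduct = _mm_add_pd(low128, hi128);

    double result[2];
    _mm_storeu_pd(result, dotproduct);
    return result[0] + result[1];
}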
