What is the fastest way to compute large dot products?

Views: 46 | Published: 2021/8/27 19:46:11 | Tags: optimization assembly sse avx dot-product

Question

Consider this snippet:

double dot(double* a, double* b, int n) {
  double sum = 0;
  for (int i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

How can I speed it up using intrinsics or assembler?

Notes:

You may assume the latest architecture, including AVX extensions.
n is a few hundred.
dot itself will be called in a tight loop.

Recommended answer

Here is a simple SSE implementation:

#include <pmmintrin.h>  // SSE3, needed for _mm_hadd_pd

__m128d vsum = _mm_set1_pd(0.0);
double sum = 0.0;
int k;

// process 2 elements per iteration
for (k = 0; k < n - 1; k += 2)
{
    __m128d va = _mm_loadu_pd(&a[k]);
    __m128d vb = _mm_loadu_pd(&b[k]);
    __m128d vs = _mm_mul_pd(va, vb);
    vsum = _mm_add_pd(vsum, vs);
}

// horizontal sum of 2 partial dot products
vsum = _mm_hadd_pd(vsum, vsum);
_mm_store_sd(&sum, vsum);

// clean up any remaining elements
for ( ; k < n; ++k)
{
    sum += a[k] * b[k];
}

Note that if you can guarantee that a and b are 16-byte aligned, you can use _mm_load_pd rather than _mm_loadu_pd, which may help performance, particularly on older (pre-Nehalem) CPUs.
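To illustrate the aligned-load variant, here is a minimal sketch of the SSE loop wrapped in a function. The name dot_sse2_aligned and the alignment assert are my own additions, not part of the original answer, and the horizontal sum uses SSE2-only intrinsics instead of _mm_hadd_pd so that no SSE3 support is needed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_mul_pd, ... */
#include <stdint.h>

/* Hypothetical helper (not from the original answer): dot product that
   requires a and b to be 16-byte aligned so that _mm_load_pd is legal. */
double dot_sse2_aligned(const double *a, const double *b, int n)
{
    /* _mm_load_pd faults on misaligned addresses, so check in debug builds. */
    assert(((uintptr_t)a & 15) == 0 && ((uintptr_t)b & 15) == 0);

    __m128d vsum = _mm_setzero_pd();
    int k;

    /* &a[k] stays aligned because k advances by 2 doubles (16 bytes). */
    for (k = 0; k < n - 1; k += 2)
    {
        __m128d va = _mm_load_pd(&a[k]);
        __m128d vb = _mm_load_pd(&b[k]);
        vsum = _mm_add_pd(vsum, _mm_mul_pd(va, vb));
    }

    /* SSE2-only horizontal sum: add the high lane onto the low lane. */
    __m128d hi = _mm_unpackhi_pd(vsum, vsum);
    double sum = _mm_cvtsd_f64(_mm_add_sd(vsum, hi));

    /* clean up the remaining element, if any */
    for ( ; k < n; ++k)
    {
        sum += a[k] * b[k];
    }
    return sum;
}
```

One way to get suitably aligned storage is to declare the arrays with _Alignas(16) (C11) or to allocate them with aligned_alloc; plain malloc is not guaranteed by the C standard to return 16-byte aligned memory on every platform.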

Note also that for loops such as this, where there are very few arithmetic instructions relative to the number of loads, performance may well be limited by memory bandwidth, and the expected speed-up from vectorization may not be realised in practice.
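Whether the loop really is bandwidth-bound on a given machine is best settled by measuring. The following is only a crude timing sketch: dot_fn and seconds_per_call are names I made up for illustration, clock() has coarse resolution, and a serious benchmark would need warm-up runs and many repetitions.

```c
#include <time.h>

/* Signature shared by the dot-product variants being compared. */
typedef double (*dot_fn)(const double *a, const double *b, int n);

/* Scalar reference, same loop as in the question. */
double dot_scalar(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: average seconds per call over `reps` calls. */
double seconds_per_call(dot_fn f, const double *a, const double *b,
                        int n, int reps)
{
    /* The volatile pointers are re-read on every iteration and the volatile
       sink is written on every iteration, so the compiler cannot hoist or
       discard the calls. */
    const double *volatile pa = a;
    const double *volatile pb = b;
    volatile double sink = 0.0;

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) sink += f(pa, pb, n);
    clock_t t1 = clock();

    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Timing the scalar and vectorized versions at the n you actually use, and again at an n far larger than the cache, should show how much of the vectorization gain survives once the data no longer comes from cache.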

If you want to target CPUs with AVX, it's a fairly straightforward conversion from the above SSE implementation to process 4 elements per iteration rather than 2:

#include <immintrin.h>  // AVX

__m256d vsum = _mm256_set1_pd(0.0);
double sum = 0.0;
int k;

// process 4 elements per iteration
for (k = 0; k < n - 3; k += 4)
{
    __m256d va = _mm256_loadu_pd(&a[k]);
    __m256d vb = _mm256_loadu_pd(&b[k]);
    __m256d vs = _mm256_mul_pd(va, vb);
    vsum = _mm256_add_pd(vsum, vs);
}

// horizontal sum of 4 partial dot products
vsum = _mm256_hadd_pd(_mm256_permute2f128_pd(vsum, vsum, 0x20), _mm256_permute2f128_pd(vsum, vsum, 0x31));
vsum = _mm256_hadd_pd(_mm256_permute2f128_pd(vsum, vsum, 0x20), _mm256_permute2f128_pd(vsum, vsum, 0x31));
_mm_store_sd(&sum, _mm256_castpd256_pd128(vsum));  // every lane holds the total; store the lowest

// clean up any remaining elements
for ( ; k < n; ++k)
{
    sum += a[k] * b[k];
}
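For reference, here is a sketch of the AVX loop packaged as a complete function, with a different horizontal reduction that folds the upper 128 bits onto the lower half. The name dot_avx is my own, and the target attribute is a GCC/Clang extension that I am assuming is available; with other compilers you would drop it and enable AVX for the whole file instead. It must only be called on a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The target attribute enables AVX code
   generation for this one function without a global -mavx flag. */
__attribute__((target("avx")))
double dot_avx(const double *a, const double *b, int n)
{
    __m256d vsum = _mm256_setzero_pd();
    int k;

    /* process 4 elements per iteration */
    for (k = 0; k < n - 3; k += 4)
    {
        __m256d va = _mm256_loadu_pd(&a[k]);
        __m256d vb = _mm256_loadu_pd(&b[k]);
        vsum = _mm256_add_pd(vsum, _mm256_mul_pd(va, vb));
    }

    /* fold the upper 128 bits onto the lower 128 bits ... */
    __m128d lo = _mm256_castpd256_pd128(vsum);
    __m128d hi = _mm256_extractf128_pd(vsum, 1);
    __m128d s  = _mm_add_pd(lo, hi);
    /* ... then add the two remaining lanes together */
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));
    double sum = _mm_cvtsd_f64(s);

    /* clean up any remaining elements */
    for ( ; k < n; ++k)
    {
        sum += a[k] * b[k];
    }
    return sum;
}
```

Note that the vectorized versions add the products in a different order from the scalar loop, so results can differ from it in the last bits for general floating-point inputs.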
