SSE 指令添加数组的所有元素 [英] SSE instructions to add all elements of an array

查看:24
本文介绍了SSE 指令添加数组的所有元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是 SSE2 指令的新手.我找到了一个指令 _mm_add_epi8 可以添加两个数组元素.但我想要一个可以添加数组所有元素的 SSE 指令.

I am new to SSE2 instructions. I have found an instruction _mm_add_epi8 which can add two array elements. But I want an SSE instruction which can add all elements of an array.

我试图用这段代码来发展这个概念:

I was trying to develop this concept using this code:

#include <iostream>
#include <conio.h>
#include <emmintrin.h>

void sse(unsigned char* a,unsigned char* b); 

void main()
{
    /*unsigned char *arr;
    arr=(unsigned char *)malloc(50);*/

    unsigned char arr[]={'a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r','a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r'};
    unsigned char *next_arr=arr+16;
    for(int i=0;i<16;i++)
          printf("%d,%c   ",next_arr[i],next_arr[i]);
    sse(arr,next_arr);

    getch();
}

void sse(unsigned char* a,unsigned char* b)                                                                                                                                                                          
{                                                                                                                                                                                                                                                                                                                                                                                            
  __m128i* l = (__m128i*)a;                                                                                                                                                                                      
  __m128i* r = (__m128i*)b; 
  __m128i result;

      result= _mm_add_epi8(*l, *r);

      unsigned char *p;
         p=(unsigned char *)&result;

        for(int i=0;i<16;i++)
          printf("%d ",p[i]);

         printf("\n");
         l=(__m128i*)p;
         r=(__m128i*)(p+8);         
         result=_mm_add_epi8(*l, *r);
         p=(unsigned char *)&result;
         printf("%d ",p[0]);

         l=(__m128i*)p;
         r=(__m128i*)(p+4);
         result=_mm_add_epi8(*l, *r);
         p=(unsigned char *)&result;
         l=(__m128i*)p;
         r=(__m128i*)(p+2);
         result=_mm_add_epi8(*l, *r);
         p=(unsigned char *)&result;
         l=(__m128i*)p;
         r=(__m128i*)(p+1);
         result=_mm_add_epi8(*l, *r);
          p=(unsigned char *)&result;
            printf("result =%d ",p[0]);
}

那么谁能告诉我如何使用 SSE2 指令添加数组的所有元素?

So can anybody please tell me how it is possible to add all elements of an array using SSE2 instructions ?

任何帮助将不胜感激.

推荐答案

如果你只是想对一个数组的所有元素求和,那么你需要加载数据,将它解包到更大的元素大小,然后对解包的元素求和元素.请注意,您可以保持多个部分总和直到循环结束,然后只对这些部分总和进行最后一个总和.例如:

If you just want to sum all the elements of an array then you need to load the data, unpack it to a wider element size, and then sum the unpacked elements. Note that you can maintain multiple partial sums until after the loop and then just do one final sum of these partial sums. For example:

uint32_t sum_array(const uint8_t a[], int n)
{
    const __m128i vk0 = _mm_set1_epi8(0);       // constant vector of all 0s for use with _mm_unpacklo_epi8/_mm_unpackhi_epi8
    const __m128i vk1 = _mm_set1_epi16(1);      // constant vector of all 1s for use with _mm_madd_epi16
    __m128i vsum = _mm_set1_epi32(0);           // initialise vector of four partial 32 bit sums
    uint32_t sum;
    int i;

    for (i = 0; i < n; i += 16)
    {
        __m128i v = _mm_load_si128(&a[i]);      // load vector of 8 bit values
        __m128i vl = _mm_unpacklo_epi8(v, vk0); // unpack to two vectors of 16 bit values
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
                                                // unpack and accumulate 16 bit values to
                                                // 32 bit partial sum vector

    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
    return sum;
}

请注意,上面的代码中有一个不明显的技巧——而不是将每个 16 位向量进一步解包为一对 32 位向量(需要 4 条解包指令),然后使用四个 32 位相加(另外 4 条指令),我们使用 _mm_madd_epi16 (PMADDWD) 和 1 的被乘数和 _mm_add_epi32 来有效地给我们免费解包,所以我们使用 4 得到相同的结果指令而不是 8.

Note that there is one non-obvious trick in the above code - rather than further unpacking each 16 bit vector to a pair of 32 bit vectors (requiring 4 unpack instructions) and then using four 32 bit adds (another 4 instructions), we use _mm_madd_epi16 (PMADDWD) with a multiplicand of 1 and _mm_add_epi32 to effectively give us free unpacking, so we get the same result using 4 instructions instead of 8.

还要注意输入数组a[]需要16字节对齐,n应该是16的倍数.

Note also that the input array, a[], needs to be 16 byte aligned, and n should be a multiple of 16.

这篇关于SSE 指令添加数组的所有元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆