SSE instructions to add all elements of an array

Problem description

I am new to the SSE2 instruction set. I have found the intrinsic _mm_add_epi8, which adds two vectors of 8 bit elements element by element. But I want an SSE instruction which can add up all the elements of an array.

I was trying to develop this concept using this code:

#include <cstdio>
#include <emmintrin.h>

void sse(unsigned char* a, unsigned char* b);

int main()
{
    unsigned char arr[] = {'a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r',
                           'a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r'};
    unsigned char* next_arr = arr + 16;   // second group of 16 bytes

    for (int i = 0; i < 16; i++)
        printf("%d,%c   ", next_arr[i], next_arr[i]);
    sse(arr, next_arr);

    return 0;
}

void sse(unsigned char* a, unsigned char* b)
{
    // Unaligned loads: a and b are not guaranteed to be 16 byte aligned
    __m128i l = _mm_loadu_si128((const __m128i*)a);
    __m128i r = _mm_loadu_si128((const __m128i*)b);

    // Element-wise 8 bit add of the two groups (each sum wraps modulo 256)
    __m128i result = _mm_add_epi8(l, r);

    unsigned char p[16];
    _mm_storeu_si128((__m128i*)p, result);
    for (int i = 0; i < 16; i++)
        printf("%d ", p[i]);
    printf("\n");

    // Attempted horizontal sum: fold the upper bytes onto the lower bytes,
    // shifting the register by 8, 4, 2 and then 1 bytes
    result = _mm_add_epi8(result, _mm_srli_si128(result, 8));
    printf("%d ", _mm_cvtsi128_si32(result) & 0xFF);

    result = _mm_add_epi8(result, _mm_srli_si128(result, 4));
    result = _mm_add_epi8(result, _mm_srli_si128(result, 2));
    result = _mm_add_epi8(result, _mm_srli_si128(result, 1));

    // Byte 0 now holds the sum of all 32 input bytes, truncated to 8 bits
    printf("result =%d ", _mm_cvtsi128_si32(result) & 0xFF);
}

So can anybody please tell me how to add all the elements of an array using SSE2 instructions?

Any help will be appreciated.

Recommended answer

If you just want to sum all the elements of an array then you need to load the data, unpack it to a wider element size (so that the sums do not overflow 8 bits), and then sum the unpacked elements. Note that you can maintain multiple partial sums inside the loop and then do one final sum of these partial sums after the loop. For example:

#include <stdint.h>
#include <emmintrin.h>                          // SSE2 intrinsics

uint32_t sum_array(const uint8_t a[], int n)
{
    const __m128i vk0 = _mm_set1_epi8(0);       // constant vector of all 0s for use with _mm_unpacklo_epi8/_mm_unpackhi_epi8
    const __m128i vk1 = _mm_set1_epi16(1);      // constant vector of all 1s for use with _mm_madd_epi16
    __m128i vsum = _mm_set1_epi32(0);           // initialise vector of four partial 32 bit sums
    uint32_t sum;
    int i;

    for (i = 0; i < n; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&a[i]); // aligned load of 16 x 8 bit values
        __m128i vl = _mm_unpacklo_epi8(v, vk0); // unpack to two vectors of 16 bit values
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        // widen the 16 bit values to 32 bits with _mm_madd_epi16 and
        // accumulate them into the 32 bit partial sum vector
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = (uint32_t)_mm_cvtsi128_si32(vsum);
    return sum;
}

Note that there is one non-obvious trick in the above code: rather than further unpacking each 16 bit vector to a pair of 32 bit vectors (requiring 4 unpack instructions) and then using four 32 bit adds (another 4 instructions), we use _mm_madd_epi16 (PMADDWD) with a multiplier of 1, followed by _mm_add_epi32. PMADDWD multiplies each 16 bit element by 1 and adds adjacent pairs into 32 bit results, which effectively gives us the unpacking for free, so we get the same result using 4 instructions instead of 8.
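
To make the PMADDWD step concrete, here is a minimal sketch of a single iteration of that loop in isolation. The helper name widen_sum_u8x16 is hypothetical (it does not appear in the answer), and it uses an unaligned load so that it can be called on any 16 byte buffer. It reduces 16 unsigned bytes to four 32 bit partial sums, each being the sum of four of the input bytes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: reduce 16 unsigned bytes to four 32 bit partial sums. */
static void widen_sum_u8x16(const uint8_t in[16], uint32_t out[4])
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);

    __m128i v  = _mm_loadu_si128((const __m128i *)in); /* unaligned load */
    __m128i lo = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  -> 8 x 16 bit */
    __m128i hi = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 -> 8 x 16 bit */

    /* PMADDWD: multiply each 16 bit element by 1 and add adjacent pairs,
       turning 8 x 16 bit into 4 x 32 bit without extra unpack steps. */
    __m128i s = _mm_add_epi32(_mm_madd_epi16(lo, ones),
                              _mm_madd_epi16(hi, ones));

    _mm_storeu_si128((__m128i *)out, s);
}
```

With the input bytes 0, 1, ..., 15, lane 0 receives 0 + 1 from the low half and 8 + 9 from the high half, so the lanes do not hold runs of four adjacent bytes; only the total over all four lanes matters for the final sum.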

Note also that the input array, a[], needs to be 16 byte aligned (because of _mm_load_si128), and n should be a multiple of 16; otherwise the last iteration reads past the end of the array.
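
If those two restrictions are inconvenient, one possible way around them is sketched below. This is an assumption-laden variant rather than part of the answer: the function name sum_bytes_any is hypothetical, it swaps the aligned load for _mm_loadu_si128 (which accepts any address), and it adds the leftover n % 16 bytes with a plain scalar loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant: no alignment or length requirement on the input. */
static uint32_t sum_bytes_any(const uint8_t *a, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsum = _mm_setzero_si128();   /* four 32 bit partial sums */
    size_t i = 0;

    /* Vector loop over complete 16 byte blocks only. */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i)); /* unaligned load */
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(_mm_unpacklo_epi8(v, zero), ones));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(_mm_unpackhi_epi8(v, zero), ones));
    }

    /* Horizontal add of the four 32 bit partial sums. */
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    /* Scalar tail: the remaining n % 16 bytes. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

The unaligned load may be slower than _mm_load_si128 on some older processors, so when the caller can guarantee alignment and padding, the original version remains a reasonable choice.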
