Fastest way to multiply two vectors of 32-bit integers in C++, with SSE


Problem description


I have two vectors of unsigned integers, both of size 4:

vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};

Now I want to multiply these two vectors element-wise and get a new one:

vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};

Which SSE operation should I use? Is it cross-platform, or available only on specific platforms?

Addendum: if my goal were addition rather than multiplication, I could do this very fast:

__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);

Solution

Using the set intrinsics such as _mm_set_epi32 to fill every element is inefficient; it is better to use the load intrinsics. For more on that, see the discussion "Where does the SSE instructions outperform normal instructions". If the arrays are 16-byte aligned, you can use either _mm_load_si128 or _mm_loadu_si128 (on aligned memory the two have nearly the same efficiency); otherwise you must use _mm_loadu_si128. Aligned memory is still the more efficient choice, because loads from unaligned addresses can be slower. To get aligned memory I recommend _mm_malloc with _mm_free, or C11 aligned_alloc, which lets you release the memory with the normal free.
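As a rough illustration of the load intrinsics mentioned above, the sketch below redoes the addition from the question with loads instead of _mm_set_epi32. The function names add4_unaligned and add4_aligned are made up for this example, and the aligned variant copies its inputs into an _mm_malloc buffer only to demonstrate _mm_load_si128; real code would allocate its arrays aligned from the start.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics; also pulls in _mm_malloc/_mm_free on common compilers
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: adds the first four lanes of x and y.
// std::vector storage is not guaranteed to be 16-byte aligned,
// so the unaligned load/store intrinsics are used.
std::array<std::uint32_t, 4> add4_unaligned(const std::vector<std::uint32_t>& x,
                                            const std::vector<std::uint32_t>& y) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x.data()));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y.data()));
    std::array<std::uint32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), _mm_add_epi32(a, b));
    return out;
}

// Hypothetical helper: same addition, but through a 16-byte-aligned
// buffer obtained from _mm_malloc, so the aligned load/store can be used.
std::array<std::uint32_t, 4> add4_aligned(const std::uint32_t* x, const std::uint32_t* y) {
    std::array<std::uint32_t, 4> out{};
    // Room for two inputs and one result, 4 lanes each, aligned to 16 bytes.
    auto* buf = static_cast<std::uint32_t*>(_mm_malloc(12 * sizeof(std::uint32_t), 16));
    if (buf == nullptr) return out;  // allocation failed; return zeros
    std::memcpy(buf, x, 4 * sizeof(std::uint32_t));
    std::memcpy(buf + 4, y, 4 * sizeof(std::uint32_t));
    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
    __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(buf + 4));
    _mm_store_si128(reinterpret_cast<__m128i*>(buf + 8), _mm_add_epi32(a, b));
    std::memcpy(out.data(), buf + 8, 4 * sizeof(std::uint32_t));
    _mm_free(buf);  // memory from _mm_malloc must be released with _mm_free
    return out;
}
```

Both helpers assume each input holds at least four elements; neither checks it.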


To answer the rest of your question, let's assume you have your two vectors loaded in SSE registers __m128i a and __m128i b.

With SSE4.1 or later, use:

_mm_mullo_epi32(a, b);
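For context, here is a minimal sketch of how that single intrinsic might be wrapped to multiply the two vectors from the question. The name mul4_sse41 and the SSE41_TARGET macro are inventions for this example. The target attribute is a GCC/Clang feature that lets the function compile without passing -msse4.1 globally; on other compilers the macro expands to nothing, and the CPU must support SSE4.1 at run time in either case.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32
#include <array>
#include <cstdint>

// Assumption: GCC or Clang, where a per-function target attribute enables SSE4.1.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Hypothetical helper: element-wise product of four 32-bit lanes.
// _mm_mullo_epi32 keeps the low 32 bits of each product, which matches
// unsigned multiplication modulo 2^32.
SSE41_TARGET
std::array<std::uint32_t, 4> mul4_sse41(const std::uint32_t* x, const std::uint32_t* y) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
    __m128i prod = _mm_mullo_epi32(a, b);
    std::array<std::uint32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), prod);
    return out;
}
```

Called with the question's data, {2, 4, 6, 8} and {1, 10, 11, 13}, this should yield {2, 40, 66, 104}.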


Without SSE4.1:

This code is copied from Agner Fog's Vector Class Library (and was plagiarized by the original author of this answer):

// Vec4i operator * (Vec4i const & a, Vec4i const & b) {
// #ifdef
__m128i a13    = _mm_shuffle_epi32(a, 0xF5);          // (-,a3,-,a1)
__m128i b13    = _mm_shuffle_epi32(b, 0xF5);          // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                 // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);             // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02,prod13);   // (-,-,a1*b1,a0*b0) 
__m128i prod23 = _mm_unpackhi_epi32(prod02,prod13);   // (-,-,a3*b3,a2*b2) 
__m128i prod   = _mm_unpacklo_epi64(prod01,prod23);   // (ab3,ab2,ab1,ab0)
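To try the sequence above in isolation, it can be wrapped in a function as in the following sketch. The names mullo_epi32_sse2 and mul4_sse2 are hypothetical and not part of the Vector Class Library; only the seven intrinsic calls come from the listing. Because _mm_mul_epu32 multiplies only lanes 0 and 2, the odd lanes are first shuffled into even positions, multiplied separately, and the low 32 bits of all four 64-bit products are then interleaved back into order.

```cpp
#include <emmintrin.h>  // SSE2 only
#include <array>
#include <cstdint>

// Hypothetical wrapper around the SSE2 sequence shown above:
// returns the low 32 bits of each of the four lane products.
inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i a13    = _mm_shuffle_epi32(a, 0xF5);           // (-,a3,-,a1)
    __m128i b13    = _mm_shuffle_epi32(b, 0xF5);           // (-,b3,-,b1)
    __m128i prod02 = _mm_mul_epu32(a, b);                  // (-,a2*b2,-,a0*b0)
    __m128i prod13 = _mm_mul_epu32(a13, b13);              // (-,a3*b3,-,a1*b1)
    __m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);   // (-,-,a1*b1,a0*b0)
    __m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);   // (-,-,a3*b3,a2*b2)
    return _mm_unpacklo_epi64(prod01, prod23);             // (ab3,ab2,ab1,ab0)
}

// Hypothetical helper: loads four lanes from each input, multiplies, stores.
std::array<std::uint32_t, 4> mul4_sse2(const std::uint32_t* x, const std::uint32_t* y) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
    std::array<std::uint32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), mullo_epi32_sse2(a, b));
    return out;
}
```

Since SSE2 is part of the x86-64 baseline, this version needs no extra compiler flags on 64-bit x86 targets; it is not portable to non-x86 architectures.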
