SSE Instructions: Byte+Short
Problem description
I have very long byte arrays that need to be added to a destination array of type short (or int).
Does such an SSE instruction exist? Or maybe a set of them?
Recommended answer
You need to unpack each vector of 8-bit values into two vectors of 16-bit values and then add those.
__m128i v = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
__m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }
where v is a vector of 16 x 8-bit values and vl, vh are the two unpacked vectors of 8 x 16-bit values.
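To connect the unpacking above back to the original question, here is a minimal sketch of adding a byte array element-wise into a 16-bit destination array. The function name add_u8_to_u16 is hypothetical, the bytes are treated as unsigned, unaligned loads and stores are used so no alignment is assumed, and the 16-bit sums simply wrap around on overflow.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n elements.
   Bytes are zero-extended (unsigned); 16-bit results wrap on overflow. */
void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    /* Process 16 source bytes (two vectors of 8 x 16-bit) per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  -> 16 bit */
        __m128i vh = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 -> 16 bit */
        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }

    /* Scalar tail for the remaining n % 16 elements. */
    for (; i < n; ++i)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

For an int destination, the same idea would need a second unpacking step from 16 to 32 bits, giving four vectors of 4 x 32-bit values per 16 input bytes.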
Note that I'm assuming that the 8-bit values are unsigned, so when unpacking to 16 bits the high byte is set to 0 (i.e. no sign extension).
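If the bytes are signed instead, the high byte has to be a copy of the sign rather than 0. One common SSE2 way to do this, sketched here as an assumption rather than as part of the original answer, is to build a sign mask with _mm_cmpgt_epi8 and interleave with that mask instead of zero. The function name add_i8_to_i16 is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] with signed bytes (sign-extended). */
void add_i8_to_i16(int16_t *dst, const int8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(src + i));
        /* sign is 0xFF in every lane where the byte is negative, else 0x00 */
        __m128i sign = _mm_cmpgt_epi8(zero, v);
        __m128i vl   = _mm_unpacklo_epi8(v, sign);  /* sign-extended low half  */
        __m128i vh   = _mm_unpackhi_epi8(v, sign);  /* sign-extended high half */
        __m128i dl   = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh   = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }

    /* Scalar tail for the remaining n % 16 elements. */
    for (; i < n; ++i)
        dst[i] = (int16_t)(dst[i] + src[i]);
}
```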
If you want to sum a lot of these vectors and get a 32-bit result, then a useful trick is to use _mm_madd_epi16 with a multiplier of 1, e.g.
__m128i vsuml = _mm_set1_epi32(0);
__m128i vsumh = _mm_set1_epi32(0);
__m128i vsum;
int sum;
// x is assumed to be a 16-byte-aligned array of N unsigned bytes, with N a multiple of 16
for (int i = 0; i < N; i += 16)
{
    __m128i v = _mm_load_si128((const __m128i *)&x[i]); // aligned load needs a __m128i pointer
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
    // madd with 1 adds adjacent pairs of 16-bit values into 32-bit lanes
    vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
    vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
}
// do horizontal sum of 4 partial sums and store in scalar int
vsum = _mm_add_epi32(vsuml, vsumh);
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
sum = _mm_cvtsi128_si32(vsum);
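The loop above can be wrapped into a self-contained function. The following is a sketch under a few assumptions not stated in the answer: the name sum_u8_madd is hypothetical, unaligned loads are used so the input needs no particular alignment, a scalar tail handles lengths that are not a multiple of 16, and the total is assumed to fit in a signed 32-bit int (roughly up to 8 million bytes of value 255).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of n unsigned bytes as a 32-bit int. */
int sum_u8_madd(const uint8_t *x, size_t n)
{
    assert(x != NULL);
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsuml = zero;
    __m128i vsumh = zero;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);
        __m128i vh = _mm_unpackhi_epi8(v, zero);
        /* Multiply by 1 and add adjacent 16-bit pairs into 32-bit lanes. */
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, ones));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, ones));
    }

    /* Horizontal sum of the four 32-bit partial sums. */
    __m128i vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    int sum = _mm_cvtsi128_si32(vsum);

    /* Scalar tail for the remaining n % 16 bytes. */
    for (; i < n; ++i)
        sum += x[i];
    return sum;
}
```

Comparing the result against a plain scalar loop over the same data is a cheap way to check the vector path.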