Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?
Question
I want to calculate y = ax + b, where x and y are pixel values [i.e., bytes with a value range of 0~255], while a and b are floats.

I need to apply this formula to each pixel in the image, and a and b differ from pixel to pixel. Direct calculation in C++ is slow, so I am interested in the SSE2 instructions available from C++.
After searching, I found that multiplication and addition of floats with SSE2 are just _mm_mul_ps and _mm_add_ps. But first I need to convert x from byte to float (4 bytes).

The question is: after I load the data from the byte-data source (_mm_load_si128), how can I convert the data from byte to float?

Solution

a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern, or you can generate them.

Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16-bit integers, might be worse than just scalar ops.
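For reference, here is a minimal scalar baseline of the operation being vectorized, with saturation to [0..255] and rounding to nearest, approximately matching the packed-conversion semantics used below. The per-pixel a[] and b[] arrays are an assumption for illustration; they are not part of the original question.

```cpp
#include <cstdint>
#include <cstddef>

// Scalar y = a*x + b per pixel, clamped to [0..255], rounding to nearest.
void scale_scalar(uint8_t *buf, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float y = a[i] * buf[i] + b[i];
        if (y < 0.0f)   y = 0.0f;
        if (y > 255.0f) y = 255.0f;
        buf[i] = (uint8_t)(y + 0.5f);
    }
}
```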
Fixed point
If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math (i.e. integers that represent value * 2^scale). SSE/AVX don't have an 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.

There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a*2^4 and b*2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.

```cpp
float a, b;
const int logascale = 4, logbscale = 1;
const int ascale = 1 << logascale;  // fixed point scale for a: 2^4
const int bscale = 1 << logbscale;  // fixed point scale for b: 2^1

const __m128i brescale = _mm_set1_epi8(1 << (logascale - logbscale));  // re-scale b to match a in the 16bit temporary result

for (i = 0; i < n; i += 16) {
    //__m128i avec = get_scaled_a(i);
    //__m128i bvec = get_scaled_b(i);
    //__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
    //__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);
    __m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) );  // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.

    __m128i block = _mm_load_si128(&buf[i]);  // call this { v[0] .. v[15] }

    __m128i lo = _mm_unpacklo_epi8(block, brescale);  // { v[0], 8, v[1], 8, ... }
    __m128i hi = _mm_unpackhi_epi8(block, brescale);  // { v[8], 8, v[9], 8, ... }
    lo = _mm_maddubs_epi16(lo, abvec);  // first arg is unsigned bytes, 2nd arg is signed bytes
    hi = _mm_maddubs_epi16(hi, abvec);
    // lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }
    lo = _mm_srli_epi16(lo, logascale);  // truncate from scaled fixed-point to integer
    hi = _mm_srli_epi16(hi, logascale);
    // and re-pack.  Logical, not arithmetic right shift means sign bits can't be set
    block = _mm_packuswb(lo, hi);
    _mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop
```
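As a side note on the integer-promotion caveat flagged in the abvec comment above: one way to sidestep it is to reduce each scaled coefficient to an unsigned byte before combining. This helper is an illustrative assumption, not part of the original answer.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Build the alternating {a, b} byte pattern for _mm_maddubs_epi16 without
// relying on sign-extension behavior during integer promotion.  Assumes the
// scaled values fit in int8_t; this is a sketch, not validated production code.
static inline __m128i make_abvec(float a, float b, int ascale, int bscale) {
    uint8_t a8 = (uint8_t)(int8_t)(a * ascale);   // even bytes: scaled a
    uint8_t b8 = (uint8_t)(int8_t)(b * bscale);   // odd bytes: scaled b
    return _mm_set1_epi16((int16_t)(uint16_t)(((unsigned)b8 << 8) | a8));
}
```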
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if it has a magnitude greater than 8 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.

Since b is being added, not multiplied, its useful range is much larger, and its fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.

For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:

- _mm_mulhi_epu16: unsigned 16b*16b -> high16 (bits [31:16]). Useful if a can't be negative.
- _mm_mulhi_epi16: signed 16b*16b -> high16 (bits [31:16]).
- _mm_mulhrs_epi16: signed 16b*16b -> bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
- _mm_mullo_epi16: signed 16b*16b -> low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.
To use these, you'd get scaled 16b vectors of a and b values, then:

- unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
- left shift the words by 7.
- multiply by your a vector of 16b words, taking the upper half of each 16*16 -> 32 result (e.g. one of the mulhi instructions above).
- right shift here if you wanted different scales for a and b, to get more fractional precision for a.
- add b to that.
- right shift to do the final truncation back from fixed point to [0..255].

A compilable sketch of this sequence follows the list.
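This is an untested sketch of the recipe above, not code from the original answer. It assumes one concrete split of scales (the answer mentions 2^7; here a is pre-scaled by 2^11, so |a| < 16 with 11 fraction bits, and b by 2^3), chosen so that _mm_mulhrs_epi16 on x<<7 yields a*x*2^3, landing at the same scale as b. The per-pixel a_fx/b_fx coefficient arrays are hypothetical names.

```cpp
#include <tmmintrin.h>   // SSSE3 for _mm_mulhrs_epi16
#include <cstdint>
#include <cstddef>

// y = a*x + b in 16b fixed point: a_fx[i] = a*2^11, b_fx[i] = b*2^3, one word per pixel.
void scale_fixed16(uint8_t *buf, const int16_t *a_fx, const int16_t *b_fx, size_t n) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m128i block = _mm_load_si128((const __m128i*)&buf[i]);  // 16 pixels (assumes aligned buf)
        __m128i lo = _mm_unpacklo_epi8(block, zero);   // zero-extend: words in [0..255]
        __m128i hi = _mm_unpackhi_epi8(block, zero);
        lo = _mm_slli_epi16(lo, 7);                    // x * 2^7 (max 32640, still fits in signed 16b)
        hi = _mm_slli_epi16(hi, 7);
        // (x*2^7 * a*2^11) >> 15 with rounding  =  a*x * 2^3
        lo = _mm_mulhrs_epi16(lo, _mm_loadu_si128((const __m128i*)&a_fx[i]));
        hi = _mm_mulhrs_epi16(hi, _mm_loadu_si128((const __m128i*)&a_fx[i + 8]));
        lo = _mm_adds_epi16(lo, _mm_loadu_si128((const __m128i*)&b_fx[i]));      // + b * 2^3, saturating
        hi = _mm_adds_epi16(hi, _mm_loadu_si128((const __m128i*)&b_fx[i + 8]));
        lo = _mm_srai_epi16(lo, 3);                    // truncate fixed point back to integer
        hi = _mm_srai_epi16(hi, 3);
        _mm_store_si128((__m128i*)&buf[i], _mm_packuswb(lo, hi));  // saturate to [0..255]
    }
    // scalar cleanup for the last n%16 pixels omitted
}
```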
With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.

If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8 bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.
Expand bytes to floats

If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.

For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.

This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw / punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)

```cpp
char *buf;
// const __m128i zero = _mm_setzero_si128();
for (i = 0; i < n; i += 16) {
    __m128 a = get_a(i);
    __m128 b = get_b(i);

    // IDK why there isn't an intrinsic for using `pmovzx` as a load, since it
    // only needs an m32 or m64 operand, not m128.  (unlike punpck*)
    int32_t tmp;
    memcpy(&tmp, buf + i, 4);   // load 4B at once; memcpy instead of an invalid pointer-to-__m128i cast
    __m128i unsigned_dwords = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(tmp));
    __m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
    floats = _mm_fmadd_ps(floats, a, b);  // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
    // or without FMA, do this with _mm_mul_ps and _mm_add_ps
    unsigned_dwords = _mm_cvtps_epi32(floats);

    // repeat 3 more times for buf+4, buf+8, and buf+12, then:

    __m128i packed01 = _mm_packs_epi32(dwords0, dwords1);  // SSE2 (packssdw)
    __m128i packed23 = _mm_packs_epi32(dwords2, dwords3);
    // packuswb wants SIGNED input, so do signed saturation on the first step
    // saturate into [0..255] range
    __m128i packedbytes = _mm_packus_epi16(packed01, packed23);  // SSE2
    _mm_store_si128((__m128i*)(buf + i), packedbytes);  // or storeu if buf isn't aligned.
}
// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
```
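As a follow-up to the FMA comment in the block above, here is a hedged sketch of what a 256b version could look like. It assumes AVX2+FMA; get_a8/get_b8 are hypothetical coefficient loaders (stubbed so the sketch compiles), and it stores 8 pixels per iteration via a 128b pack instead of the full four-vector repack described in the comments.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hypothetical coefficient loaders, stubbed so the sketch is self-contained.
static inline __m256 get_a8(size_t) { return _mm256_set1_ps(1.0f); }
static inline __m256 get_b8(size_t) { return _mm256_set1_ps(0.0f); }

void scale_avx2(uint8_t *buf, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 a = get_a8(i);
        __m256 b = get_b8(i);
        // vpmovzxbd: zero-extend 8 bytes -> 8 dwords in one instruction
        __m256i dwords = _mm256_cvtepu8_epi32(
            _mm_loadl_epi64((const __m128i*)(buf + i)));
        __m256 f = _mm256_cvtepi32_ps(dwords);
        f = _mm256_fmadd_ps(f, a, b);          // a*x + b fused, one rounding step
        dwords = _mm256_cvtps_epi32(f);
        // For full throughput you'd collect four such dword vectors, then
        // _mm256_packs_epi32 and _mm256_packus_epi16.  Both pack within 128b
        // lanes, so a final _mm256_permute4x64_epi64 (vpermq) is needed to
        // restore pixel order.  Here we keep it short with a 128b pack:
        __m128i lo128 = _mm256_castsi256_si128(dwords);
        __m128i hi128 = _mm256_extracti128_si256(dwords, 1);
        __m128i words = _mm_packs_epi32(lo128, hi128);    // 8 words, signed sat
        __m128i bytes = _mm_packus_epi16(words, words);   // 8 bytes, [0..255]
        _mm_storel_epi64((__m128i*)(buf + i), bytes);
    }
    // scalar cleanup for the last n%8 pixels omitted
}
```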
The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packusdw is only useful if your final goal is uint16_t, rather than further packing.
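A tiny self-contained check of that claim (the input values are made up for illustration): a negative intermediate survives the signed dword->word pack unchanged, and only gets clamped by the final unsigned word->byte pack.

```cpp
#include <emmintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    __m128i dw = _mm_setr_epi32(-42, 0, 300, 255);  // e.g. results of _mm_cvtps_epi32
    __m128i w  = _mm_packs_epi32(dw, dw);           // signed saturation: all four fit in int16
    __m128i b  = _mm_packus_epi16(w, w);            // unsigned saturation to [0..255]
    alignas(16) uint8_t out[16];
    _mm_store_si128((__m128i*)out, b);
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);  // prints: 0 0 255 255
}
```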
The last CPU to not have SSE4.1 was Intel Conroe/Merom (first gen Core2, from before late 2007), and AMD pre-Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2 and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register). pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/Merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beat 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.

See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.