ARM Neon in C: How to combine different 128-bit data types while using intrinsics?
Question
TL;DR
For ARM intrinsics, how do you feed a 128-bit variable of type uint8x16_t into a function expecting uint16x8_t?
Extended version
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code looks like this:
for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
        *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
        p_out++;
        p_in += 2;
    }
}
Here both rows and cols are multiples of 2. I call "stride" the step in bytes needed to go from one pixel to the pixel immediately below it in the image.
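The loop above calls min(), which standard C does not provide for integers, and it leaves the surrounding declarations implicit. A minimal self-contained sketch of the same scalar loop follows; the helper min_u8 and the wrapper name downscale_min_2x2 are hypothetical names chosen for illustration, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Helper (not in the original snippet): minimum of two bytes. */
static inline uint8_t min_u8(uint8_t x, uint8_t y)
{
    return x < y ? x : y;
}

/* Hypothetical wrapper around the loop from the question.
   Strides are in bytes; rows and cols must both be even. */
static void downscale_min_2x2(const uint8_t *inBuffer, int inStride,
                              uint8_t *outBuffer, int outStride,
                              int rows, int cols)
{
    assert(rows % 2 == 0 && cols % 2 == 0);
    for (int y = 0; y < rows; y += 2) {
        uint8_t *p_out = outBuffer + (y / 2) * outStride;
        const uint8_t *p_in = inBuffer + y * inStride;
        for (int x = 0; x < cols; x += 2) {
            /* Minimum over the 2x2 box whose top-left pixel is p_in[0]. */
            *p_out = min_u8(min_u8(p_in[0], p_in[1]),
                            min_u8(p_in[inStride], p_in[inStride + 1]));
            p_out++;
            p_in += 2;
        }
    }
}
```

A scalar version like this is also handy as a reference to compare a vectorized version against.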
Now I want to vectorize this. The idea is:
1. Take 2 consecutive rows of pixels.
2. Load 16 bytes into a from the top row, and load the 16 bytes immediately below into b.
3. Compute the minimum byte by byte between a and b. Store it in a.
4. Create a copy of a, shifting it right by 1 byte (8 bits). Store it in b.
5. Compute the minimum byte by byte between a and b. Store it in a.
6. Store every second byte of a in the output image (this discards half of the bytes).
I want to write this using Neon intrinsics. The good news is that for each step there exists an intrinsic that matches it.
For example, at point 3 one can use:
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following, with a shift of 8 bits:
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That works because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13, 15: they will be discarded from the final result anyway. (The correctness of this has been verified and is not the point of the question.)
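To illustrate why the odd bytes can be ignored, here is a portable C sketch that emulates the shift of point 4 on a 16-byte vector viewed as eight 16-bit lanes. It assumes little-endian lane layout and a plain truncating shift; the rounding variants (vrshrq_n_*) listed above add a rounding constant before shifting, which this sketch does not model. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates a truncating right shift by 8 on eight little-endian 16-bit lanes. */
static void shift_u16_lanes_right_8(const uint8_t in[16], uint8_t out[16])
{
    for (int lane = 0; lane < 8; lane++) {
        uint16_t v = (uint16_t)(in[2 * lane] | (in[2 * lane + 1] << 8));
        v = (uint16_t)(v >> 8);
        out[2 * lane] = (uint8_t)(v & 0xFF);   /* the old odd byte lands here */
        out[2 * lane + 1] = (uint8_t)(v >> 8); /* always 0 after the shift */
    }
}

/* Points 4 to 6 on one vector: shift, take the minimum, keep even bytes.
   The minimum is computed at even positions only, since the odd
   positions would be discarded anyway. */
static void min_of_byte_pairs(const uint8_t a[16], uint8_t dst[8])
{
    uint8_t b[16];
    assert(a != NULL && dst != NULL);
    shift_u16_lanes_right_8(a, b);
    for (int i = 0; i < 8; i++) {
        uint8_t even = a[2 * i];
        uint8_t shifted = b[2 * i];
        dst[i] = even < shifted ? even : shifted;
    }
}
```

After the shift, every even byte holds its right-hand neighbour, so the byte-wise minimum at the even positions is the minimum of each horizontal pair.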
However, the output of vminq_u8 is of type uint8x16_t, which is not compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with a templated data structure, while I have been told that the problem cannot be reliably addressed using a union (although that answer refers to C++, and in C type punning is in fact allowed), nor by casting through pointers, because that would break the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?
Recommended answer
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as (*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
However, note that unfortunately this does not cover data types that are arrays of vector types; see "ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?"
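Putting the cast together with the steps from the question, one 16-byte block could be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: the names min2x2_block16 and min2x2_rows are hypothetical, a little-endian target is assumed, the sketch uses the truncating vshrq_n_u16 rather than the rounding vrshrq_n_u16 shown above, and it uses vmovn_u16 to keep the low byte of each 16-bit lane for the final step. A scalar fallback is included only so that the sketch also builds on targets without Neon.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Reduce two 16-byte row segments to 8 output pixels (2x2 minimum). */
static void min2x2_block16(const uint8_t *top, const uint8_t *bottom,
                           uint8_t *dst)
{
    uint8x16_t a = vld1q_u8(top);
    uint8x16_t b = vld1q_u8(bottom);
    a = vminq_u8(a, b); /* vertical minimum */
    /* View as 8 x u16, shift each lane right by 8, view as bytes again. */
    b = vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(a), 8));
    a = vminq_u8(a, b); /* horizontal minimum, valid in the even bytes */
    /* Narrow each 16-bit lane to its low byte, i.e. keep the even bytes. */
    vst1_u8(dst, vmovn_u16(vreinterpretq_u16_u8(a)));
}
#else
/* Scalar fallback (assumption: only here so non-Neon builds still work). */
static void min2x2_block16(const uint8_t *top, const uint8_t *bottom,
                           uint8_t *dst)
{
    for (int i = 0; i < 8; i++) {
        uint8_t l = top[2 * i] < bottom[2 * i] ? top[2 * i] : bottom[2 * i];
        uint8_t r = top[2 * i + 1] < bottom[2 * i + 1] ? top[2 * i + 1]
                                                       : bottom[2 * i + 1];
        dst[i] = l < r ? l : r;
    }
}
#endif

/* Hypothetical driver for one pair of rows; cols must be a multiple of 16. */
static void min2x2_rows(const uint8_t *top, const uint8_t *bottom,
                        uint8_t *dst, int cols)
{
    assert(cols % 16 == 0);
    for (int x = 0; x < cols; x += 16)
        min2x2_block16(top + x, bottom + x, dst + x / 2);
}
```

The vreinterpretq calls only change how the register is typed, so the conversions themselves should not cost any instructions.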
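(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):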
__m128i b = _mm_srli_si128(a,1);