ARM Neon in C: How to combine different 128bit data types while using intrinsics?


Question

TL;DR

For ARM intrinsics, how do you feed a 128-bit variable of type uint8x16_t into a function expecting uint16x8_t?

Extended version

Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code looks like this:

for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
         *p_out = min(min(p_in[0],p_in[1]),min(p_in[inStride],p_in[inStride + 1]) );
         p_out++;
         p_in+=2;
    }
}

Where both rows and cols are multiples of 2. I call "stride" the step in bytes that it takes to go from one pixel to the pixel immediately below it in the image.

Now I want to vectorize this. The idea is:

  1. take 2 consecutive rows of pixels
  2. load 16 bytes into a from the top row, and load the 16 bytes immediately below into b
  3. compute the minimum, byte by byte, between a and b. Store it in a.
  4. create a copy of a, shifting it right by 1 byte (8 bits). Store it in b.
  5. compute the minimum, byte by byte, between a and b. Store it in a.
  6. store every second byte of a in the output image (this discards half of the bytes)
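To make the six steps concrete, here is a minimal portable C model of them for one block of 16 pixels. It is only a sketch: the helper names (min_bytes16, shift_lanes_right8, min2x2_block16_model) are made up for illustration, and the "shift right by 1 byte" of step 4 is modelled as a shift by 8 bits inside eight little-endian 16-bit lanes, which is an assumption about how the vector shift would be applied.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte-wise minimum of two 16-byte blocks
   (models what vminq_u8 does on a 128-bit register). */
static void min_bytes16(const uint8_t *x, const uint8_t *y, uint8_t *out) {
    for (int i = 0; i < 16; ++i)
        out[i] = x[i] < y[i] ? x[i] : y[i];
}

/* Hypothetical helper: view the block as eight little-endian 16-bit lanes
   and shift each lane right by 8 bits, so byte 2i+1 lands in byte 2i. */
static void shift_lanes_right8(const uint8_t *in, uint8_t *out) {
    for (int i = 0; i < 8; ++i) {
        uint16_t lane = (uint16_t)(in[2 * i] | (in[2 * i + 1] << 8));
        lane = (uint16_t)(lane >> 8);
        out[2 * i]     = (uint8_t)(lane & 0xFF);
        out[2 * i + 1] = (uint8_t)(lane >> 8);  /* always 0, discarded in step 6 */
    }
}

/* Steps 1 to 6 for one 16x2 block of input pixels: writes 8 output pixels. */
void min2x2_block16_model(const uint8_t *top, const uint8_t *bottom,
                          uint8_t *out8) {
    uint8_t a[16], b[16];
    memcpy(a, top, 16);          /* step 2: load the top row into a        */
    memcpy(b, bottom, 16);       /*         and the row below it into b    */
    min_bytes16(a, b, a);        /* step 3: vertical minimum               */
    shift_lanes_right8(a, b);    /* step 4: odd bytes move to even slots   */
    min_bytes16(a, b, a);        /* step 5: horizontal minimum             */
    for (int i = 0; i < 8; ++i)  /* step 6: keep every second byte         */
        out8[i] = a[2 * i];
}
```

Only the even bytes of a are meaningful after step 5, which is why the contents of the odd bytes do not matter.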

I want to write this using Neon intrinsics. The good news is that, for each step, there exists an intrinsic that matches it.

For example, at point 3 one can use (from here):

uint8x16_t  vminq_u8(uint8x16_t a, uint8x16_t b);

And at point 4 one can use one of the following with a shift of 8 bits (from here):

uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);

That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13 and 15, since they will be discarded from the final result anyway. (The correctness of this has been verified and it's not the point of the question.)

However, the output of vminq_u8 is of type uint8x16_t, which is not compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, while I have been told that the problem cannot be reliably addressed using a union (although that answer refers to C++, and in fact type punning is allowed in C), nor by casting through pointers, because this would break the strict aliasing rule.

What is the way to combine different data types while using ARM Neon intrinsics?

Recommended answer

For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operators.

In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.

So, assuming a and b are declared as:

uint8x16_t a, b;

Your point 4 can be written as (*):

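/* vshrq_n_u16 is the plain logical shift; the vrshrq_n_u16 listed in the question
   rounds, adding 1 to the result whenever the low byte of a lane is >= 128 */
b = vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(a), 8));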
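Putting the pieces together, the whole downscaling loop might look roughly like the sketch below. Several things here are assumptions rather than part of the question: the function name downscale_min_2x2 and its parameter list are hypothetical, vmovn_u16 is used for step 6 (it keeps the low byte of each 16-bit lane, which are the even bytes on the usual little-endian targets), and a scalar loop handles leftover columns and builds without Neon. It also uses the non-rounding vshrq_n_u16, because the rounding vrshrq_n_u16 would add 1 to the shifted value whenever the low byte of a lane is 128 or more.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define DOWNSCALE_HAVE_NEON 1
#else
#define DOWNSCALE_HAVE_NEON 0
#endif

/* Hypothetical function: 2x2 minimum downscale of an 8-bit greyscale image.
   rows and cols are assumed to be multiples of 2; strides are in bytes. */
void downscale_min_2x2(const uint8_t *in, size_t in_stride,
                       uint8_t *out, size_t out_stride,
                       size_t rows, size_t cols)
{
    for (size_t y = 0; y + 1 < rows; y += 2) {
        const uint8_t *top = in + y * in_stride;
        const uint8_t *bot = top + in_stride;
        uint8_t *dst = out + (y / 2) * out_stride;
        size_t x = 0;
#if DOWNSCALE_HAVE_NEON
        for (; x + 16 <= cols; x += 16) {
            uint8x16_t a = vld1q_u8(top + x);            /* step 2 */
            uint8x16_t b = vld1q_u8(bot + x);
            a = vminq_u8(a, b);                          /* step 3 */
            /* step 4: reinterpret as 8 x u16, shift each lane by 8 bits,
               reinterpret back; no data is converted, only the type changes */
            b = vreinterpretq_u8_u16(
                    vshrq_n_u16(vreinterpretq_u16_u8(a), 8));
            a = vminq_u8(a, b);                          /* step 5 */
            /* step 6: narrow each u16 lane to its low byte, store 8 bytes */
            vst1_u8(dst + x / 2, vmovn_u16(vreinterpretq_u16_u8(a)));
        }
#endif
        /* Scalar path: remaining columns, or everything without Neon. */
        for (; x + 1 < cols; x += 2) {
            uint8_t m0 = top[x] < top[x + 1] ? top[x] : top[x + 1];
            uint8_t m1 = bot[x] < bot[x + 1] ? bot[x] : bot[x + 1];
            dst[x / 2] = m0 < m1 ? m0 : m1;
        }
    }
}
```

A call such as downscale_min_2x2(inBuffer, inStride, outBuffer, outStride, rows, cols) would then replace the plain C loop from the question.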

However, note that unfortunately vreinterpret does not address data types that are arrays of vector types; see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
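For that array-of-vectors case, one possible approach is to split the 128-bit register into its two 64-bit halves with vget_low_u8 and vget_high_u8, and to go back with vcombine_u8. The sketch below is only an illustration: the helper names split_u8x16, join_u8x8x2 and roundtrip_pair are hypothetical, and the memcpy branch exists only so that the file still builds on targets without Neon.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define PAIR_HAVE_NEON 1

/* Hypothetical helper: uint8x16_t -> uint8x8x2_t (low half, high half). */
static inline uint8x8x2_t split_u8x16(uint8x16_t v) {
    uint8x8x2_t r;
    r.val[0] = vget_low_u8(v);
    r.val[1] = vget_high_u8(v);
    return r;
}

/* Hypothetical helper: uint8x8x2_t -> uint8x16_t. */
static inline uint8x16_t join_u8x8x2(uint8x8x2_t p) {
    return vcombine_u8(p.val[0], p.val[1]);
}
#else
#define PAIR_HAVE_NEON 0
#endif

/* Round-trips 16 bytes through the pair type; dst ends up equal to src. */
void roundtrip_pair(const uint8_t src[16], uint8_t dst[16]) {
#if PAIR_HAVE_NEON
    vst1q_u8(dst, join_u8x8x2(split_u8x16(vld1q_u8(src))));
#else
    memcpy(dst, src, 16);  /* non-Neon build: nothing to convert */
#endif
}
```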

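(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):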

__m128i b = _mm_srli_si128(a,1);
