ARM Neon in C: How to combine different 128bit data types while using intrinsics?


Question

TL;DR

For arm intrinsics, how do you feed a 128bit variable of type uint8x16_t into a function expecting uint16x8_t?


Extended version

Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor 2x. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code will look like this:

for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
         *p_out = min(min(p_in[0],p_in[1]),min(p_in[inStride],p_in[inStride + 1]) );
         p_out++;
         p_in+=2;
    }
}

Here both rows and cols are multiples of 2. By "stride" I mean the step, in bytes, needed to go from one pixel to the pixel immediately below it in the image.
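For reference, the loop above can be wrapped into a self-contained function. This is only a sketch: the function name `downscale_min_2x2`, the helper `min_u8`, and the parameter names are assumptions made for illustration and are not part of the original question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: minimum of two bytes. */
static inline uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Hypothetical reference implementation of the 2x2 minimum downscale.
 * Strides are in bytes; rows and cols must both be multiples of 2. */
void downscale_min_2x2(const uint8_t *in, size_t in_stride,
                       uint8_t *out, size_t out_stride,
                       size_t rows, size_t cols)
{
    assert(rows % 2 == 0 && cols % 2 == 0);
    for (size_t y = 0; y < rows; y += 2) {
        const uint8_t *p_in = in + y * in_stride;     /* top row of the pair */
        uint8_t *p_out = out + (y / 2) * out_stride;  /* matching output row */
        for (size_t x = 0; x < cols; x += 2) {
            uint8_t top = min_u8(p_in[x], p_in[x + 1]);
            uint8_t bottom = min_u8(p_in[in_stride + x], p_in[in_stride + x + 1]);
            p_out[x / 2] = min_u8(top, bottom);
        }
    }
}
```

A scalar version like this is also handy as a ground truth when checking a vectorized implementation.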

Now I want to vectorize this. The idea is:

  1. take 2 consecutive rows of pixels
  2. load 16 bytes in a from the top row, and load the 16 bytes immediately below in b
  3. compute the minimum byte by byte between a and b. Store in a.
  4. create a copy of a shifting it right by 1 byte (8 bits). Store it in b.
  5. compute the minimum byte by byte between a and b. Store in a.
  6. store every second byte of a in the output image (discards half of the bytes)
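The data flow of these six steps can be checked without any SIMD hardware by emulating one 16-byte block in plain C. The sketch below is an illustration only: the name `min2x2_steps_emulated` is hypothetical, and step 4 is modelled as a right shift by 8 bits inside each 16-bit lane, assuming the little-endian lane layout in which the even byte is the low half of the lane.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 };

/* Hypothetical scalar emulation of the six steps on one 16-byte block. */
void min2x2_steps_emulated(const uint8_t top[LANES],
                           const uint8_t bottom[LANES],
                           uint8_t out[LANES / 2])
{
    uint8_t a[LANES], b[LANES];
    assert(top && bottom && out);

    /* Steps 1-2: load 16 bytes from the top row and 16 from the row below. */
    for (int i = 0; i < LANES; ++i) { a[i] = top[i]; b[i] = bottom[i]; }

    /* Step 3: byte-wise minimum, stored in a. */
    for (int i = 0; i < LANES; ++i) if (b[i] < a[i]) a[i] = b[i];

    /* Step 4: treat a as eight 16-bit lanes and shift each right by 8 bits.
     * The odd byte of each pair lands in the even position; odd positions
     * become 0, which does not matter because they are discarded later. */
    for (int i = 0; i < LANES; i += 2) {
        uint16_t lane = (uint16_t)(a[i] | (a[i + 1] << 8));
        lane >>= 8;
        b[i] = (uint8_t)(lane & 0xFF);
        b[i + 1] = (uint8_t)(lane >> 8);
    }

    /* Step 5: byte-wise minimum again, stored in a. */
    for (int i = 0; i < LANES; ++i) if (b[i] < a[i]) a[i] = b[i];

    /* Step 6: keep every second byte (the even positions). */
    for (int i = 0; i < LANES / 2; ++i) out[i] = a[2 * i];
}
```

After step 5, each even position holds the minimum of its 2x2 box, so keeping only the even bytes gives the downscaled row.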

I want to write this using Neon intrinsics. The good news is that for each step there is an intrinsic that matches it.

For example, at point 3 one can use (from here):

uint8x16_t  vminq_u8(uint8x16_t a, uint8x16_t b);

And at point 4 one can use one of the following with a shift of 8 bits (from here):

uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);

That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13, 15, since they will be discarded from the final result anyway. (The correctness of this has been verified and it's not the point of the question.)

HOWEVER, the output of vminq_u8 is of type uint8x16_t, and it is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, while I have been told that the problem cannot be reliably addressed using a union (although that answer refers to C++, and in fact in C type punning IS allowed), nor by casting through pointers, because that would break the strict aliasing rule.

What is the way to combine different data types while using ARM Neon intrinsics?

Recommended answer

For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.

In some situations, you might want to treat a vector as having a different type, without changing its value. A set of intrinsics is provided to perform this type of conversion.

So, assuming a and b are declared as:

uint8x16_t a, b;

your point 4 can be written as (*):

b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8) );
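Putting the cast together with the other steps, one 16-pixel block could be processed roughly as follows. This is a sketch under stated assumptions, not the asker's actual code: the function name `min2x2_block16` is hypothetical, a little-endian lane layout is assumed, and a scalar fallback is included only so that the file also builds on non-Neon targets. Note that `vrshrq_n_u16` is the rounding variant of the shift, so this sketch uses the plain `vshrq_n_u16` instead, to keep rounding from adding 1 to the byte that ends up in the low half of each lane. The final narrowing with `vmovn_u16` keeps the low byte of each 16-bit lane, which is one way to store every second byte.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: reads 16 pixels from each of two rows and
 * writes the 8 pixels of the downscaled row. */
void min2x2_block16(const uint8_t *top, const uint8_t *bottom, uint8_t *out)
{
    assert(top && bottom && out);
    uint8x16_t a = vld1q_u8(top);      /* 16 bytes from the top row */
    uint8x16_t b = vld1q_u8(bottom);   /* 16 bytes from the row below */
    a = vminq_u8(a, b);                /* vertical minimum */
    /* Reinterpret as 8 x u16, shift each lane right by 8 bits (non-rounding),
     * then reinterpret back to 16 x u8. */
    b = vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(a), 8));
    a = vminq_u8(a, b);                /* horizontal minimum in even bytes */
    /* Narrow each u16 lane to its low byte, i.e. keep the even bytes. */
    vst1_u8(out, vmovn_u16(vreinterpretq_u16_u8(a)));
}
#else
/* Scalar fallback with the same result, for targets without Neon. */
void min2x2_block16(const uint8_t *top, const uint8_t *bottom, uint8_t *out)
{
    assert(top && bottom && out);
    for (int i = 0; i < 8; ++i) {
        uint8_t m0 = top[2 * i] < bottom[2 * i] ? top[2 * i] : bottom[2 * i];
        uint8_t m1 = top[2 * i + 1] < bottom[2 * i + 1] ? top[2 * i + 1]
                                                        : bottom[2 * i + 1];
        out[i] = m0 < m1 ? m0 : m1;
    }
}
#endif
```

The `vreinterpretq` calls only change how the compiler types the register, so they are not expected to generate extra instructions.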

However, note that unfortunately this does not address data types using an array of vector types, see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?

(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128-bit integer data type (namely __m128i):

__m128i b = _mm_srli_si128(a,1);
