NEON, SSE and interleaving loads vs shuffles


Question

I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:

... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.

The trouble I am having is that the solution offered there uses non-interleaved code, and it performs fused multiplies on floating-point values. I'm trying to separate the two and understand just the interleaved loads.

According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.

Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.

Given the following SSE intrinsics, which operate on data in BGR BGR BGR BGR... format that needs a shuffle into BBBB GGGG RRRR ...:

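const byte* data = ...;  // assume 16-byte aligned
// Byte-shuffle mask: gathers the B bytes first, then the G bytes, then the R bytes
// of one 16-byte block of packed BGR data.
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((const __m128i*)(data)),mask);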

How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?

Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code 3 times (Microsoft 32-bit ASM, Microsoft 64-bit ASM and GCC ASM).

Recommended answer

According to this page:

The VLD3 intrinsic you need is:

int8x8x3_t  vld3_s8(__transfersize(24) int8_t const * ptr);
// VLD3.8 {d0, d1, d2}, [r0]

If the address pointed to by ptr holds this data:

0x00: 33221100
0x04: 77665544
0x08: bbaa9988
0x0c: ffddccbb
0x10: 76543210
0x14: fedcba98

you will end up with these values in the registers:

d0: ba54ffbb99663300
d1: dc7610ccaa774411
d2: fe9832ddbb885522
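
The dump above can be checked with a small sketch. It assumes the memory words are printed little-endian and uses vld3_u8, the unsigned sibling of vld3_s8, because the bytes are treated as raw data. The helper names deinterleave3x8 and as_u64_le are hypothetical, and the scalar branch is only a model of the load so the code also builds on targets where the compiler does not define __ARM_NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split 24 interleaved bytes into three 8-byte lanes,
 * the way VLD3.8 {d0, d1, d2}, [r0] does. */
static void deinterleave3x8(const uint8_t *src,
                            uint8_t d0[8], uint8_t d1[8], uint8_t d2[8])
{
    assert(src != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x3_t v = vld3_u8(src);   /* one de-interleaving load of 24 bytes */
    vst1_u8(d0, v.val[0]);          /* bytes 0, 3, 6, ... */
    vst1_u8(d1, v.val[1]);          /* bytes 1, 4, 7, ... */
    vst1_u8(d2, v.val[2]);          /* bytes 2, 5, 8, ... */
#else
    /* Scalar model of the same load for non-NEON builds. */
    for (int i = 0; i < 8; ++i) {
        d0[i] = src[3 * i + 0];
        d1[i] = src[3 * i + 1];
        d2[i] = src[3 * i + 2];
    }
#endif
}

/* Hypothetical helper: read 8 bytes as one little-endian 64-bit value,
 * matching how the d registers are printed in the dump. */
static uint64_t as_u64_le(const uint8_t b[8])
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | b[i];
    return v;
}
```

Feeding it the 24 bytes 00 11 22 33 44 55 66 77 88 99 aa bb bb cc dd ff 10 32 54 76 98 ba dc fe (the dump read byte by byte) gives d0 = 0xba54ffbb99663300, d1 = 0xdc7610ccaa774411 and d2 = 0xfe9832ddbb885522. So the "3 outputs" are one register per channel, each filled from every third byte of the input.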

The int8x8x3_t structure is defined as:

struct int8x8x3_t
{
   int8x8_t val[3];
};
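
Applied to the BGR case from the question, the three val[] members become the B, G and R planes directly. The following is a minimal sketch, not a definitive implementation: it assumes 8-bit unsigned channels, uses the 128-bit variant vld3q_u8 (16 pixels, 48 bytes per load), and the function name bgr_to_planes is hypothetical. Pixels that do not fill a whole 16-pixel block, and all pixels on builds without __ARM_NEON, go through a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: convert n packed BGR pixels (3 * n bytes) into
 * three separate planes of n bytes each. */
static void bgr_to_planes(const uint8_t *bgr, size_t n,
                          uint8_t *b, uint8_t *g, uint8_t *r)
{
    assert(bgr != NULL && b != NULL && g != NULL && r != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 16 pixels per iteration: the load itself does the de-interleaving,
     * so no shuffle mask is needed. */
    for (; i + 16 <= n; i += 16) {
        uint8x16x3_t px = vld3q_u8(bgr + 3 * i);
        vst1q_u8(b + i, px.val[0]);   /* BBBB... */
        vst1q_u8(g + i, px.val[1]);   /* GGGG... */
        vst1q_u8(r + i, px.val[2]);   /* RRRR... */
    }
#endif
    /* Scalar tail (and full fallback when NEON is not available). */
    for (; i < n; ++i) {
        b[i] = bgr[3 * i + 0];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
}
```

In real code the planes would usually stay in px.val[0..2] for further arithmetic rather than being stored straight back to memory; the stores here only make the result easy to inspect.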
