NEON, SSE and interleaving loads vs shuffles

Views: 34 Published: 2021/11/17 21:55:10 Tags: arm x86-64 sse neon

Problem description

I'm trying to understand the comment made by "Iwillnotexist Idonotexist" at SIMD optimization of cvtColor using ARM NEON intrinsics:

... why you don't use the ARM NEON intrinsics that map to the VLD3 instruction? That spares you all of the shuffling, both simplifying and speeding up the code. The Intel SSE implementation requires shuffles because it lacks 2/3/4-way deinterleaving load instructions, but you shouldn't pass on them when they are available.

The trouble I am having is the solution offers code that is non-interleaved, and it performs fused multiplies on floating points. I'm trying to separate the two and understand just the interleaved loads.

According to the other question's comment and Coding for NEON - Part 1: Load and Stores, the answer is probably going to use VLD3.

Unfortunately, I'm just not seeing it (probably because I'm less familiar with NEON and its intrinsic functions). It seems like VLD3 basically produces 3 outputs for each input, so my mental model is confused.

Given the following SSE intrinsics, which operate on data in BGR BGR BGR BGR... format that needs a shuffle into BBBB GGGG RRRR ...:

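#include <tmmintrin.h>  // SSSE3, needed for _mm_shuffle_epi8
const byte* data = ...;  // assume 16-byte aligned
// Shuffle control: pick every third byte, B bytes first, then G, then R.
const __m128i mask = _mm_setr_epi8(0,3,6,9,12,15,1,4,7,10,13,2,5,8,11,14);
__m128i a = _mm_shuffle_epi8(_mm_load_si128((const __m128i*)(data)), mask);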

How do we perform the interleaved loads using NEON intrinsics so that we don't need the SSE shuffles?

Also note... I'm interested in intrinsics and not ASM. I can use ARM's intrinsics on Windows Phone, Windows Store, and Linux powered devices under MSVC, ICC, Clang, etc. I can't do that with ASM, and I'm not trying to specialize the code 3 times (Microsoft 32-bit ASM, Microsoft 64-bit ASM and GCC ASM).
