How to extract bytes from an SSE2 __m128i structure?

Question

I'm a beginner with SIMD intrinsics, so I'll thank everyone for their patience in advance. I have an application involving absolute difference comparison of unsigned bytes (I'm working with greyscale images).

I tried AVX, more modern SSE versions etc, but eventually decided SSE2 seems sufficient and has the most support for individual bytes - please correct me if I'm wrong.

I have two questions: first, what's the right way to load 128-bit registers? I think I'm supposed to pass the load intrinsics data aligned to multiples of 128, but will that work with 2D array code like this:

greys = aligned_alloc(16, xres * sizeof(int8_t*));

for (uint32_t x = 0; x < xres; x++)
{
    greys[x] = aligned_alloc(16, yres * sizeof(int8_t*));
}

(The code above assumes xres and yres are the same, and are powers of two). Does this turn into a linear, unbroken block in memory? Could I then, as I loop, just keep passing addresses (incrementing them by 128) to the SSE2 load intrinsics? Or does something different need to be done for 2D arrays like this one?

My second question: once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i ? Looking through the Intel Intrinsics Guide, instructions that convert a vector type to a scalar one are rare. The closest I've found is int _mm_movemask_epi8 (__m128i a) but I don't quite understand how to use it.

Oh, and one third question - I assumed _mm_load_si128 only loads signed bytes? And I couldn't find any other byte loading function, so I guess you're just supposed to subtract 128 from each and account for it later?

I know these are basic questions for SIMD experts, but I hope this one will be useful to beginners like me. And if you think my whole approach to the application is wrong, or I'd be better off with more modern SIMD extensions, I'd love to know. I'd just like to humbly warn I've never worked with assembly and all this bit-twiddling stuff requires a lot of explication if it's to help me.

Nevertheless, I'm grateful for any clarification available.

In case it makes a difference: I'm targeting a low-power i7 Skylake architecture. But it'd be nice to have the application run on much older machines too (hence SSE2).

Recommended answer

The least obvious question first:

once I've done all my vector processing, how the heck do I extract the modified bytes from the __m128i

Extract the low 64 bits to an integer with int64_t _mm_cvtsi128_si64x(__m128i), or the low 32 bits with int _mm_cvtsi128_si32 (__m128i a).
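As a minimal sketch of the low-element case (the helper names low32, low64 and low_byte are hypothetical, chosen only for illustration): the snippet below uses _mm_cvtsi128_si64, which performs the same operation as the _mm_cvtsi128_si64x spelling mentioned above, and assumes a 64-bit build for the 64-bit variant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Low 32 bits of the vector, moved to an integer register (MOVD). */
static inline int32_t low32(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}

#if defined(__x86_64__) || defined(_M_X64)
/* Low 64 bits of the vector (MOVQ); only available in 64-bit builds. */
static inline int64_t low64(__m128i v)
{
    return _mm_cvtsi128_si64(v);
}
#endif

/* Hypothetical helper: byte 0 is the low 8 bits of the low 32-bit element. */
static inline uint8_t low_byte(__m128i v)
{
    return (uint8_t)_mm_cvtsi128_si32(v);
}
```

Truncating the 32-bit result is enough to get the lowest byte; other bytes need one of the options below.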

If you want other parts of the vector, not the low element, your options are:

  • Shuffle the vector to create a new __m128i with the data you want in the low element, and use the cvt intrinsics (MOVD or MOVQ in asm).

  • Use SSE2 int _mm_extract_epi16 (__m128i a, int imm8), or the similar SSE4.1 intrinsics for other element sizes, such as _mm_extract_epi64(v, 1). These instructions (PEXTRB/W/D/Q) are not the fastest, but if you only need one high element they're about equivalent to a separate shuffle and MOVD, with smaller machine code.

  • _mm_store_si128 to an aligned temporary array and access the members. Compilers will often optimize this into just a shuffle or a pextr* instruction if you compile with -msse4.1 or -march=haswell or whatever. The Q&A "print a __m128i variable" shows an example, including Godbolt compiler output showing _mm_store_si128 into an alignas(16) uint64_t tmp[2].

Or use union { __m128i v; int64_t i64[2]; } or something. Union-based type punning is legal in C99, but only as an extension in C++. This compiles the same as a tmp array, and is generally not easier to read.
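The three SSE2-only options for reaching a non-low element can be sketched roughly as follows. The function names (element3_via_shuffle, word5, byte_at) are made up for this illustration, and the element positions are arbitrary examples rather than anything the question requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdalign.h>   /* alignas (C11) */
#include <stdint.h>

/* Option 1: shuffle the wanted 32-bit element down to position 0, then MOVD. */
static inline int32_t element3_via_shuffle(__m128i v)
{
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3));
    return _mm_cvtsi128_si32(hi);
}

/* Option 2: PEXTRW. The element index must be a compile-time constant (0..7). */
static inline uint16_t word5(__m128i v)
{
    return (uint16_t)_mm_extract_epi16(v, 5);
}

/* Option 3: store all 16 bytes to an aligned temporary and index it.
   This is the simplest way to read an arbitrary byte with plain SSE2. */
static inline uint8_t byte_at(__m128i v, unsigned i)
{
    alignas(16) uint8_t tmp[16];
    _mm_store_si128((__m128i *)tmp, v);
    return tmp[i & 15];
}
```

If the whole result is headed for an output image anyway, storing the full vector straight into the destination buffer is usually simpler than extracting bytes one at a time.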

An alternative to the union that would also work in C++ is memcpy(&my_int64_local, 8 + (char*)&my_vector, 8); to extract the high half. That seems more complicated and less clear, and is more likely to be something a compiler wouldn't "see through". Compilers are usually pretty good at optimizing away a small fixed-size memcpy when it covers an entire variable, but this is just half of the variable.
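For completeness, the memcpy variant might look like the sketch below; high64_via_memcpy is a hypothetical name, and whether the copy is optimized away depends on the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>     /* memcpy */

/* Copy bytes 8..15 of the vector object into a 64-bit integer. */
static inline int64_t high64_via_memcpy(__m128i v)
{
    int64_t hi;
    memcpy(&hi, (const char *)&v + 8, sizeof hi);
    return hi;
}
```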

If the whole high half of a vector can go directly into memory unmodified (instead of being needed in an integer register), a smart compiler might optimize to use MOVHPS to store the high half of a __m128i with the above union stuff.

Or you can use _mm_storeh_pi((__m64*)dst, _mm_castsi128_ps(vec)). That only requires SSE1, and is more efficient than SSE4.1 pextrq on most CPUs. But don't do this for a scalar integer you're about to use again right away; if SSE4.1 isn't available, the compiler will likely emit an actual MOVHPS store followed by an integer reload, which usually isn't optimal. (And some compilers, like MSVC, don't optimize intrinsics.)
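A small sketch of the store-the-high-half approach, assuming the destination is an ordinary byte buffer (store_high_half is a name invented for this example):

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE1 header */
#include <stdint.h>

/* Store the high 8 bytes of vec to dst (MOVHPS).
   dst needs no particular alignment. */
static inline void store_high_half(uint8_t *dst, __m128i vec)
{
    _mm_storeh_pi((__m64 *)dst, _mm_castsi128_ps(vec));
}
```

The cast to a float vector is free: _mm_castsi128_ps only reinterprets the bits and generates no instruction.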

Does this turn into a linear, unbroken block in memory?

No, it's an array of pointers to separate blocks of memory, introducing an extra level of indirection vs. a proper 2D array. Don't do that.

Make one large allocation, and do the index calculation yourself (using array[x*yres + y]).

And yes, load data from it with _mm_load_si128, or loadu if you need to load from an offset.
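The single-allocation layout could be sketched like this. The helper names (alloc_image, pixel_index, load_block, load_any) are hypothetical, and the sketch assumes xres * yres is a multiple of 16, which holds for the power-of-two sizes in the question. Note that a 128-bit vector covers 16 bytes, so consecutive loads advance the address by 16, not 128.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>     /* aligned_alloc (C11) */

/* One contiguous, 16-byte-aligned block holding xres * yres greyscale bytes.
   aligned_alloc requires the size to be a multiple of the alignment. */
static uint8_t *alloc_image(size_t xres, size_t yres)
{
    size_t bytes = xres * yres;
    assert(bytes % 16 == 0);
    return aligned_alloc(16, bytes);
}

/* Element (x, y) lives at index x * yres + y. */
static inline size_t pixel_index(size_t x, size_t y, size_t yres)
{
    return x * yres + y;
}

/* Aligned load: only valid when the address is a multiple of 16. */
static inline __m128i load_block(const uint8_t *img, size_t index)
{
    assert((uintptr_t)(img + index) % 16 == 0);
    return _mm_load_si128((const __m128i *)(img + index));
}

/* Unaligned load: works at any byte offset. */
static inline __m128i load_any(const uint8_t *img, size_t index)
{
    return _mm_loadu_si128((const __m128i *)(img + index));
}
```

Either load still reads 16 bytes, so the caller has to make sure index + 16 does not run past the end of the allocation.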

assumed _mm_load_si128 only loads signed bytes

Signed or unsigned isn't an inherent property of a byte, it's only how you interpret the bits. You use the same load intrinsic for loading two 64-bit elements, or a 128-bit bitmap.

Use intrinsics that are appropriate for your data. It's a little bit like assembly language: everything is just bytes, and the machine will do what you tell it with your bytes. It's up to you to choose a sequence of instructions / intrinsics that produces meaningful results.
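To make that concrete for the unsigned-byte absolute difference mentioned in the question, here is one possible sketch. It is not taken from the answer: it relies on the unsigned saturating subtract _mm_subs_epu8, and the function names and the requirement that n be a multiple of 16 are assumptions of this example. The loads are the same ones used for any other integer data; only the arithmetic intrinsics treat the bytes as unsigned, so no subtract-128 bias is needed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* |a - b| for 16 unsigned bytes. The saturating subtract clamps negative
   results to 0, so one of the two differences is 0 and the other is the
   absolute difference; OR combines them. */
static inline __m128i absdiff_epu8(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
}

/* Apply it to n bytes; n is assumed to be a multiple of 16. */
static void absdiff_bytes(uint8_t *dst, const uint8_t *x, const uint8_t *y, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(y + i));
        _mm_storeu_si128((__m128i *)(dst + i), absdiff_epu8(a, b));
    }
}
```

Writing the result with _mm_storeu_si128 also sidesteps the extraction question: the modified bytes land directly in an ordinary uint8_t array.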

The integer load intrinsics take __m128i* pointer args, so you have to use _mm_load_si128( (const __m128i*) my_int_pointer ) or similar. This looks like pointer aliasing (e.g. reading an array of int through a short *), which is Undefined Behaviour in C and C++. However, this is how Intel says you're supposed to do it, so any compiler that implements Intel's intrinsics is required to make this work correctly. gcc does so by defining __m128i with __attribute__((may_alias)).

See also "Loading data for GCC's vector extensions", which points out that you can use Intel intrinsics with GNU C native vector extensions, and shows how to load/store.

To learn more about SIMD with SSE, there are some links in the sse tag wiki, including some intro / tutorial links.

The x86 tag wiki has some good x86 asm / performance links.
