How to use the _mm_extract_epi8 function?

Question

I am using the _mm_extract_epi8 (__m128i a, const int imm8) function, which has a const int parameter. When I compile this C++ code, I get the following error message:

Error C2057: expected constant expression

__m128i a;

for (int i=0; i<16; i++)
{
    _mm_extract_epi8(a, i); // compilation error
}

How can I use this function in a loop?

Solution

First of all, you wouldn't want to use it in a loop even if it were possible, and you wouldn't want to fully unroll a loop into 16x pextrb. That instruction costs 2 uops on Intel and AMD CPUs, and will bottleneck on the shuffle port (and on port 0 for vec->int data transfer).

The _mm_extract_epi8 intrinsic requires a compile-time constant index because the pextrb r32/m8, xmm, imm8 instruction is only available with the index as an immediate (embedded into the machine code of the instruction).
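
If you truly do need one pextrb per element despite the cost above, each call needs its own compile-time-constant index. A C++17 sketch of one way to get that, folding over an index sequence so every call sees a constant (for_each_byte and f are illustrative names, not anything from the intrinsics headers; needs SSE4.1 enabled):

#include <immintrin.h>
#include <cstdint>
#include <utility>

template <class F, std::size_t... I>
void for_each_byte(__m128i v, F f, std::index_sequence<I...>)
{
    (f((int8_t)_mm_extract_epi8(v, I)), ...);   // each I is a constant expression
}
// usage: for_each_byte(vec, callback, std::make_index_sequence<16>{});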


If you want to give up on SIMD and write a scalar loop over vector elements, for this many elements you should store/reload. So you should write it that way in C++:

alignas(16) int8_t bytes[16];            // or uint8_t for zero-extending (movzx) reloads
_mm_store_si128((__m128i*)bytes, vec);   // one 16-byte vector store
for(int i=0 ; i<16 ; i++) {
    foo(bytes[i]);                       // 16 cheap scalar reloads
}

The cost of one store (and the store-forwarding latency) is amortized over 16 reloads which only cost 1 movsx eax, byte ptr [rsp+16] or whatever each. (1 uop on Intel and Ryzen). Or use uint8_t for movzx zero-extension to 32-bit in the reloads. Modern CPUs can run 2 load uops per clock, and vector-store -> scalar reload store forwarding is efficient (~6 or 7 cycle latency).


With 64-bit elements, movq + pextrq is almost certainly your best bet. Store + reloads are comparable cost for the front-end and worse latency than extract.
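
A minimal sketch of that pattern (function names are illustrative; these intrinsics need a 64-bit target, and pextrq needs SSE4.1):

#include <immintrin.h>
#include <cstdint>

void use64(uint64_t);                    // hypothetical per-element consumer

void both_halves(__m128i v)
{
    use64(_mm_cvtsi128_si64(v));         // movq: element 0
    use64(_mm_extract_epi64(v, 1));      // pextrq: element 1 (constant index)
}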

With 32-bit elements, it's closer to break-even, depending on your loop. An unrolled ALU extract could be good if the loop body is small. Or you might store/reload but do the first element with _mm_cvtsi128_si32 (movd) for low latency on the first element, so the CPU can be working on that while the store-forwarding latency for the high elements happens.
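
For example (a sketch; foo stands in for whatever per-element work you do):

#include <immintrin.h>
#include <cstdint>

void foo(int32_t);                       // hypothetical per-element consumer

void scan_epi32(__m128i v)
{
    alignas(16) int32_t tmp[4];
    _mm_store_si128((__m128i*)tmp, v);   // store all 4 elements
    foo(_mm_cvtsi128_si32(v));           // element 0 via movd: no store-forwarding wait
    for (int i = 1; i < 4; i++)
        foo(tmp[i]);                     // elements 1..3 come from the reload
}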

With 16-bit or 8-bit elements, it's almost certainly better to store/reload if you need to loop over all 8 or 16 elements.

If your loop makes a non-inline function call for each element, the Windows x64 calling convention has some call-preserved XMM registers, but x86-64 System V doesn't. So if your XMM reg would need to be spilled/reloaded around a function call, it's much better to just do scalar loads since the compiler will have it in memory anyway. (Hopefully it can optimize away the 2nd copy of it, or you could declare a union.)
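
The union version of that could look like this sketch (the names are illustrative; type-punning through a union is well-defined in C and supported as an extension by the mainstream C++ compilers):

#include <immintrin.h>
#include <cstdint>

union vec_bytes {
    __m128i v;
    int8_t  b[16];
};

void bar(int8_t);                 // hypothetical non-inline callee

void call_per_byte(__m128i vec)
{
    vec_bytes u;
    u.v = vec;                    // the compiler has to keep this in memory anyway
    for (int i = 0; i < 16; i++)
        bar(u.b[i]);
}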

See print a __m128i variable for working store + scalar loops for all element sizes.


If you actually want a horizontal sum, or min or max, you can do it with shuffles in O(log n) steps, rather than n scalar loop iterations. See Fastest way to do horizontal float vector sum on x86 (which also covers 32-bit integer).
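
For example, a horizontal sum of four 32-bit ints takes 2 shuffle+add steps instead of a 4-iteration scalar loop. A sketch of that pattern:

#include <immintrin.h>

int hsum_epi32(__m128i v)
{
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));       // swap 64-bit halves
    __m128i sum64 = _mm_add_epi32(v, hi64);
    __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); // swap the two low dwords
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);                                     // movd the low element out
}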

And for summing byte elements, SSE2 has a special case: _mm_sad_epu8(vec, _mm_setzero_si128()). See Sum reduction of unsigned bytes without overflow, using SSE2 on Intel.
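
psadbw against an all-zero vector leaves a partial sum of 8 bytes in the low 16 bits of each 64-bit half; combining the two halves gives the full sum. A sketch:

#include <immintrin.h>

unsigned hsum_epu8(__m128i v)
{
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128()); // two partial sums of 8 bytes each
    __m128i hi  = _mm_unpackhi_epi64(sad, sad);         // bring the high half down
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}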

You can also use that to do signed bytes by range-shifting to unsigned and then subtracting 16*0x80 from the sum. https://github.com/pcordes/vectorclass/commit/630ca802bb1abefd096907f8457d090c28c8327b
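
A sketch of that signed variant: XOR with 0x80 flips each byte's sign bit (adding 128 mod 256), so the bias to subtract afterwards is 16 * 0x80 = 2048:

#include <immintrin.h>

int hsum_epi8(__m128i v)
{
    __m128i unsig = _mm_xor_si128(v, _mm_set1_epi8((char)0x80)); // range-shift [-128,127] to [0,255]
    __m128i sad   = _mm_sad_epu8(unsig, _mm_setzero_si128());
    __m128i hi    = _mm_unpackhi_epi64(sad, sad);
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi)) - 16 * 0x80; // undo the bias
}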
