How much faster are SSE4.2 string instructions than SSE2 for memcmp?


Question

Here is my code in assembler.

Can you embed it in C++ and check it against SSE4, for speed?

I would very much like to see how much SSE4 improves on this. Or is it not worth worrying about at all? Let's check (I don't have hardware support above SSSE3).

{ sse2 strcmp WideChar 32 bit }
function CmpSee2(const P1, P2: Pointer; len: Integer): Boolean;
asm
    push ebx           // Create ebx
    cmp EAX, EDX      // Str = Str2
    je @@true        // to exit true
    test eax, eax   // not Str
    je @@false     // to exit false
    test edx, edx // not Str2
    je @@false   // to exit false
    sub edx, eax              // Str2 := Str2 - Str;
    mov ebx, [eax]           // get Str 4 byte
    xor ebx, [eax + edx]    // Cmp Str2 4 byte
    jnz @@false            // Str <> Str2 to exit false
    sub ecx, 2            // dec 4
    { AnsiChar  : sub ecx, 4 }
    jbe @@true           // ecx <= 0 to exit true
    lea eax, [eax + 4]  // Next 4 byte
    @@To1:
    movdqa xmm0, DQWORD PTR [eax]       // Load Str 16 byte
    pcmpeqw xmm0, DQWORD PTR [eax+edx] // Load Str2 16 byte and cmp
    pmovmskb ebx, xmm0                // Mask cmp
    cmp ebx, 65535                   // Cmp mask
    jne @@Final                     // ebx <> 65535 to goto final
    add eax, 16                    // Next 16 byte
    sub ecx, 8                    // Skip 8 byte (16 wide)
    { AnsiChar  : sub ecx, 16 }
    ja @@To1                     // ecx > 0
    @@true:                       // Result true
    mov eax, 1                 // Set true
    pop ebx                   // Remove ebx
    ret                      // Return
    @@false:                  // Result false
    mov eax, 0             // Set false
    pop ebx               // Remove ebx
    ret                  // Return
    @@Final:
    cmp ecx, 7         // (ebx <> 65535) and (ecx > 7)
    { AnsiChar : cmp ecx, 15 }
    jae @@false       // to exit false
    movzx ecx, word ptr @@mask[ecx * 2 - 2] // ecx = mask[ecx]
    and ebx, ecx                           // ebx = ebx & ecx
    cmp ebx, ecx                          // ebx = ecx
    sete al                              // Equal / Set if Zero
    pop ebx                             // Remove ebx
    ret                                // Return
    @@mask: // array Mersenne numbers
    dw $000F, $003F, $00FF, $03FF, $0FFF, $3FFF
    { AnsiChar
    dw 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383
    }
end;

Simple 32-bit version: https://vk.com/doc297044195_451679410

Answer

You called your function strcmp, but what you've actually implemented is an alignment-required memcmp(const void *a, const void *b, size_t words). Both movdqa and pcmpeqw xmm0, [mem] will fault if the pointer isn't 16B-aligned. (Actually, if a+4 isn't 16B-aligned, because you do the first 4 scalar and increment by 4 bytes.)

With the right startup code and movdqu, you could handle arbitrary alignments (reaching an alignment boundary for the pointer you want to use as a memory operand to pcmpeqw). For convenience, you could require that both pointers are wide-char-aligned to start with, but you don't need to (especially since you're just returning true/false, not negative / 0 / positive as a sort order.)

You're asking about performance of SSE2 pcmpeqw vs. pcmpistrm, right? (The explicit-length SSE4.2 instructions like pcmpestrm have worse throughput than the implicit-length versions, so use the implicit-length versions in your main loop when you're not close to the end of the string. See Agner Fog's instruction tables and microarch guide).
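
For reference, here is roughly how the implicit-length pcmpistri family gets used from C++ intrinsics: a minimal sketch for 8-bit NUL-terminated strings rather than the asker's explicit-length wide chars, and it assumes reading a full 16 bytes past the terminators is safe, which in general it isn't (see the page-boundary note further down). Compilers normally fold the two flag-test intrinsics into a single pcmpistri.

#include <nmmintrin.h>   // SSE4.2 string intrinsics (pcmpistri and friends)
#include <cstddef>

// Sketch: equality of two NUL-terminated byte strings using the implicit-length
// SSE4.2 compare.  Assumes it is safe to read 16 bytes past the terminators.
static bool streq_sse42(const char *a, const char *b)
{
    const int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                     _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT;
    for (size_t i = 0; ; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        if (_mm_cmpistrc(va, vb, mode))   // CF: some element didn't compare equal
            return false;
        if (_mm_cmpistrz(va, vb, mode))   // ZF: hit a terminator with no difference
            return true;
    }
}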

For memcmp (or carefully-implemented strcmp), the best you can do with SSE4.2 is slower than the best you can do with SSE2 (or SSSE3) on most CPUs. Maybe useful for very short strings, but not for the main loop of memcmp.

On Nehalem: pcmpistri is 4 uops, 2c throughput (with a memory operand), so with no other loop overhead, it can keep up with memory. (Nehalem only has 1 load port). pcmpestri has 6c throughput: 3x slower.

On Sandybridge through Skylake, pcmpistri xmm0, [eax] has 3c throughput, so it's a factor of 3 too slow to keep up with 1 vector per clock (2 load ports). pcmpestri has 4c throughput on most of those, so it's not as much worse. (Maybe useful for the last partial-vector, but not in the main loop).

On Silvermont/KNL, pcmpistrm is the fastest, and runs at one per 14 cycle throughput, so it's total garbage for simple stuff.

On AMD Jaguar, pcmpistri is 2c throughput, so it might actually be usable (only one load port). pcmpestri is 5c throughput, so it sucks.

On AMD Ryzen, pcmpistri is also 2c throughput, so it's crap there. (2 load ports and 5 uops per clock front-end throughput (or 6 uops if some or all are from multi-uop instructions) mean you can go faster.)

On AMD Bulldozer-family, pcmpistri has 3c throughput until Steamroller, where it's 5c. pcmpestri has 10c throughput. They're micro-coded as 7 or 27 m-ops, so AMD didn't spend a lot of silicon on them.

On most CPUs, they're only worth it if you're taking full advantage of them for stuff you can't do with just pcmpeq/pmovmskb. But if you can use AVX2 or especially AVX512BW, even doing complicated things might be faster with more instructions on wider vectors. (There are no wider versions of the SSE4.2 string instructions.) Maybe the SSE4.2 string instructions are still useful for functions that usually deal with short strings, because wide vector loops usually need more startup / cleanup overhead. Also, in a program that doesn't spend much time in SIMD loops, using AVX or AVX512 in one small function will still reduce your max turbo clock speed for the next millisecond or so, and could easily be a net loss.
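
For example, with AVX2 the same pcmpeq/pmovmskb idea processes 32 bytes per compare. A minimal sketch (assumes unaligned loads are fine and the length is a multiple of 16 wide chars; tail handling omitted):

#include <immintrin.h>   // AVX2 intrinsics
#include <cstdint>
#include <cstddef>

// Sketch: compare 'words' 16-bit elements, 16 per 32-byte vector.
// Assumes words is a multiple of 16; tail handling omitted.
static bool memeq16_avx2(const uint16_t *a, const uint16_t *b, size_t words)
{
    for (size_t i = 0; i < words; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i eq = _mm256_cmpeq_epi16(va, vb);
        if (_mm256_movemask_epi8(eq) != -1)   // any byte of the compare mask clear?
            return false;
    }
    return true;
}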

A good inner loop should bottleneck on load throughput, or come as close as possible. movdqu / pcmpeqw [one-register] / pmovmskb / macro-fused cmp+jcc is only 4 fused-domain uops, so this is almost achievable on Sandybridge-family CPUs.
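
In C++ with intrinsics, that inner loop looks roughly like this (a sketch only; whether the compiler actually emits the 4-uop movdqu / pcmpeqw / pmovmskb / cmp+jcc sequence depends on its codegen):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

// Sketch of the SSE2 main loop: 8 wide chars (16 bytes) per iteration.
// Assumes words is a multiple of 8; tail handling omitted.
static bool memeq16_sse2(const uint16_t *a, const uint16_t *b, size_t words)
{
    for (size_t i = 0; i < words; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));   // movdqu
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i eq = _mm_cmpeq_epi16(va, vb);                     // pcmpeqw
        if (_mm_movemask_epi8(eq) != 0xFFFF)                      // pmovmskb + cmp/jne
            return false;
    }
    return true;
}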

See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for an implementation and some benchmarks, but that's for C-style implicit-length strings where you have to check for 0 bytes. It looks like you're using explicit-length strings, so after checking that the lengths are equal, it's just memcmp. (Or I guess if you need to find the sort order instead of just equal / not-equal, you'd have to memcmp out to the end of the shorter string.)
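
In other words, for the explicit-length equal/not-equal case the portable baseline is just a length check plus memcmp, something like the trivial sketch below (a good libc memcmp is already vectorized):

#include <cstring>
#include <cstdint>
#include <cstddef>

// Equality of two explicit-length wide-char strings: compare lengths, then bytes.
static bool wstr_eq(const uint16_t *a, size_t alen, const uint16_t *b, size_t blen)
{
    return alen == blen && std::memcmp(a, b, alen * sizeof(uint16_t)) == 0;
}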

For strcmp with 8-bit strings, on most CPUs it's faster not to use the SSE4.2 string instructions. See the comments on the strchr.com article for some benchmarks (of that implicit-length string version). glibc for example doesn't use the SSE4.2 string instructions for strcmp, because they're not faster on most CPUs. They might be a win for strstr though.

glibc has several SSE2/SSSE3 asm strcmp and memcmp implementations. (It's LGPLed, so you can't just copy it into non-GPL projects, but have a look at what they do.) Some of the string functions (like strlen) only branch per 64 bytes, and then come back to sort out which byte within the cache line had the hit. But their memcmp implementation just unrolls with movdqu / pcmpeqb. You can use pcmpeqw since you want to know the position of the first 16-bit element that's different, rather than the first byte.

Your SSE2 implementation could be even faster. You should use the indexed addressing mode with movdqa since it won't micro-fuse with pcmpeqw (on Intel Sandybridge/Ivybridge; fine on Nehalem or Haswell+), but pcmpeqw xmm0, [eax] will stay micro-fused without unlaminating.

You should unroll a couple times to reduce loop overhead. You should combine the pointer-increment with the loop counter so you cmp/jb instead of sub/ja: macro-fusion on more CPUs, and avoids writing a register (reducing the amount of physical registers needed for register-renaming).
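
At the source level that corresponds to indexing everything off one pointer plus a fixed offset, unrolling, and looping on an end-pointer compare. A sketch (the exact cmp/jb codegen is up to the compiler; tail handling omitted):

#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

// Sketch: unrolled 2x (32 bytes per iteration), one incremented pointer plus a
// fixed pointer difference, loop condition is a compare against an end pointer.
// Assumes words is a multiple of 16; tail handling omitted.
static bool memeq16_unrolled(const uint16_t *a, const uint16_t *b, size_t words)
{
    const char *p   = (const char *)a;
    const char *end = p + words * 2;
    ptrdiff_t  off  = (const char *)b - (const char *)a;   // like the sub edx,eax trick
    for (; p != end; p += 32) {                            // cmp/jb-style loop condition
        __m128i a0 = _mm_loadu_si128((const __m128i *)p);
        __m128i b0 = _mm_loadu_si128((const __m128i *)(p + off));
        __m128i a1 = _mm_loadu_si128((const __m128i *)(p + 16));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(p + off + 16));
        __m128i eq = _mm_and_si128(_mm_cmpeq_epi16(a0, b0), _mm_cmpeq_epi16(a1, b1));
        if (_mm_movemask_epi8(eq) != 0xFFFF)               // one branch per 32 bytes
            return false;
    }
    return true;
}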

Your inner loop, on Intel Sandybridge/Ivybridge, will run like this:

@@To1:
movdqa xmm0, DQWORD PTR [eax]       // 1 uop
pcmpeqw xmm0, DQWORD PTR [eax+edx] // 2 uops on Intel SnB/IvB, 1 on Nehalem and earlier or Haswell and later.
pmovmskb ebx, xmm0                // 1 uop
cmp ebx, 65535
jne @@Final                     // 1 uop  (macro-fused with cmp)
add eax, 16                    // 1 uop
sub ecx, 8
{ AnsiChar  : sub ecx, 16 }
ja @@To1                     // 1 uop (macro-fused with sub on SnB and later, otherwise 2)

This is 7 fused-domain uops, so it can only issue from the front-end at best 7/4 cycles per iteration on mainstream Intel CPUs. This is very far from bottlenecking on 2 loads per clock. On Haswell and later, it's 6/4 cycles per iteration, because indexed addressing modes can stay micro-fused with a 2-operand load-modify instruction like pcmpeqw, but not with anything else (like pabsw xmm0, [eax+edx] (doesn't read the destination) or AVX vpcmpeqw xmm0, xmm0, [eax+edx] (3 operands)). See Micro fusion and addressing modes.

This could also be more efficient for small strings, with better setup/cleanup.

In your pointer-setup code, you could save a cmp if you check for NULL pointers first. You can sub / jne to subtract and check for both equal with the same macro-fused compare and branch. (It will only macro-fuse on Intel Sandybridge-family, and only Haswell can make 2 macro-fusions in a single decode block. But Haswell/Broadwell/Skylake CPUs are common and becoming ever more common, and this has no downside for other CPUs unless equal-pointers is so common that doing that check first matters.)

In your return path: Always use xor eax,eax to zero a register whenever possible, not mov eax, 0.

You don't seem to avoid reading from past the end of the string. You should test your function with strings that end right at the end of a page, where the next page is unmapped.
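
One way to construct such a test on a POSIX system is to map some pages, turn the last one into a guard page with mprotect, and place the buffer so it ends right at the guard page. A sketch (assumes mmap/mprotect and MAP_ANONYMOUS are available):

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Returns a pointer to a buffer of 'bytes' bytes that ends exactly at a page
// boundary, with the following page unmapped, so any read past the end faults.
static void *alloc_ending_at_page_end(size_t bytes)
{
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = ((bytes + page - 1) / page + 1) * page;   // data pages + 1 guard page
    char *base = (char *)mmap(nullptr, total, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return nullptr;
    mprotect(base + total - page, page, PROT_NONE);          // guard page at the end
    return base + total - page - bytes;                      // buffer ends at the guard page
}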

xor ebx, [eax + edx] has zero advantages over cmp for the early-out scalar test. cmp can macro-fuse with the jcc, but xor can't.

You load a mask to handle the cleanup to cover the case where you read past the end of the string. You could probably still use the usual bsf to find the first difference in the bitmap. I guess invert it with not to find the first position that didn't compare equal, and check that that's less than the remaining string length.
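
That cleanup might look roughly like the following at the intrinsics level (a sketch; remaining_words is assumed to be the number of valid wide chars left in the final 16-byte block, between 1 and 7):

#include <emmintrin.h>
#include <cstdint>

// Final partial block: find the first non-equal wide char with ctz (bsf-style)
// and check whether it falls within the valid remainder.  Sketch only.
static bool tail_equal(const uint16_t *a, const uint16_t *b, unsigned remaining_words)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);   // may read past the end!
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    unsigned neq = ~_mm_movemask_epi8(_mm_cmpeq_epi16(va, vb)) & 0xFFFFu;
    if (neq == 0)
        return true;                               // even the over-read part matched
    unsigned first_bad_word = __builtin_ctz(neq) >> 1;
    return first_bad_word >= remaining_words;      // difference only in the over-read part
}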

Or you could generate the mask on the fly with mov eax, -1 and shr, I think. Or for loading it, you can sometimes use a sliding window into a ...,0,0,0,-1,-1,-1,... array, but you need sub-byte offsets so that doesn't work. (It works well for vector masks, if you wanted to mask and redo the pmovmskb. Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all).
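
Generating the mask on the fly is just a shift of an all-ones constant by the remaining length, e.g. (sketch, same remaining_words assumption as above):

#include <cstdint>

// Mask with one bit set per valid byte of the final block:
// remaining_words in 1..7 wide chars -> 2..14 low bits set.
static inline uint32_t tail_mask(unsigned remaining_words)
{
    return 0xFFFFu >> (16 - 2 * remaining_words);   // like mov eax,-1 / shr
}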

Your way isn't bad, as long as it doesn't cache miss. I'd probably go for generating the mask on the fly. Maybe before the loop in another register, because you can mask to get count % 8, so the mask-generation can happen in parallel with the loop.
