How much faster are SSE4.2 string instructions than SSE2 for memcmp?


Question

Here is my code, written in assembler.

Can you embed it in C++ and benchmark it against SSE4 for speed?

I would very much like to see how much SSE4 improves on this, or whether it is not worth worrying about at all. Let's check (my CPU does not support anything above SSSE3).

{ SSE2 equality compare for WideChar strings, 32-bit }
function CmpSee2(const P1, P2: Pointer; len: Integer): Boolean;
asm
    push ebx                  // save ebx (callee-saved)
    cmp EAX, EDX              // P1 = P2?
    je @@true                 // same pointer: equal
    test eax, eax             // P1 = nil?
    je @@false
    test edx, edx             // P2 = nil?
    je @@false
    sub edx, eax              // edx := P2 - P1, so [eax + edx] addresses P2
    mov ebx, [eax]            // first 4 bytes of P1
    xor ebx, [eax + edx]      // compare with the first 4 bytes of P2
    jnz @@false               // they differ: not equal
    sub ecx, 2                // 2 wide chars (4 bytes) consumed
    { AnsiChar  : sub ecx, 4 }
    jbe @@true                // nothing left: equal
    lea eax, [eax + 4]        // advance past the first 4 bytes
    @@To1:
    movdqa xmm0, DQWORD PTR [eax]        // load 16 bytes of P1 (needs 16-byte alignment)
    pcmpeqw xmm0, DQWORD PTR [eax+edx]   // compare with 16 bytes of P2, word by word
    pmovmskb ebx, xmm0                   // one mask bit per byte
    cmp ebx, 65535                       // all 16 bits set means all equal
    jne @@Final                          // some element differs (possibly past the end)
    add eax, 16                          // next 16 bytes
    sub ecx, 8                           // 8 wide chars (16 bytes) consumed
    { AnsiChar  : sub ecx, 16 }
    ja @@To1                             // loop while elements remain
    @@true:
    mov eax, 1                // Result := True
    pop ebx                   // restore ebx
    ret
    @@false:
    mov eax, 0                // Result := False
    pop ebx                   // restore ebx
    ret
    @@Final:
    cmp ecx, 7                // mismatch seen; is ecx >= 7?
    { AnsiChar : cmp ecx, 15 }
    jae @@false               // yes: the difference is inside the string
    movzx ecx, word ptr @@mask[ecx * 2 - 2] // ecx := mask for the remaining elements
    and ebx, ecx              // keep only the bits inside the string
    cmp ebx, ecx              // are all of those elements equal?
    sete al                   // Result := (ebx = ecx)
    pop ebx                   // restore ebx
    ret
    @@mask: // table of low-bit masks (values of the form 2^k - 1)
    dw $000F, $003F, $00FF, $03FF, $0FFF, $3FFF
    { AnsiChar
    dw 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383
    }
end;

Sample (32-bit): https://vk.com/doc297044195_451679410

Answer

You called your function strcmp, but what you've actually implemented is an alignment-required memcmp(const void *a, const void *b, size_t words). Both movdqa and pcmpeqw xmm0, [mem] will fault if the pointer isn't 16B-aligned. (Actually, they fault if a+4 isn't 16B-aligned, because you handle the first 4 bytes with scalar code and then increment the pointer by 4 bytes.)

With the right startup code and movdqu, you could handle arbitrary alignments (reaching an alignment boundary for the pointer you want to use as a memory operand to pcmpeqw). For convenience, you could require that both pointers are wide-char-aligned to start with, but you don't need to (especially since you're just returning true/false, not negative / 0 / positive as a sort order.)
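As a rough illustration of the movdqu approach, here is a minimal C++ sketch using SSE2 intrinsics. The function name equal_wide_sse2 is made up for this example, and it simplifies things by using unaligned loads for both inputs instead of first reaching an alignment boundary. It only answers equal / not equal, and it assumes both pointers are non-null.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if the first `count` 16-bit elements of a and b
// are equal. No alignment requirement on either pointer.
bool equal_wide_sse2(const void* a, const void* b, std::size_t count) {
    assert(a != nullptr && b != nullptr);  // assumption: caller handles null
    const unsigned char* pa = static_cast<const unsigned char*>(a);
    const unsigned char* pb = static_cast<const unsigned char*>(b);
    const std::size_t bytes = count * 2;
    std::size_t i = 0;

    // Main loop: only whole 16-byte vectors, so it never reads past the end.
    for (; i + 16 <= bytes; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pa + i));  // movdqu
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pb + i));  // movdqu
        __m128i eq = _mm_cmpeq_epi16(va, vb);                                    // pcmpeqw
        if (_mm_movemask_epi8(eq) != 0xFFFF)                                     // pmovmskb
            return false;
    }

    // Tail: fewer than 16 bytes left, compared without over-reading.
    return std::memcmp(pa + i, pb + i, bytes - i) == 0;
}
```

The scalar tail is the simplest way to stay inside the buffers; a tuned version would more likely use a final overlapping unaligned vector or a masked compare.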


You're asking about performance of SSE2 pcmpeqw vs. pcmpistrm, right? (The explicit-length SSE4.2 instructions like pcmpestrm have worse throughput than the implicit-length versions, so use the implicit-length versions in your main loop when you're not close to the end of the string. See Agner Fog's instruction tables and microarch guide).

For memcmp (or carefully-implemented strcmp), the best you can do with SSE4.2 is slower than the best you can do with SSE2 (or SSSE3) on most CPUs. Maybe useful for very short strings, but not for the main loop of memcmp.

On Nehalem: pcmpistri is 4 uops, 2c throughput (with a memory operand), so with no other loop overhead, it can keep up with memory. (Nehalem only has 1 load port). pcmpestri has 6c throughput: 3x slower.

On Sandybridge through Skylake, pcmpistri xmm0, [eax] has 3c throughput, so it's a factor of 3 too slow to keep up with 1 vector per clock (2 load ports). pcmpestri has 4c throughput on most of those, so it's not as much worse. (Maybe useful for the last partial-vector, but not in the main loop).

On Silvermont/KNL, pcmpistrm is the fastest of these instructions, and its throughput is only one per 14 cycles, so it's total garbage for simple stuff.

On AMD Jaguar, pcmpistri is 2c throughput, so it might actually be usable (only one load port). pcmpestri is 5c throughput, so it sucks.

On AMD Ryzen, pcmpistri is also 2c throughput, so it's crap there. (2 load ports and 5 uops per clock front-end throughput (or 6 uops if any (or all?) are from multi-uop instructions) mean you can go faster.)

On AMD Bulldozer-family, pcmpistri has 3c throughput until Steamroller, where it's 5c. pcmpestri has 10c throughput. They're micro-coded as 7 or 27 m-ops, so AMD didn't spend a lot of silicon on them.

On most CPUs, they're only worth it if you're taking full advantage of them for stuff you can't do with just pcmpeq/pmovmskb. But if you can use AVX2 or especially AVX512BW, even doing complicated things might be faster with more instructions on wider vectors. (There are no wider versions of the SSE4.2 string instructions.) Maybe the SSE4.2 string instructions are still useful for functions that usually deal with short strings, because wide vector loops usually need more startup / cleanup overhead. Also, in a program that doesn't spend much time in SIMD loops, using AVX or AVX512 in one small function will still reduce your max turbo clock speed for the next millisecond or so, and could easily be a net loss.
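For reference, this is roughly what using the explicit-length instruction looks like from C++. It is only a sketch: first_diff_pcmpestri is a hypothetical name, the target attribute assumes GCC or Clang, and the inputs are copied into local buffers so the 16-byte loads cannot read past the caller's data. It handles at most 8 wide chars, which is the "last partial vector" case rather than a main loop.

```cpp
#include <nmmintrin.h>  // SSE4.2 intrinsics (_mm_cmpestri)
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: index of the first differing 16-bit element among the
// first n elements (0 <= n <= 8) of a and b, or 8 if they all match.
__attribute__((target("sse4.2")))  // assumption: GCC/Clang function-level target
int first_diff_pcmpestri(const unsigned short* a, const unsigned short* b, int n) {
    assert(n >= 0 && n <= 8);
    unsigned short ta[8] = {0}, tb[8] = {0};
    std::memcpy(ta, a, static_cast<std::size_t>(n) * 2);
    std::memcpy(tb, b, static_cast<std::size_t>(n) * 2);

    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ta));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tb));

    // pcmpestri: compare element by element, invert the result, and return
    // the index of the lowest set bit (8 when no element differs).
    return _mm_cmpestri(va, n, vb, n,
                        _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_EACH |
                        _SIDD_NEGATIVE_POLARITY);
}
```

Elements beyond the given lengths are treated as invalid by the instruction, so a difference past n is not reported.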


A good inner loop should bottleneck on load throughput, or come as close as possible. movdqu / pcmpeqw [one-register] / pmovmskb / macro-fused-cmp+jcc is only 4 fused-domain uops, so this is almost achievable on Sandybridge-family CPUs.


See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for an implementation and some benchmarks, but that's for C-style implicit-length strings where you have to check for 0 bytes. It looks like you're using explicit-length strings, so after checking that the lengths are equal, it's just memcmp. (Or I guess if you need to find the sort order instead of just equal / not-equal, you'd have to memcmp out to the end of the shorter string.)

For strcmp with 8-bit strings, on most CPUs it's faster not to use the SSE4.2 string instructions. See the comments on the strchr.com article for some benchmarks (of that implicit-length string version). glibc for example doesn't use the SSE4.2 string instructions for strcmp, because they're not faster on most CPUs. They might be a win for strstr though.


glibc has several SSE2/SSSE3 asm strcmp and memcmp implementations. (It's LGPLed, so you can't just copy it into non-GPL projects, but have a look at what they do.) Some of the string functions (like strlen) only branch per 64 bytes, and then come back to sort out which byte within the cache line had the hit. But their memcmp implementation just unrolls with movdqu / pcmpeqb. You can use pcmpeqw since you want to know the position of the first 16-bit element that's different, rather than the first byte.


Your SSE2 implementation could be even faster. You should use the indexed addressing mode on the movdqa load rather than on pcmpeqw: an indexed memory operand won't stay micro-fused with pcmpeqw (on Intel Sandybridge/Ivybridge; it's fine on Nehalem or Haswell+), whereas pcmpeqw xmm0, [eax] will stay micro-fused without un-laminating.

You should unroll a couple of times to reduce loop overhead. You should combine the pointer increment with the loop counter so you can use cmp/jb instead of sub/ja: that macro-fuses on more CPUs, and it avoids writing a register (reducing the number of physical registers needed for register renaming).
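In C++ with intrinsics, the unrolling idea could look something like the sketch below. The name equal_bytes_unrolled is invented for this example, and the compiler decides the actual instruction selection, so this does not guarantee the cmp/jb loop shape. It compares two vectors per iteration, ANDs the two compare results together, and branches once per 32 bytes, with a single byte offset serving as both index and loop counter.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: equality test over `bytes` bytes, unrolled by two.
bool equal_bytes_unrolled(const void* a, const void* b, std::size_t bytes) {
    assert(a != nullptr && b != nullptr);  // assumption: caller handles null
    const unsigned char* pa = static_cast<const unsigned char*>(a);
    const unsigned char* pb = static_cast<const unsigned char*>(b);
    std::size_t i = 0;  // one offset used for both buffers and the loop bound

    for (; i + 32 <= bytes; i += 32) {
        __m128i a0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pa + i));
        __m128i b0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pb + i));
        __m128i a1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pa + i + 16));
        __m128i b1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pb + i + 16));
        // Both compares must be all-ones; AND them so one test covers 32 bytes.
        __m128i eq = _mm_and_si128(_mm_cmpeq_epi16(a0, b0), _mm_cmpeq_epi16(a1, b1));
        if (_mm_movemask_epi8(eq) != 0xFFFF)
            return false;
    }

    // Fewer than 32 bytes remain; finish without reading past the end.
    return std::memcmp(pa + i, pb + i, bytes - i) == 0;
}
```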

Your inner loop, on Intel Sandybridge/Ivybridge, decodes to the following uops:

@@To1:
movdqa xmm0, DQWORD PTR [eax]       // 1 uop
pcmpeqw xmm0, DQWORD PTR [eax+edx] // 2 uops on Intel SnB/IvB, 1 on Nehalem and earlier or Haswell and later.
pmovmskb ebx, xmm0                // 1 uop
cmp ebx, 65535
jne @@Final                     // 1 uop  (macro-fused with cmp)
add eax, 16                    // 1 uop
sub ecx, 8
{ AnsiChar  : sub ecx, 16 }
ja @@To1                     // 1 uop (macro-fused with sub on SnB and later, otherwise 2)

This is 7 fused-domain uops, so it can only issue from the front-end at best 7/4 cycles per iteration on mainstream Intel CPUs. This is very far from bottlenecking on 2 loads per clock. On Haswell and later, it's 6/4 cycles per iteration, because indexed addressing modes can stay micro-fused with 2-operand load-modify instructions like pcmpeqw, but not with anything else (like pabsw xmm0, [eax+edx] (doesn't read its destination) or AVX vpcmpeqw xmm0, xmm0, [eax+edx] (3 operands)). See "Micro fusion and addressing modes".
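Spelled out, the arithmetic behind those figures is simple division, assuming (as the 7/4 figure implies) a front-end that issues 4 fused-domain uops per cycle and a loop that does 2 loads per 16-byte iteration:

```latex
\[
\text{SnB/IvB: } \frac{7\ \text{uops}}{4\ \text{uops/cycle}} = 1.75\ \text{cycles per iteration},
\qquad
\text{Haswell+: } \frac{6\ \text{uops}}{4\ \text{uops/cycle}} = 1.5\ \text{cycles per iteration}
\]
\[
\text{load-port limit: } \frac{2\ \text{loads}}{2\ \text{loads/cycle}} = 1\ \text{cycle per iteration}
\]
```

So the front-end, not the load ports, is the limit for this loop on those CPUs.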


This could be more efficient for small strings with better setup/cleanup, too.

In your pointer-setup code, you could save a cmp if you check for NULL pointers first. You can sub / jne to subtract and check for both equal with the same macro-fused compare and branch. (It will only macro-fuse on Intel Sandybridge-family, and only Haswell can make 2 macro-fusions in a single decode block. But Haswell/Broadwell/Skylake CPUs are common and becoming ever more common, and this has no downside for other CPUs unless equal-pointers is so common that doing that check first matters.)


In your return path: Always use xor eax,eax to zero a register whenever possible, not mov eax, 0.

You don't seem to avoid reading from past the end of the string. You should test your function with strings that end right at the end of a page, where the next page is unmapped.
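One way to set up such a test on a POSIX system is sketched below. GuardedBuffer, make_guarded and free_guarded are hypothetical names for this example, and mmap / mprotect are assumed to be available (Linux, macOS and similar). The buffer ends exactly where an inaccessible page begins, so any read past the end faults instead of silently succeeding.

```cpp
#include <sys/mman.h>  // mmap, mprotect, munmap (POSIX)
#include <unistd.h>    // sysconf
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical test helper: `data` points at `len` writable bytes that end
// exactly at the start of an inaccessible guard page.
struct GuardedBuffer {
    unsigned char* base;    // start of the whole mapping
    std::size_t map_size;   // size of the whole mapping, guard page included
    unsigned char* data;    // start of the usable bytes
};

GuardedBuffer make_guarded(std::size_t len) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t pages = (len + page - 1) / page + 1;  // +1 for the guard
    void* p = mmap(nullptr, pages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);

    unsigned char* base = static_cast<unsigned char*>(p);
    unsigned char* guard = base + (pages - 1) * page;
    const int rc = mprotect(guard, page, PROT_NONE);  // any access now faults
    assert(rc == 0);
    (void)rc;

    return GuardedBuffer{base, pages * page, guard - len};
}

void free_guarded(const GuardedBuffer& g) {
    munmap(g.base, g.map_size);
}
```

To use it, fill two such buffers with identical contents and call the comparison function on their data pointers; an implementation that reads past the end should crash with a segmentation fault rather than return.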

xor ebx, [eax + edx] has zero advantages over cmp for the early-out scalar test. cmp can macro-fuse with the following jnz, but xor can't.


You load a mask to handle the cleanup to cover the case where you read past the end of the string. You could probably still use the usual bsf to find the first difference in the bitmap. I guess invert it with not to find the first position that didn't compare equal, and check that that's less than the remaining string length.
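The bsf idea could be sketched in C++ like this. The name first_diff_in_block is hypothetical, __builtin_ctz is a GCC/Clang builtin (it typically compiles to bsf or tzcnt), and the sketch assumes that a full 16 bytes are readable at both pointers even when fewer elements are still part of the string.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Hypothetical helper: index of the first differing 16-bit element among the
// first `remaining` elements (<= 8) of two 16-byte blocks, or -1 if none.
// Assumption: all 16 bytes at a and b are readable.
int first_diff_in_block(const void* a, const void* b, unsigned remaining) {
    assert(remaining <= 8);
    __m128i va = _mm_loadu_si128(static_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(static_cast<const __m128i*>(b));

    // pcmpeqw + pmovmskb: two mask bits per 16-bit element, set when equal.
    unsigned eq = static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi16(va, vb)));

    unsigned diff = ~eq & 0xFFFFu;  // invert: set bits now mark differences
    if (diff == 0)
        return -1;                  // the whole block matched

    unsigned idx = static_cast<unsigned>(__builtin_ctz(diff)) / 2;  // bit -> element
    return idx < remaining ? static_cast<int>(idx) : -1;  // ignore hits past the end
}
```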

Or you could generate the mask on the fly with mov eax, -1 and shr, I think. Or for loading it, you can sometimes use a sliding window into a ...,0,0,0,-1,-1,-1,... array, but you need sub-byte offsets so that doesn't work here. (It works well for vector masks, if you wanted to mask and redo the pmovmskb. See "Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all".)

Your way isn't bad, as long as it doesn't cache miss. I'd probably go for generating the mask on the fly. Maybe before the loop in another register, because you can mask to get count % 8, so the mask-generation can happen in parallel with the loop.
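Generating the mask with a shift instead of a table lookup might look like the following sketch. tail_mask and tail_equal are hypothetical names, and the layout assumed is the pmovmskb one for pcmpeqw results (two bits per wide char, lowest element in the lowest bits). It is not meant to reproduce the exact table from the question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: mask with the low 2*n bits set, for n remaining
// 16-bit elements (0 <= n <= 8). Equivalent to loading all-ones and
// shifting right.
inline std::uint32_t tail_mask(unsigned n) {
    assert(n <= 8);
    return 0xFFFFu >> (16u - 2u * n);  // n = 0 gives 0, n = 8 gives 0xFFFF
}

// True if every element inside the string compared equal, given the
// pmovmskb result of a pcmpeqw over the final (partial) vector.
inline bool tail_equal(std::uint32_t eq_mask, unsigned n) {
    const std::uint32_t m = tail_mask(n);
    return (eq_mask & m) == m;
}
```

Since n depends only on the length, tail_mask can be computed before the main loop and kept in a register.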
