pcmpestri character units and countdown - x86-64 asm


Question


I’m trying to write a minimal loop around pcmpestri in x86-64 asm (actually in-line asm embedded in Dlang using the GDC compiler). There are a couple of things that I don’t understand

  1. If you are using pcmpestri with two pointers to strings, are the lengths of the strings in rax and rdx?
  2. If so, what are the units? count in bytes always, or count in chars where 1 count = 2 bytes for uwords ?
  3. Does pcmpestri check for short strings? i.e. len of str1 or str2 < 16 bytes (or 8 uwords if uwords)
  4. Does pcmpestri count rax and rdx down by n per chunk, or do I have to do it? Subtracting either 16 always, or (16 or 8 depending on bytes/uwords)?
  5. Do I have to worry about 128-bit alignment on the fetch below? I could precheck that the string is 128-bit aligned if it’s faster, but then that could get really messy. If I use instructions that don’t require 128-bit alignment how much slower will that be? see below
  6. Is it slower to use lea %[offset], [ %[offset] - 16 ] before the ja ? (chosen as it doesn’t set flags)
  7. Worth loop-unrolling? Or a terrible idea ?
  8. What info do I need to pass back to the hi-level lang code? rcx I know is one thing; the flags too, or can I forget about them? (In an earlier routine I passed back true for cond ‘na’ if the final ja was not taken.)
  9. One final question: what about passing back updated offset?

Leaving out the required preamble, I have:

; having reserved say xmm1 as a working variable

loop:   add       %[offset], 16  ; 16 bytes = nbytes of chunk of string
; do I need to count  lengths of strings down ? by 16 per chunk or by (8 or 16) per chunk ?
        movdqa    xmm1, [ %[pstr1] + %[offset] - 16 ]    ; -16 to compensate for pre-add
        pcmpestri xmm1, [ %[pstr1] + %[offset] - 16 ], 0 ; mode=0 or 1 for uwords
        ja      loop

; what do I do about passing back info to the main code?
; I already pass back rcx = offset-in-chunk, do I need to pass the flags back too?
; I have reserved rcx by declaring it as an output
; what about passing back the value of %[offset]? or passing the counted-down lengths?

I haven’t managed to find examples that feature words rather than bytes.

And for a 1-string usage pattern, where I have reserved say xmm1 as an input argument xmm reg :

loop:   add       %[offset], 16  ; 16 bytes = nbytes of chunk of string
        pcmpestri xmm1, [ %[pstr1] + %[offset] - 16 ], 0 ; mode=0 or 1 for uwords
        ja      loop

Solution

In your main loop (while remaining lengths of both input strings are >=16), use pcmpistri (the implicit-length string version) if you know there are no 0 bytes in your data. pcmpistri is significantly faster and takes fewer uops on most CPUs, perhaps because it only has 3 inputs (including the immediate) instead of 5. (https://uops.info/)
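For example, a minimal sketch of that main-loop / peeled-tail split, using the 1-string pattern from the question to keep it short (xmm1 = the character set, imm8 = 0 as in the question; rsi and r8 are placeholder registers for the read position and the bytes remaining, and the label bodies are only hinted at in comments):

main_loop:
        cmp       r8, 16            ; enough bytes left for a full 16-byte chunk?
        jb        last_chunk        ; no: peel off the final, possibly-partial chunk
        pcmpistri xmm1, [rsi], 0    ; implicit-length form: no rax/rdx length inputs, index -> ecx
        lea       rsi, [rsi + 16]   ; lea doesn't write FLAGS, so pcmpistri's CF reaches the jnc
        lea       r8,  [r8 - 16]
        jnc       main_loop         ; CF clear = no hit in this chunk, keep scanning
        jmp       hit               ; CF set: ecx = index within the chunk that ended at rsi - 16
last_chunk:
        ; fewer than 16 valid bytes remain: finish with pcmpestri and the true remaining length
hit: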

Do I have to worry about 128-bit alignment on the fetch below?

Yes for movdqa of course, but surprisingly the SSE4.2 string instructions don't fault on misaligned memory operands! For the legacy SSE (non-VEX) encoding of all previous instructions (except unaligned mov like movups / movdqu), 16-byte memory operands must be aligned. Intel's manual notes: "additionally, this instruction does not cause #GP if the memory operand is not aligned to 16 Byte boundary".

Of course you still have to avoid crossing into an unmapped page, e.g. for a 5 byte string that starts 7 bytes before an unmapped page, a 16-byte memory operand will still page-fault. (Is it safe to read past the end of a buffer within the same page on x86 and x64?) I don't see any mention of fault-suppression for the "ignored" part of a memory source operand in Intel's manual, unlike with AVX-512 masked loads.

For explicit-length strings, this is easy: you know when you're definitely far from the end of the shorter string, so you can just special case the last iteration. (And you want to do that anyway so you can use pcmpistri in the main loop).

e.g. do an unaligned load that ends at the last byte of the string if it's at least 16 bytes long, or check (p & 4095) <= (4096 - 16) to avoid a page-crossing load when you're fetching the end of a string.
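That check is cheap, e.g. something like this sketch (assuming the pointer for the final chunk is in rsi; the label is made up):

        mov       eax, esi          ; the page offset is in the low 12 bits of the pointer
        and       eax, 4095
        cmp       eax, 4096 - 16
        jbe       full_load_ok      ; a 16-byte load at rsi can't cross into the next page
        ; otherwise: use an unaligned load that ends at the string's last byte instead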

So in practice, if both strings have the same relative alignment you can just handle the unaligned starts of the strings, then get into a loop that uses aligned loads from both (so you can keep using movdqa). That can't page-split and thus can't fault when loading any aligned vector that contains any string bytes.
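Checking for the same relative alignment is also cheap, e.g. (rsi/rdi as placeholder string pointers, label made up):

        mov       eax, esi
        xor       eax, edi
        test      eax, 15            ; do the two pointers agree in their low 4 bits?
        jnz       mixed_alignment    ; no: they can never both be 16-byte aligned at the same time
        ; yes: peel the unaligned start, then movdqa from both strings in the main loop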

Relative misalignment is harder.

For performance, note that SSE4.2 is only supported on Nehalem and newer, where movdqu is relatively efficient (as cheap as movdqa if the pointer happens to be aligned at runtime). I think AMD support is similar: not until Bulldozer, which also has AVX and cheap unaligned loads. Cache-line splits still hurt some, so if you expect large strings to be common then it may be worth hurting the short-string case and/or the already-aligned case by doing some extra checking.

Maybe have a look at what glibc's SSE2 / AVX memcmp implementation does; it has the same problem of reading SIMD vectors from 2 arrays that might be misaligned wrt. each other. (Simple bytewise equality is faster with pcmpeqb so it wouldn't use SSE4.2 string instructions, but the problem of which SIMD vectors to load is the same).


Does pcmpestri check for short strings?

Yes, that's the whole point of taking 2 input lengths (in RAX for XMM1, and RDX for XMM2). See Intel's asm manual entry for pcmpestri.

Does pcmpestri count rax and rdx down by n per chunk or do I have to do it

You have to do it if that's what you want; pcmpestri looks at the first RAX bytes/words of XMM1 (up to 16 / 8), and the first RDX bytes (words) of XMM2/mem (up to 16 / 8), and outputs to ECX and EFLAGS. That is all. Again, Intel's manual is pretty clear about this. (Although pretty complicated to understand the actual aggregation and compare options!)

If you wanted to use it in a loop, you could just leave those registers set to 16 and compute them properly for a peeled final iteration after the loop. Or you could decrement each of them by 16 every iteration; pcmpestri appears to be designed for that, setting ZF and/or SF if EDX and/or EAX are < 16 (or 8), respectively.
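For example, counting down for the question's 1-string pattern could look like this sketch (xmm1 = the character set with its constant length in rax, rdi = a placeholder string pointer, rdx = bytes remaining, imm8 = 0 as in the question):

cmp_loop:
        pcmpestri xmm1, [rdi], 0    ; reads rax (set length) and rdx (bytes left); misaligned mem operand is OK
        lea       rdi, [rdi + 16]   ; lea doesn't write FLAGS, so pcmpestri's results survive to the ja
        lea       rdx, [rdx - 16]   ; pcmpestri does NOT count rdx down itself
        ja        cmp_loop          ; continue while CF=0 (no hit) and ZF=0 (the rdx it saw was >= 16)
        ; NB: the iteration where rdx < 16 still does a 16-byte read; see the page-crossing discussion above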


See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a useful high-level picture of the processing steps the SSE4.2 string instructions do, so you can figure out how to design useful ways to use them. And some examples, like implementing strcmp and strlen. Intel's detailed documentation in the SDM gets bogged down in details and makes the big picture hard to take in.

(A good unrolled SSE2 implementation can beat SSE4.2 for those simple functions, but a simple problem makes a good example.)


What info do I need to pass back to the hi-level lang code?

Ideally you'd have proper intrinsics, not just wrappers for inline asm.

It probably depends what the high-level code wants to do with it, although for pcmpestri specifically, all the information is present in ECX (the integer result): CF is set iff the result mask (IntRes2) is non-zero, which with least-significant-index output means ECX != 16 (or 8 for words), and OF is bit 0 of that mask, i.e. ECX == 0. If GDC has GCC6 flag-output syntax, it wouldn't hurt I guess, unless it tricks the compiler into making worse code to receive those outputs.
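So the high-level side can branch on ECX alone after the loop, e.g. (byte granularity, least-significant-index output; labels made up):

        cmp       ecx, 16
        je        no_hit             ; ECX == 16  <=>  IntRes2 == 0  <=>  CF was clear
        test      ecx, ecx
        jz        hit_at_position_0  ; ECX == 0   <=>  IntRes2 bit 0 set  <=>  OF was set
        ; 0 < ECX < 16: hit at byte index ECX within the last chunk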

If you are using inline-asm to basically create intrinsics for SSE4.2 string instructions, it might be worth looking at Intel's design for C intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/.

e.g. one for the ECX result, int _mm_cmpestri (__m128i a, int la, __m128i b, int lb, const int mode);
and one for each separate FLAG output bit, like _mm_cmpestro.

However, there are flaws in Intel's design. For example, with the implicit-length string version at least, I remember that the only way to get an integer result and get the compiler to branch on FLAGS directly from the instruction was to use two different intrinsics with the same inputs, and depend on the compiler optimizing them together.

With inline asm, it's easy to describe multiple outputs and have unused ones be optimized away. But unfortunately C doesn't have syntax for multiple return values, and I guess Intel didn't want to have an intrinsic with a by-reference output arg as well as a return value.

Is it slower to use lea %[offset], [ %[offset] - 16 ] before the ja ? (chosen as it doesn’t set flags)

I'd do the movdqa load first, then add, then pcmpistri. That keeps the movdqa addressing mode simpler and smaller, and lets the first iteration's load start executing 1 cycle earlier, without waiting for the latency of an add (if the index was on the critical path; it might not be if you started at 0)
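i.e. something like this reordering of the question's loop (sketch only; rsi/rdi are placeholder string pointers and r8 a placeholder offset register, since pcmp*stri overwrites ecx):

loop:
        movdqa    xmm1, [rsi + r8]            ; load first: no displacement needed, and it can start a cycle sooner
        add       r8, 16                      ; add before pcmpistri, so pcmpistri's FLAGS reach the ja intact
        pcmpistri xmm1, [rdi + r8 - 16], 0
        ja        loop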

Using an indexed addressing mode is probably not harmful here (a multi-uop instruction like pcmpe/istri probably can't micro-fuse a load anyway, and movdqa / movdqu don't care). But in other cases it can be worth it to unroll and use pointer increments instead: Micro fusion and addressing modes

It might be worth unrolling by 2. I'd suggest counting uops to see if it's just above a multiple of 4, and/or trying it on a couple CPUs like Skylake and Zen.
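A sketch of what unrolling by 2 with pointer increments could look like (same placeholder registers and imm8 as above, both pointers assumed 16-byte aligned here; the exit paths still have to work out which chunk the hit was in):

loop:
        movdqa    xmm1, [rsi]
        pcmpistri xmm1, [rdi], 0
        jbe       first_chunk_done      ; CF (hit) or ZF (end of data) in the first chunk
        movdqa    xmm1, [rsi + 16]
        pcmpistri xmm1, [rdi + 16], 0
        lea       rsi, [rsi + 32]       ; pointer increments instead of a scaled index,
        lea       rdi, [rdi + 32]       ;   via lea so pcmpistri's FLAGS survive to the ja
        ja        loop
        ; fall-through: hit or end was in the second chunk, which now starts at rdi - 16
first_chunk_done:
        ; ecx = index within the chunk at rdi (pointers not advanced on this path)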
