8086 中的字符串比较 [英] string comparison in 8086

查看：26 发布时间：2021/9/29 19:37:58 string assembly x86 x86-16 strcmp

本文介绍了8086 中的字符串比较的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我对这个问题有疑问.我不知道它要我做什么.

问题:编写一个过程，将 DS:SI 处的源字符串与 ES:DI 处的目标字符串进行比较，并相应地设置标志.如果源小于目标，则设置进位标志.如果字符串相等，则设置零标志.如果源大于目标，则零和进位标志都被清除.

我的回答:

MOV ESI , STRING1MOV EDI, STRING2MOV ECX，计数CLDRPE CMPSB

我仍然不确定.这是真的还是我应该尝试其他的东西?

ps:我不明白为什么人们会否决这个问题.我的问题有什么问题?我想我们都是来这里学习的.或不 ?想我吗?

解决方案

如果问题陈述说在您被调用时指针已经在 SI 和 DI 中，你不应该破坏他们.

16 位代码通常不会对所有函数都遵循单一的调用约定，并且在寄存器中传递(前几个)参数通常是好的(更少的指令，并避免存储/重新加载).32 位 x86 调用约定通常使用堆栈参数，但这已经过时了.Windows x64 和 Linux/Mac x86-64 System V ABI/调用约定都使用寄存器参数.

<小时>

不过，问题陈述没有提到计数.因此，您正在为以零字节结尾的字符串实现 strcmp，而不是为已知长度的内存块实现 memcmp.t 使用单个 rep 指令，因为您需要检查不相等和字符串结尾.如果你只是传递一些大尺寸，并且字符串是相等，repe cmpsb会继续通过终止符.

repe cmpsb 如果您知道任一字符串的长度，则可以使用.例如在 CX 中取一个长度 arg 以避免在两个字符串中都超过终止符的问题.

但就性能而言，repe cmpsb 无论如何都不是很快(例如在 Skylake 与 Ryzen 上每次比较需要 2 到 3 个周期.甚至在 Bulldozer 系列上每次比较需要 4 个周期).只有 rep movs 和 rep stos 在现代 CPU 上是有效的，优化的微码一次复制或存储 16(或 32 或 64)个字节.

在内存中存储字符串有 2 个主要约定:显式长度字符串(指针 + 长度)，如 C++ std::string 和 隐式长度字符串，您只有一个指针，字符串的结尾由哨兵/终止符标记.(例如使用 0 字节的 C char* 或使用 '$' 作为终止符的 DOS 字符串打印函数.)<小时>

一个有用的观察是您只需要检查一个字符串中的终止符.如果另一个字符串有终止符而这个没有终止符，这将是不匹配的.

因此，您希望将一个字节从一个字符串加载到寄存器中，并检查它的终止符和另一个字符串的内存.

(如果您需要实际使用 ES:DI 而不仅仅是带有默认 DS 段库的 DI，您可以使用 cmp al, [es: bx + di] (NASM 语法，根据需要进行调整，例如 cmp al, es: [bx + di] 我认为).可能是您打算使用 lodsb 的问题和 scasb，因为 scasb 使用 ES:DI.)

<代码>;;输入:DI 中的 str1 指针，SI 中的 str2 指针;;输出:BX = 不匹配索引，或超过一个终止符.;;标志:ZF=1 表示相等字符串 (je)，ZF=0 表示不匹配 (jne);;clobbers:AL(在返回时保留 str1 的终止符或不匹配的字节)力量:异或 bx, bx.内循环:;做 {移动, [si + bx] ;加载一个源字节cmp al, [di + bx] ;对照另一个字符串检查它jne .mismatch ;如果 (str1[i] != str2[i]) 中断；公司 bx ;索引++测试 al, al ;检查 0. 使用 cmp al, '$' 作为 $ 终止符jnz .innerloop ;}while(str1[i] != 终结符);;fall through (ZF=1) 意味着我们找到了终结者;在 str1 * 和 * str2 中的相同位置，因此它们匹配.不匹配:;我们在不匹配时用 ZF=0 跳到这里;设置 ;可选地从 FLAGS 在 AL 中创建一个整数回复

用法:把指针放在SI/DI中，call strcmp/je match，因为匹配/不匹配状态在FLAGS.如果要将条件转为整数，386 及更高版本的 CPU 允许 sete al 根据 equals 条件 (ZF==1).

使用 sub al, [mem] 而不是 cmp al, [mem]，我们会得到 al = str1[i] - str2[i]，只有在字符串匹配时才给我们一个 0.如果您的字符串仅包含 0..127 的 ASCII 值，则不会导致有符号溢出，因此您可以将其用作有符号返回值，它实际上告诉您哪个字符串在另一个之前/之后排序.(但如果字符串中可能有高位 ASCII 128..255 字节，我们需要先将零或符号扩展到 16 位以避免出现像 (unsigned)5 - (unsigned)254 = (signed)+7 因为 8 位环绕.

当然，有了我们的FLAGS返回值，调用者已经可以使用ja或者jb(无符号比较结果)，或者jg/jl 如果他们想将字符串视为包含 signed char.无论输入字节的范围如何，这都有效.

或者内联这个循环，以便jne mismatch直接跳转到有用的地方.

16 位寻址模式有限，但 BX 可以是base，SI 和 DI 都可以是索引.我使用了索引增量而不是 inc si 和 inc di.使用 lodsb 也是一种选择，甚至可以使用 scasb 将其与其他字符串进行比较.(然后检查终止符.)

<小时>

性能

在一些现代 x86 CPU 上，索引寻址模式可能会更慢，但这确实节省了循环中的指令(因此对于真正的 8086 来说，代码大小很重要).虽然要真正调整为 8086，我认为 lodsb/scasb 将是你最好的选择，替换 mov 负载和 cmp al，[mem]，还有inc bx.如果您的调用约定不能保证，请记住在循环外使用 cld.

如果您关心现代 x86，对于不单独重命名部分寄存器的 CPU，请使用 movzx eax, byte [si+bx] 来打破对 EAX 旧值的错误依赖.(如果您使用 sub al, [str2]，打破错误的 dep 尤为重要，因为这会在 PPro 之外的 CPU 上通过 EAX 将其变成一个 2 周期循环携带的依赖链，通过 Sandybridge.IvyBridge 及更高版本不会将 AL 与 EAX 分开重命名，因此 mov al, [mem] 是一个微融合加载 + 合并 uop.)

如果 cmp al,[bx+di] 微融合负载，并与 jne 宏融合成一个比较和分支 uop，整个循环在 Haswell 上总共可能只有 4 个 uops，并且对于大输入可以以每个时钟 1 次迭代运行.最后的分支预测错误会使小输入的性能变差，除非每次分支都按照足够小的输入进行.请参阅 https://agner.org/optimize/.最近的 Intel 和 AMD 可以每时钟执行 2 次加载.

展开可以摊销 inc bx 成本，但仅此而已.在循环内有一个taken + not-taken 分支时，当前的CPU 的运行速度不会超过每次迭代1 个周期.(请参阅为什么循环总是编译成"do...while"样式(尾跳)?了解有关 do...while 循环结构的更多信息).{}为了更快，我们需要一次检查多个字节.

与 SSE2 每 1 或 2 个周期 16 个字节相比，即使是 1 个字节/周期也非常慢(使用一些巧妙的技巧来避免读取可能出现故障的内存).

请参阅 https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 了解更多信息使用 x86 SIMD 进行字符串比较，以及 glibc 的 SSE2 和后来优化的字符串函数.

<小时>

GNU libc 的后备标量 strcmp 实现看起来不错(从 AT&T 转换为 Intel 语法，但保留了 C 预处理器宏和内容.L() 制作本地标签).

它仅在 SSE2 或更好的版本不可用时使用.有一些小技巧可以检查整个 32 位寄存器中是否有任何零字节，即使没有 SIMD，它也可以让您走得更快，但对齐是一个问题.(如果终止符可以在任何地方，则在一次加载多个字节时必须小心，不要从您不确定确定包含至少 1 个字节的有效数据的任何内存页中读取，否则您可能会出错.)

strcmp:mov ecx,DWORD PTR [esp+0x4]mov edx,DWORD PTR [esp+0x8] # 加载指针参数L(oop): mov al,BYTE PTR [ecx] # movzx eax, byte ptr [ecx] 将避免错误的 depcmp al,BYTE PTR [edx]jne L(neq)公司公司测试 al, aljnz L(哎呀)xorl eax, eax/* 当字符串相等时，指针停在一个以上NUL 终止符的结尾.*/回复L(neq): mov eax, 1mov ecx, -1cmovb eax, ecx回复

I have problem with this question. I don't know what it wants from me.

Question : Write a procedure that compares a source string at DS:SI to a destination string at ES:DI and sets the flags accordingly. If the source is less than the destination, carry flag is set. if string are equal , the zero flag is set. if the source is greater than the destination , the zero and carry flags are both cleared.

My Answer :

MOV ESI , STRING1
MOV EDI, STRING2
MOV ECX, COUNT
CLD
REPE CMPSB

Still I am not sure about it. Is it true or should I try something else ?

p.s: I don't understand why people vote down this question. What is wrong with my question ? I think we are all here for learning. Or not ? Miss I something ?

解决方案

If the problem statement says the pointers are already in SI and DI when you're called, you shouldn't clobber them.

16-bit code often doesn't stick to a single calling convention for all functions, and passing (the first few) args in registers is usually good (fewer instructions, and avoids store/reload). 32-bit x86 calling conventions usually use stack args, but that's obsolete. Both the Windows x64 and Linux/Mac x86-64 System V ABIs / calling conventions use register args.

The problem statement doesn't mention a count, though. So you're implementing strcmp for strings terminated by a zero-byte, rather than memcmp for known-length blocks of memory. You can't use a single rep instruction, since you need to check for non-equal and for end-of-string. If you just pass some large size, and the strings are equal, repe cmpsb would continue past the terminator.

repe cmpsb is usable if you know the length of either string. e.g. take a length arg in CX to avoid the problem of running past the terminator in both strings.

But for performance, repe cmpsb isn't fast anyway (like 2 to 3 cycles per compare, on Skylake vs. Ryzen. Or even 4 cycles per compare on Bulldozer-family). Only rep movs and rep stos are efficient on modern CPUs, with optimized microcode that copies or stores 16 (or 32 or 64) bytes at a time.

There are 2 major conventions for storing strings in memory: Explicit-length strings (pointer + length) like C++ std::string, and implicit length strings where you just have a pointer, and the end of string is marked by a sentinel / terminator. (Like C char* that uses a 0 byte, or DOS string-print functions that use '$' as a terminator.)

A useful observation is that you only need to check for the terminator in one of the strings. If the other string has a terminator and this one doesn't, it will be a mismatch.

So you want to load a byte into a register from one string, and check it for the teminator and against memory for the other string.

(If you need to actually use ES:DI instead of just DI with the default DS segment base, you can use cmp al, [es: bx + di] (NASM syntax, adjust as needed like maybe cmp al, es: [bx + di] I think). Probably the question intended for you to use lodsb and scasb, because scasb uses ES:DI.)

;; inputs:  str1 pointer in DI, str2 pointer in SI
;; outputs: BX = mismatch index, or one-past-the-terminators.
;;          FLAGS:  ZF=1 for equal strings (je),  ZF=0 for mismatch (jne)
;; clobbers: AL (holds str1's terminator or mismatching byte on return)
strcmp:
    xor    bx, bx
.innerloop:                 ; do { 
    mov    al, [si + bx]        ; load a source byte
    cmp    al, [di + bx]        ; check it against the other string
    jne   .mismatch             ; if (str1[i] != str2[i]) break;

    inc    bx                   ; index++

    test   al, al               ; check for 0.  Use cmp al, '$' for a $ terminator
    jnz   .innerloop        ; }while(str1[i] != terminator);

    ; fall through (ZF=1) means we found the terminator
    ; in str1 *and* str2 at the same position, thus they match

.mismatch:               ; we jump here with ZF=0 on mismatch
    ; sete  al           ; optionally create an integer in AL from FLAGS
    ret

Usage: put pointers in SI/DI, call strcmp / je match, because the match / non-match state is in FLAGS. If you want to turn the condition into an integer, 386 and later CPUs allow sete al to create a 0 or 1 in AL according to the equals condition (ZF==1).

Using sub al, [mem] instead of cmp al, [mem], we'd get al = str1[i] - str2[i], giving us a 0 only if the strings matched. If your strings only hold ASCII values from 0..127, that can't cause signed overflow, so you can use it as a signed return value that actually tells you which string sorts before/after the other. (But if there could be high-ASCII 128..255 bytes in the string, we'd need to zero- or sign-extend to 16-bit first to avoid signed overflow for a case like (unsigned)5 - (unsigned)254 = (signed)+7 because of 8-bit wraparound.

Of course, with our FLAGS return value, the caller can already use ja or jb (unsigned compare results), or jg / jl if they want to treat the strings as holding signed char. This works regardless of the range of input bytes.

Or inline this loop so jne mismatch jumps somewhere useful directly.

16-bit addressing modes are limited, but BX can be a base, and SI and DI can both be indices. I used an index increment instead of inc si and inc di. Using lodsb would also be an option, and maybe even scasb to compare it to the other string. (And then check the terminator.)

Performance

Indexed addressing modes can be slower on some modern x86 CPUs, but this does save instructions in the loop (so it's good for true 8086 where code-size matters). Although to really tune for 8086, I think lodsb / scasb would be your best bet, replacing the mov load and cmp al, [mem], and also the inc bx. Just remember to use cld outside the loop if your calling convention doesn't guarantee that.

If you care about modern x86, use movzx eax, byte [si+bx] to break the false dependency on the old value of EAX, for CPUs that don't rename partial registers separately. (Breaking the false dep is especially important if you use sub al, [str2] because that would turn it into a 2-cycle loop-carried dependency chain through EAX, on CPUs other than PPro through Sandybridge. IvyBridge and later doesn't rename AL separately from EAX, so mov al, [mem] is a micro-fused load+merge uop.)

If the cmp al,[bx+di] micro-fuses the load, and macro-fuses with jne into one compare-and-branch uop, the whole loop could be only 4 uops total on Haswell, and could run at 1 iteration per clock for large inputs. The branch mispredict at the end will make small-input performance worse, unless branching goes the way way every time for a small enough input. See https://agner.org/optimize/. Recent Intel and AMD can do 2 loads per clock.

Unrolling could amortize the inc bx cost, but that's all. With a taken + not-taken branch inside the loop, no current CPUs can run this faster than 1 cycle per iteration. (See Why are loops always compiled into "do...while" style (tail jump)? for more about the do{}while loop structure). To go faster we'd need to check multiple bytes at once.

Even 1 byte / cycle is very slow compared to 16 bytes per 1 or 2 cycles with SSE2 (using some clever tricks to avoid reading memory that might fault).

See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for more about using x86 SIMD for string compare, and also glibc's SSE2 and later optimized string functions.

GNU libc's fallback scalar strcmp implementation looks decent (translated from AT&T to Intel syntax, but with the C preprocessor macros and stuff left in. L() makes a local label).

It only uses this when SSE2 or better isn't available. There are bithacks for checking a whole 32-bit register for any zero bytes, which could let you go faster even without SIMD, but alignment is a problem. (If the terminator could be anywhere, you have to be careful when loading multiple bytes at once to not read from any memory pages that you aren't sure contain at least 1 byte of valid data, otherwise you could fault.)

strcmp:
         mov    ecx,DWORD PTR [esp+0x4]
         mov    edx,DWORD PTR [esp+0x8]     # load pointer args

L(oop): mov     al,BYTE PTR [ecx]          # movzx eax, byte ptr [ecx] would be avoid a false dep
        cmp     al,BYTE PTR [edx]
        jne     L(neq)
        inc     ecx
        inc     edx
        test    al, al
        jnz     L(oop)

        xorl    eax, eax
        /* when strings are equal, pointers rest one beyond
           the end of the NUL terminators.  */
        ret

L(neq): mov    eax,  1
        mov    ecx, -1
        cmovb  eax, ecx
        ret

这篇关于8086 中的字符串比较的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

8086 中的字符串比较 [英] string comparison in 8086

问题描述

性能

Performance

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

8086 中的字符串比较 [英] string comparison in 8086

问题描述

性能

Performance

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭