获取最后的行分隔符 [英] Get the last line separator

查看:116
本文介绍了获取最后的行分隔符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我认为这是一个非常常见的任务,应该有一些快速而简洁的解决方案.我有一个四字,我想获得等于0x0A的最低字节位置(在Linux中为换行符).我编写了以下简单程序:

I think this is a pretty common task and there should be some fast and concise solution. I have a quadword and I want to get the lowest byte position which is equal to 0x0A (line break in Linux). I wrote the following simple program:

SYS_exit equ 0x3C

section .text
    global _start

_start:
    mov rax, 0x0A
    mov rbx, [dt]
    mov rcx, 0x07
    loop:
        mov r13, rbx
        and r13, 0xFF
        cmp r13, 0x0A
        jz ok
        shr rbx, 8
        dec rcx
        jnz loop
        jmp fail
    ok:
        mov rax, SYS_exit
        mov rdi, 8
        sub rdi, rcx
        syscall
    fail:
        mov rax, SYS_exit
        mov rdi, -1
        syscall


section .data
    dt: dq 0xAB97450A8733AA1F

它工作得很好. strace ./bin打印

execve("./bin", ["./bin"], [/* 69 vars */]) = 0
exit(5)                                 = ?
+++ exited with 5 +++

但是该程序看起来很难看,实际上我正在寻找一种使该过程尽可能快的方法.您能提供一些优化建议吗?

But the program looks ugly and actually I'm looking for a way to make this as fast as possible. Can you give some optimization advice?

推荐答案

但是程序看起来很丑

But the program looks ugly

恭喜:P

我正在寻找一种使之尽快完成的方法

I'm looking for a way to make this as fast as possible

SSE2是x86-64的基线,因此您应该使用它.您可以在几个指令中执行此操作,使用pcmpeqb/pmovmskb获取字节比较结果的位图,然后使用位扫描指令,例如 bsr (反向扫描会为您提供最高设置位的索引).

SSE2 is baseline for x86-64, so you should use it. You can do this in a couple instructions, using pcmpeqb / pmovmskb to get a bitmap of byte-compare results, then use a bit-scan instruction like bsr (scan reverse gives you the index of the highest set bit).

default  rel      ; don't forget this: RIP-relative addressing is best for loading/storing global data

_start:
    movq    xmm0, [dt]             ; movq xmm0, rdx  or whatever works too.
    pcmpeqb xmm0, [newline_mask]   ; -1 for match, 0 for no-match
    pmovmskb edi, xmm0
    bsr     edi, edi                  ; index of highest set bit

    mov  eax, SYS_exit
    jz  .not_found                 ; BSR sets ZF if the *input* was zero
    ; [dt+rdi] == 0xA
    syscall                        ; exit(0..7)

.not_found:
    mov  edi, -1                   ; exit only cares about the low byte of its arg; a 64-bit -1 is pointless.
    syscall

section .rodata
    align 16
    newline_mask: times 16 db 0x0a

section .data
    dt: dq 0xAB97450A8733AA1F

显然,在循环中,您会将newline_mask保留在寄存器中(然后可以使用AVX vbroadcastss或SSE3 movddup广播加载它,而不需要在内存中保留整个16个字节的常量).

Obviously in a loop you'd keep newline_mask in a register (and then you can broadcast-load it with AVX vbroadcastss, or SSE3 movddup, instead of needing a whole 16 byte constant in memory).

当然,您可以在加载movdqu时一次执行16个字节的操作,或者在使用AVX2时一次执行32字节的操作. 如果您有很大的缓冲区,则基本上是在实现向后的memcmp,并应考虑优化的库实现.,它们可能会将pcmpeqb的结果与por结合在一起用于整个缓存行,因此他们将3/4的pmovmskb工作保存到最后,直到他们找出缓存行的哪个部分被命中为止.

And of course you can do this for 16 bytes at a time with a movdqu load, or 32 bytes at a time with AVX2. If you have a large buffer, you're basically implementing a backwards memcmp and should look at optimized library implementations. They might combine pcmpeqb results for a whole cache line with por, so they save 3/4 of the pmovmskb work until the end when they sort out which part of the cache line had the hit.

如果您关心AMD CPU(bsr较慢),则可以在使用tzcnt之前分别使用test edi,edi/jz测试输入= 0. (tzcnt(x)为您提供31-bsr(x),如果输入为全零,则为32.)如果您可以依靠BMI2的可用性...

If you care about AMD CPUs (where bsr is slow), maybe separately test for input=0 with test edi,edi / jz before using tzcnt. (tzcnt(x) gives you 31-bsr(x), or 32 if the input was all-zero.) If you can depend on BMI2 being available...

如果要通过标量循环进行操作,则可以在寄存器的低字节上使用字节比较,而不用复制和屏蔽该值.

If you wanted to do it with a scalar loop, you could use byte compares on the low byte of a register instead of copying and masking the value.

    ; we test byte 7 first, so start the counter there.
    mov  edi, 7         ; no idea why you were using a 64-bit counter register
   ; loop body runs with edi=7..0
.loop:                ; do{
    rol  rbx, 8         ; high byte becomes low

    cmp  bl, 0xa        ; check the low byte
    je   .found

    dec  edi
    jge  .loop        ; } while(--edi>=0) signed compare
   ; not found falls through with edi=-1

 .found:
    mov  eax, SYS_exit
    syscall           ; exit(7..0) for found, or exit(-1) for not-found

根据对结果的处理方式,可以对循环计数器进行不同的安排.

Depending on what you're doing with the result, you might arrange your loop counter differently.

这篇关于获取最后的行分隔符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆