获取最后的行分隔符 [英] Get the last line separator
问题描述
我认为这是一个非常常见的任务,应该有一些快速而简洁的解决方案.我有一个四字,我想获得等于0x0A的最低字节位置(在Linux中为换行符).我编写了以下简单程序:
I think this is a pretty common task and there should be some fast and concise solution. I have a quadword and I want to get the lowest byte position which is equal to 0x0A (line break in Linux). I wrote the following simple program:
SYS_exit equ 0x3C
section .text
global _start
_start:
mov rax, 0x0A
mov rbx, [dt]
mov rcx, 0x07
loop:
mov r13, rbx
and r13, 0xFF
cmp r13, 0x0A
jz ok
shr rbx, 8
dec rcx
jnz loop
jmp fail
ok:
mov rax, SYS_exit
mov rdi, 8
sub rdi, rcx
syscall
fail:
mov rax, SYS_exit
mov rdi, -1
syscall
section .data
dt: dq 0xAB97450A8733AA1F
它工作得很好. strace ./bin
打印
execve("./bin", ["./bin"], [/* 69 vars */]) = 0
exit(5) = ?
+++ exited with 5 +++
但是该程序看起来很难看,实际上我正在寻找一种使该过程尽可能快的方法.您能提供一些优化建议吗?
But the program looks ugly and actually I'm looking for a way to make this as fast as possible. Can you give some optimization advice?
推荐答案
但是程序看起来很丑
But the program looks ugly
恭喜:P
我正在寻找一种使之尽快完成的方法
I'm looking for a way to make this as fast as possible
SSE2是x86-64的基线,因此您应该使用它.您可以在几个指令中执行此操作,使用pcmpeqb/pmovmskb获取字节比较结果的位图,然后使用位扫描指令,例如 bsr
(反向扫描会为您提供最高设置位的索引).
SSE2 is baseline for x86-64, so you should use it. You can do this in a couple instructions, using pcmpeqb / pmovmskb to get a bitmap of byte-compare results, then use a bit-scan instruction like bsr
(scan reverse gives you the index of the highest set bit).
default rel ; don't forget this: RIP-relative addressing is best for loading/storing global data
_start:
movq xmm0, [dt] ; movq xmm0, rdx or whatever works too.
pcmpeqb xmm0, [newline_mask] ; -1 for match, 0 for no-match
pmovmskb edi, xmm0
bsr edi, edi ; index of highest set bit
mov eax, SYS_exit
jz .not_found ; BSR sets ZF if the *input* was zero
; [dt+rdi] == 0xA
syscall ; exit(0..7)
.not_found:
mov edi, -1 ; exit only cares about the low byte of its arg; a 64-bit -1 is pointless.
syscall
section .rodata
align 16
newline_mask: times 16 db 0x0a
section .data
dt: dq 0xAB97450A8733AA1F
显然,在循环中,您会将newline_mask
保留在寄存器中(然后可以使用AVX vbroadcastss
或SSE3 movddup
广播加载它,而不需要在内存中保留整个16个字节的常量).
Obviously in a loop you'd keep newline_mask
in a register (and then you can broadcast-load it with AVX vbroadcastss
, or SSE3 movddup
, instead of needing a whole 16 byte constant in memory).
当然,您可以在加载movdqu
时一次执行16个字节的操作,或者在使用AVX2时一次执行32字节的操作. 如果您有很大的缓冲区,则基本上是在实现向后的memcmp
,并应考虑优化的库实现.,它们可能会将pcmpeqb
的结果与por
结合在一起用于整个缓存行,因此他们将3/4的pmovmskb
工作保存到最后,直到他们找出缓存行的哪个部分被命中为止.
And of course you can do this for 16 bytes at a time with a movdqu
load, or 32 bytes at a time with AVX2. If you have a large buffer, you're basically implementing a backwards memcmp
and should look at optimized library implementations. They might combine pcmpeqb
results for a whole cache line with por
, so they save 3/4 of the pmovmskb
work until the end when they sort out which part of the cache line had the hit.
如果您关心AMD CPU(bsr
较慢),则可以在使用tzcnt
之前分别使用test edi,edi
/jz
测试输入= 0. (tzcnt(x)
为您提供31-bsr(x)
,如果输入为全零,则为32.)如果您可以依靠BMI2的可用性...
If you care about AMD CPUs (where bsr
is slow), maybe separately test for input=0 with test edi,edi
/ jz
before using tzcnt
. (tzcnt(x)
gives you 31-bsr(x)
, or 32 if the input was all-zero.) If you can depend on BMI2 being available...
如果要通过标量循环进行操作,则可以在寄存器的低字节上使用字节比较,而不用复制和屏蔽该值.
If you wanted to do it with a scalar loop, you could use byte compares on the low byte of a register instead of copying and masking the value.
; we test byte 7 first, so start the counter there.
mov edi, 7 ; no idea why you were using a 64-bit counter register
; loop body runs with edi=7..0
.loop: ; do{
rol rbx, 8 ; high byte becomes low
cmp bl, 0xa ; check the low byte
je .found
dec edi
jge .loop ; } while(--edi>=0) signed compare
; not found falls through with edi=-1
.found:
mov eax, SYS_exit
syscall ; exit(7..0) for found, or exit(-1) for not-found
根据对结果的处理方式,可以对循环计数器进行不同的安排.
Depending on what you're doing with the result, you might arrange your loop counter differently.
这篇关于获取最后的行分隔符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!