How to get gcc to generate decent code that checks if a buffer is full of NUL bytes?
Question
I'm implementing a program that parses tape archives. Part of the parser logic checks for the end-of-archive marker, which is a 512-byte block full of NUL bytes. I wrote the following code for this purpose, expecting gcc to optimize it well:
int is_eof_block(const char usth[static 512])
{
    size_t i;

    for (i = 0; i < 512; i++)
        if (usth[i] != '\0')
            return 0;

    return 1;
}
But to my surprise, gcc still generates terrible code for it, even though I explicitly allow it to access all 512 bytes of the buffer:
is_eof_block:
        leaq    512(%rdi), %rax
        jmp     .L239
        .p2align 4,,10
.L243:
        addq    $1, %rdi
        cmpq    %rax, %rdi
        je      .L242
.L239:
        cmpb    $0, (%rdi)
        je      .L243
        xorl    %eax, %eax
        ret
        .p2align 4,,10
.L242:
        movl    $1, %eax
        ret
I expected gcc to generate something like this, or even SIMD code:
is_eof_block:
        mov     $64,%ecx
        xor     %eax,%eax
        repz scasq
        setz    %al
        ret
How can I rewrite the code so that it is still portable (that is, it uses no non-C99 language extensions and works on architectures that do not support misaligned memory access) but compiles to better machine code on common architectures such as amd64 and AArch32?
I wrote the following microbenchmark to demonstrate the time difference. You can define MISALIGNED to a positive integer to test with misaligned buffers.
#include <stdio.h>
#include <stdlib.h>     /* EXIT_SUCCESS, EXIT_FAILURE */
#include <time.h>

#define TESTS 10000000

#ifndef MISALIGNED
# define MISALIGNED 0
#endif

char testarray[512 + MISALIGNED];

extern int is_eof_block(const char[static 512]);

int main(void)
{
    size_t i, j;
    clock_t begin, end;

    fprintf(stderr, "testing %d times\n", TESTS);

    /* Case 1: the block is all NUL, so every call scans all 512 bytes. */
    fprintf(stderr, "no byte set to 1... ");
    begin = clock();
    for (i = 0; i < TESTS; i++)
        if (!is_eof_block(testarray + MISALIGNED)) {
            fprintf(stderr, "\nWrong test result in iteration %zu!\n", i);
            return EXIT_FAILURE;
        }
    end = clock();
    fprintf(stderr, "%fs\n", (end - begin) / (double)CLOCKS_PER_SEC);

    /* Case 2: exactly one non-NUL byte, moved to a new offset each iteration. */
    fprintf(stderr, "with non-null byte... ");
    begin = clock();
    for (i = j = 0; i < TESTS; i++) {
        testarray[MISALIGNED + j] = '\0';
        j = (j + 47) & 511;
        testarray[MISALIGNED + j] = '1';
        if (is_eof_block(testarray + MISALIGNED)) {
            fprintf(stderr, "\nWrong test result in iteration %zu!\n", i);
            return EXIT_FAILURE;
        }
    }
    end = clock();
    fprintf(stderr, "%fs\n", (end - begin) / (double)CLOCKS_PER_SEC);

    return EXIT_SUCCESS;
}
is_eof_block_c.c
#include <stddef.h>

int is_eof_block(const char test[static 512])
{
    size_t i;

    for (i = 0; i < 512; i++)
        if (test[i] != '\0')
            return 0;

    return 1;
}
is_eof_block_asm.s
        .text
        .globl  is_eof_block
        .type   is_eof_block,@function
        .align  16
is_eof_block:
        mov     $64,%ecx
        xor     %eax,%eax
        repz scasq
        setz    %al
        ret
        .size   is_eof_block,.-is_eof_block
Here is the output with the C implementation of is_eof_block linked in:
testing 10000000 times
no byte set to 1... 2.281250s
with non-null byte... 1.195312s
And here is the output with the assembly version:
testing 10000000 times
no byte set to 1... 0.476562s
with non-null byte... 0.320312s
Both were compiled with gcc 5, with -O3 as the sole optimization option. Passing various -march=... flags didn't change the generated code. The difference is about a factor of four. With a misaligned buffer, the assembly implementation is roughly 3% slower, whereas the C implementation shows no difference.
Recommended answer
Thanks to the genuinely helpful comments on the question, I have decided to go with the original C code. Thank you all for your help!
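For readers who still want to experiment, here is a minimal sketch of two portable C99 rewrites that fit the constraints in the question (no language extensions, byte-wise access only, so no misaligned loads are required). Neither comes from the original thread, and the names is_eof_block_or, is_eof_block_memcmp and BLOCK_SIZE are hypothetical. The first drops the early exit so the loop has a fixed trip count and no data-dependent branch, a shape compilers may be able to vectorize. The second delegates to memcmp, which C libraries often optimize. Whether either actually beats the original depends on the compiler, flags and target, and nothing was measured here.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 512  /* size of one tape archive block, as in the question */

/* Hypothetical variant 1: OR all bytes together instead of exiting early.
 * Every call reads all BLOCK_SIZE bytes, even if the first byte is non-NUL. */
int is_eof_block_or(const char block[static BLOCK_SIZE])
{
    unsigned char acc = 0;
    size_t i;

    for (i = 0; i < BLOCK_SIZE; i++)
        acc |= (unsigned char)block[i];

    return acc == 0;
}

/* Hypothetical variant 2: compare against a static all-zero block and
 * leave the choice of fast path to the C library's memcmp. */
int is_eof_block_memcmp(const char block[static BLOCK_SIZE])
{
    static const char zeros[BLOCK_SIZE] = {0};

    return memcmp(block, zeros, BLOCK_SIZE) == 0;
}
```

The trade-off in the first variant is that it always touches the whole block, so it gives up the early exit that helps the original loop when a non-NUL byte appears near the start. Either function can be linked into the microbenchmark from the question in place of is_eof_block to compare on a given toolchain.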