How to get gcc to generate decent code that checks if a buffer is full of NUL bytes?

Question

I'm implementing a program that parses tape archives. Part of the parser logic is checking for an end-of-archive marker which is a 512-byte block full of NUL bytes. I wrote the following code for this purpose, expecting gcc to optimize this well:

int is_eof_block(const char usth[static 512])
{
    size_t i;

    for (i = 0; i < 512; i++)
        if (usth[i] != '\0')
            return 0;

    return 1;
}

But to my surprise, gcc still generates terrible code for that, even though I explicitly allow it to access the whole 512 bytes in the buffer:

is_eof_block:
    leaq    512(%rdi), %rax
    jmp .L239
    .p2align 4,,10
.L243:
    addq    $1, %rdi
    cmpq    %rax, %rdi
    je  .L242
.L239:
    cmpb    $0, (%rdi)
    je  .L243
    xorl    %eax, %eax
    ret
    .p2align 4,,10
.L242:
    movl    $1, %eax
    ret

I expected gcc to generate something like this or even SIMD code:

is_eof_block:
    mov $64,%ecx
    xor %eax,%eax
    repz scasq
    setz %al
    ret

How can I rewrite the code such that it is still portable (as in: does not use non-C99 language extensions and works on architectures that do not support misaligned memory access) but compiles to better machine code on common architectures such as amd64 and AArch32?

I wrote the following microbenchmark to demonstrate the time difference. You can define MISALIGNED to a positive integer to test with misaligned buffers.

#include <stdio.h>
#include <stdlib.h>  /* EXIT_SUCCESS, EXIT_FAILURE */
#include <time.h>

#define TESTS 10000000
#ifndef MISALIGNED
# define MISALIGNED 0
#endif

char testarray[512 + MISALIGNED];

extern int is_eof_block(const char[static 512]);

int main(void)
{
    size_t i, j;
    clock_t begin, end;

    fprintf(stderr, "testing %d times\n", TESTS);
    fprintf(stderr, "no byte set to 1... ");
    begin = clock();

    for (i = 0; i < TESTS; i++)
        if (!is_eof_block(testarray + MISALIGNED)) {
            fprintf(stderr, "\nWrong test result in iteration %zu!\n", i);
            return EXIT_FAILURE;
        }

    end = clock();
    fprintf(stderr, "%fs\n", (end - begin) / (double)CLOCKS_PER_SEC);

    fprintf(stderr, "with non-null byte... ");
    begin = clock();

    for (i = j = 0; i < TESTS; i++) {
        testarray[MISALIGNED + j] = '\0';
        j = (j + 47) & 511;
        testarray[MISALIGNED + j] = '1';

        if (is_eof_block(testarray + MISALIGNED)) {
            fprintf(stderr, "\nWrong test result in iteration %zu!\n", i);
            return EXIT_FAILURE;
        }       
    }

    end = clock();
    fprintf(stderr, "%fs\n", (end - begin) / (double)CLOCKS_PER_SEC);

    return EXIT_SUCCESS;
}

is_eof_block_c.c

#include <stddef.h>

int is_eof_block(const char test[static 512])
{
    size_t i;

    for (i = 0; i < 512; i++)
        if (test[i] != '\0')
            return 0;

    return 1;
}

is_eof_block_asm.s

    .text
    .globl is_eof_block
    .type is_eof_block,@function

    .align 16
is_eof_block:
    mov $64,%ecx
    xor %eax,%eax
    repz scasq
    setz %al
    ret
    .size is_eof_block,.-is_eof_block

Here is the output with the C implementation of is_eof_block linked in:

testing 10000000 times
no byte set to 1... 2.281250s
with non-null byte... 1.195312s

And here is the output with the assembly version:

testing 10000000 times
no byte set to 1... 0.476562s
with non-null byte... 0.320312s

Both have been compiled with gcc 5, with -O3 as the sole optimization option. Passing various -march=... flags didn't change the generated code. The difference is about a factor of four. With a misaligned buffer, the assembly implementation is roughly 3% slower, whereas there is no difference with the C implementation.

Recommended answer

Due to the genuinely helpful comments on the question, I have decided to go with the original C code. Thanks to all of you for your help!
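For readers who still want to experiment, here is a minimal sketch of one portable variant, not the approach the asker settled on. It drops the early exit and ORs the block together eight bytes at a time, copying each word with memcpy so that misaligned buffers stay well-defined. The name is_eof_block_words is hypothetical, and uint64_t is assumed to be available (it is optional in C99 but present on amd64 and AArch32). Whether this compiles to better code than the byte loop depends on the compiler and flags, so it should be measured with the microbenchmark from the question rather than assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical alternative to is_eof_block(); not from the original post.
 * Returns 1 if all 512 bytes are NUL, 0 otherwise. */
int is_eof_block_words(const char block[static 512])
{
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < 512; i += sizeof(uint64_t)) {
        uint64_t word;

        /* memcpy avoids a misaligned load through a cast pointer;
         * compilers commonly turn it into a plain load where that is safe. */
        memcpy(&word, block + i, sizeof word);

        /* No early exit: any nonzero byte leaves a nonzero bit in acc. */
        acc |= word;
    }

    return acc == 0;
}
```

Note the trade-off: without the early exit, a block whose first byte is nonzero still costs a full 512-byte scan, which matters if most blocks in the archive are not end-of-archive markers.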
