为什么在启用 GCC 优化的情况下,使用 strlen 的代码要慢 6.5 倍? [英] Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?

查看:42
本文介绍了为什么在启用 GCC 优化的情况下,使用 strlen 的代码要慢 6.5 倍?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

出于某种原因,我想对 glibcstrlen 函数进行基准测试,结果发现在 GCC 和我不知道为什么.

这是我的代码:

#include #include #include #include int main() {char *s = calloc(1 << 20, 1);memset(s, 65, 1000000);时钟_t开始=时钟();for (int i = 0; i <128; ++i) {s[strlen(s)] = 'A';}clock_t 结束 = 时钟();printf("%lld
", (long long)(end - start));返回0;}

在我的机器上它输出:

$ gcc test.c &&./a.out13336$ gcc -O1 test.c &&./a.out199004$ gcc -O2 test.c &&./a.out83415$ gcc -O3 test.c &&./a.out83415

不知何故,启用优化会导致它执行更长时间.

解决方案

Godbolt 的编译器浏览器 提供以下解释:

  • -O0或没有优化的情况下,生成的代码调用C库函数strlen;
  • -O1 生成的代码使用一个简单的内联扩展,使用 rep scasb 指令;
  • -O2 及以上,生成的代码使用更精细的内联扩展.

反复对您的代码进行基准测试会显示从一次运行到另一次运行的巨大差异,但增加迭代次数表明:

  • -O1 代码比 C 库实现慢得多:32240 vs 3090
  • -O2 代码比 -O1 快,但仍然比 C 库代码慢很多:8570 vs 3090.

此行为特定于 gcc 和 GNU libc.使用 clang 和 Apple 的 Libc 在 OS/X 上进行的相同测试没有显示出显着差异,这并不奇怪,因为 Godbolt 显示 clang 生成对 C 库的调用 strlen 在所有优化级别.

这可能被认为是 gcc/glibc 中的一个错误,但更广泛的基准测试可能表明调用 strlen 的开销比小字符串的内联代码缺乏性能具有更重要的影响.您的基准测试中的字符串非常大,因此将基准测试集中在超长字符串上可能不会给出有意义的结果.

我改进了这个基准测试并测试了各种字符串长度.从在 3.10GHz 的 Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz 上运行的 gcc (Debian 4.7.2-5) 4.7.2 的 Linux 基准测试看来,-O1<生成的内联代码/code> 总是比较慢,对于中等长度的字符串,慢了 10 倍,而 -O2 只比 libc strlen 用于非常短的字符串,而对于较长的字符串则速度减半.从这些数据来看,strlen 的 GNU C 库版本对于大多数字符串长度都非常有效,至少在我的特定硬件上是这样.还要记住,缓存对基准测试有重大影响.

这是更新的代码:

#include #include #include 无效基准(int重复,int minlen,int maxlen){char *s = malloc(maxlen + 1);memset(s, 'A', minlen);长长字节= 0,调用= 0;clock_t clk = 时钟();for (int n = 0; n < 重复; n++) {for (int i = minlen; i < maxlen; ++i) {字节 += i + 1;调用 += 1;s[i] = '';s[strlen(s)] = 'A';}}时钟 = 时钟() - 时钟;免费;双 avglen = (minlen + maxlen - 1)/2.0;双 ns = (double)clk * 1e9/CLOCKS_PER_SEC;printf("平均长度 %7.0f -> 平均时间:%7.3f ns/byte, %7.3f ns/call
",avglen、ns/字节、ns/调用);}int main() {基准(10000000, 0, 1);基准(1000000, 0, 10);基准(1000000, 5, 15);基准(100000, 0, 100);基准(100000、50、150);基准(10000, 0, 1000);基准(10000、500、1500);基准(1000、0、10000);基准(1000、5000、15000);基准(100, 1000000 - 50, 1000000 + 50);返回0;}

这是输出:

<前>chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out平均长度 0 -> 平均时间:14.000 ns/字节,14.000 ns/调用平均长度 4 -> 平均时间:2.364 ns/字节,13.000 ns/调用平均长度 10 -> 平均时间:1.238 ns/字节,13.000 ns/调用平均长度 50 -> 平均时间:0.317 ns/字节,16.000 ns/调用平均长度 100 -> 平均时间:0.169 ns/字节,17.000 ns/调用平均长度 500 -> 平均时间:0.074 ns/字节,37.000 ns/调用平均长度 1000 -> 平均时间:0.068 ns/字节,68.000 ns/调用平均长度 5000 -> 平均时间:0.064 ns/字节,318.000 ns/调用平均长度 10000 -> 平均时间:0.062 ns/字节,622.000 ns/调用平均长度 1000000 -> 平均时间:0.062 ns/字节,62000.000 ns/调用chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out平均长度 0 -> 平均时间:20.000 ns/字节,20.000 ns/调用平均长度 4 -> 平均时间:3.818 ns/字节,21.000 ns/调用平均长度 10 -> 平均时间:2.190 ns/字节,23.000 ns/调用平均长度 50 -> 平均时间:0.990 ns/字节,50.000 ns/调用平均长度 100 -> 平均时间:0.816 ns/字节,82.000 ns/调用平均长度 500 -> 平均时间:0.679 ns/字节,340.000 ns/调用平均长度 1000 -> 平均时间:0.664 ns/字节,664.000 ns/调用平均长度 5000 -> 平均时间:0.651 ns/字节,3254.000 ns/调用平均长度 10000 -> 平均时间:0.649 ns/字节,6491.000 ns/调用平均长度 1000000 -> 平均时间:0.648 ns/字节,648000.000 ns/调用chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out平均长度 0 -> 平均时间:10.000 ns/字节,10.000 ns/调用平均长度 4 -> 平均时间:2.000 ns/字节,11.000 ns/调用平均长度 10 -> 平均时间:1.048 ns/字节,11.000 ns/调用平均长度 50 -> 平均时间:0.337 ns/字节,17.000 ns/调用平均长度 100 -> 平均时间:0.299 ns/字节,30.000 ns/调用平均长度 500 -> 平均时间:0.202 ns/字节,101.000 ns/调用平均长度 1000 -> 平均时间:0.188 ns/字节,188.000 ns/调用平均长度 5000 -> 平均时间:0.174 ns/字节,868.000 ns/调用平均长度 10000 -> 平均时间:0.172 ns/字节,1716.000 ns/调用平均长度 1000000 -> 平均时间:0.172 ns/字节,172000.000 ns/调用

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.

Here's my code:

#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld
", (long long)(end - start));
    return 0;
}

On my machine it outputs:

$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415

Somehow, enabling optimizations causes it to execute longer.

解决方案

Testing your code on Godbolt's Compiler Explorer provides this explanation:

  • at -O0 or without optimisations, the generated code calls the C library function strlen;
  • at -O1 the generated code uses a simple inline expansion using a rep scasb instruction;
  • at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows substantial variations from one run to another, but increasing the number of iterations shows that:

  • the -O1 code is much slower than the C library implementation: 32240 vs 3090
  • the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

This behavior is specific to gcc and the GNU libc. The same test on OS/X with clang and Apple's Libc does not show significant differences, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.

This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings in your benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.

I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1 is always slower, by as much as a factor of 10 for moderately long strings, while -O2 is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.

Here is the updated code:

#include <stdlib.h>
#include <string.h>
#include <time.h>

void benchmark(int repeat, int minlen, int maxlen) {
    char *s = malloc(maxlen + 1);
    memset(s, 'A', minlen);
    long long bytes = 0, calls = 0;
    clock_t clk = clock();
    for (int n = 0; n < repeat; n++) {
        for (int i = minlen; i < maxlen; ++i) {
            bytes += i + 1;
            calls += 1;
            s[i] = '';
            s[strlen(s)] = 'A';
        }
    }
    clk = clock() - clk;
    free(s);
    double avglen = (minlen + maxlen - 1) / 2.0;
    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call
",
           avglen, ns / bytes, ns / calls);
}

int main() {
    benchmark(10000000, 0, 1);
    benchmark(1000000, 0, 10);
    benchmark(1000000, 5, 15);
    benchmark(100000, 0, 100);
    benchmark(100000, 50, 150);
    benchmark(10000, 0, 1000);
    benchmark(10000, 500, 1500);
    benchmark(1000, 0, 10000);
    benchmark(1000, 5000, 15000);
    benchmark(100, 1000000 - 50, 1000000 + 50);
    return 0;
}

Here is the output:

chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call
average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call
average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call
average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call
average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call
average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call
average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call
average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call
average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call
average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call
average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call
average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call
average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call
average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call
average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call
average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call
average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call
average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call
average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call
average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call
average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call
average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call
average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call
average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call
average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call

这篇关于为什么在启用 GCC 优化的情况下,使用 strlen 的代码要慢 6.5 倍?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆