为什么要使glibc的异常复杂以至于无法快速运行? [英] Why does glibc's strlen need to be so complicated to run quickly?

查看:84
本文介绍了为什么要使glibc的异常复杂以至于无法快速运行?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在查看strlen代码此处,想知道是否真的需要代码中使用的优化吗?例如,为什么下面这样的东西不能同样好或更好?

I was looking through the strlen code here and I was wondering if the optimizations used in the code are really needed? For example, why wouldn't something like the following work equally good or better?

unsigned long strlen(char s[]) {
    unsigned long i;
    for (i = 0; s[i] != '\0'; i++)
        continue;
    return i;
}

对于编译器而言,不是更简单的代码变得更好和/或更容易了吗?

Isn't simpler code better and/or easier for the compiler to optimize?

链接后面页面上的strlen代码如下:

The code of strlen on the page behind the link looks like this:

/* Copyright (C) 1991, 1993, 1997, 2000, 2003 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Written by Torbjorn Granlund (tege@sics.se),
   with help from Dan Sahlin (dan@sics.se);
   commentary by Jim Blandy (jimb@ai.mit.edu).

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

#include <string.h>
#include <stdlib.h>

#undef strlen

/* Return the length of the null-terminated string STR.  Scan for
   the null terminator quickly by testing four bytes at a time.  */
size_t
strlen (str)
     const char *str;
{
  const char *char_ptr;
  const unsigned long int *longword_ptr;
  unsigned long int longword, magic_bits, himagic, lomagic;

  /* Handle the first few characters by reading one character at a time.
     Do this until CHAR_PTR is aligned on a longword boundary.  */
  for (char_ptr = str; ((unsigned long int) char_ptr
            & (sizeof (longword) - 1)) != 0;
       ++char_ptr)
    if (*char_ptr == '\0')
      return char_ptr - str;

  /* All these elucidatory comments refer to 4-byte longwords,
     but the theory applies equally well to 8-byte longwords.  */

  longword_ptr = (unsigned long int *) char_ptr;

  /* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
     the "holes."  Note that there is a hole just to the left of
     each byte, with an extra at the end:

     bits:  01111110 11111110 11111110 11111111
     bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD

     The 1-bits make sure that carries propagate to the next 0-bit.
     The 0-bits provide holes for carries to fall into.  */
  magic_bits = 0x7efefeffL;
  himagic = 0x80808080L;
  lomagic = 0x01010101L;
  if (sizeof (longword) > 4)
    {
      /* 64-bit version of the magic.  */
      /* Do the shift in two steps to avoid a warning if long has 32 bits.  */
      magic_bits = ((0x7efefefeL << 16) << 16) | 0xfefefeffL;
      himagic = ((himagic << 16) << 16) | himagic;
      lomagic = ((lomagic << 16) << 16) | lomagic;
    }
  if (sizeof (longword) > 8)
    abort ();

  /* Instead of the traditional loop which tests each character,
     we will test a longword at a time.  The tricky part is testing
     if *any of the four* bytes in the longword in question are zero.  */
  for (;;)
    {
      /* We tentatively exit the loop if adding MAGIC_BITS to
     LONGWORD fails to change any of the hole bits of LONGWORD.

     1) Is this safe?  Will it catch all the zero bytes?
     Suppose there is a byte with all zeros.  Any carry bits
     propagating from its left will fall into the hole at its
     least significant bit and stop.  Since there will be no
     carry from its most significant bit, the LSB of the
     byte to the left will be unchanged, and the zero will be
     detected.

     2) Is this worthwhile?  Will it ignore everything except
     zero bytes?  Suppose every byte of LONGWORD has a bit set
     somewhere.  There will be a carry into bit 8.  If bit 8
     is set, this will carry into bit 16.  If bit 8 is clear,
     one of bits 9-15 must be set, so there will be a carry
     into bit 16.  Similarly, there will be a carry into bit
     24.  If one of bits 24-30 is set, there will be a carry
     into bit 31, so all of the hole bits will be changed.

     The one misfire occurs when bits 24-30 are clear and bit
     31 is set; in this case, the hole at bit 31 is not
     changed.  If we had access to the processor carry flag,
     we could close this loophole by putting the fourth hole
     at bit 32!

     So it ignores everything except 128's, when they're aligned
     properly.  */

      longword = *longword_ptr++;

      if (
#if 0
      /* Add MAGIC_BITS to LONGWORD.  */
      (((longword + magic_bits)

        /* Set those bits that were unchanged by the addition.  */
        ^ ~longword)

       /* Look at only the hole bits.  If any of the hole bits
          are unchanged, most likely one of the bytes was a
          zero.  */
       & ~magic_bits)
#else
      ((longword - lomagic) & himagic)
#endif
      != 0)
    {
      /* Which of the bytes was the zero?  If none of them were, it was
         a misfire; continue the search.  */

      const char *cp = (const char *) (longword_ptr - 1);

      if (cp[0] == 0)
        return cp - str;
      if (cp[1] == 0)
        return cp - str + 1;
      if (cp[2] == 0)
        return cp - str + 2;
      if (cp[3] == 0)
        return cp - str + 3;
      if (sizeof (longword) > 4)
        {
          if (cp[4] == 0)
        return cp - str + 4;
          if (cp[5] == 0)
        return cp - str + 5;
          if (cp[6] == 0)
        return cp - str + 6;
          if (cp[7] == 0)
        return cp - str + 7;
        }
    }
    }
}
libc_hidden_builtin_def (strlen)

为什么此版本运行速度很快?

Why does this version run quickly?

不是在做很多不必要的工作吗?

Isn't it doing a lot of unnecessary work?

推荐答案

不需要需要,并且绝不应该编写这样的代码-特别是如果您不是C编译器/标准库供应商.它是用于通过一些非常可疑的速度黑客和假设(未经断言测试或注释中未提及)来实现strlen的代码:

You don't need and you should never write code like that - especially if you're not a C compiler / standard library vendor. It is code used to implement strlen with some very questionable speed hacks and assumptions (that are not tested with assertions or mentioned in the comments):

  • unsigned long是4或8个字节
  • 字节为8位
  • 可以将指针强制转换为unsigned long long而不是uintptr_t
  • 只需检查2或3个最低位是否为零,就可以对齐指针
  • 一个人可以以unsigned long s
  • 的形式访问字符串
  • 一个人可以读到数组末尾而没有任何不良影响.
  • unsigned long is either 4 or 8 bytes
  • bytes are 8 bits
  • a pointer can be cast to unsigned long long and not uintptr_t
  • one can align the pointer simply by checking that the 2 or 3 lowest order bits are zero
  • one can access a string as unsigned longs
  • one can read past the end of array without any ill effects.

此外,好的编译器甚至可以替换写为的代码

What is more, a good compiler could even replace code written as

size_t stupid_strlen(const char s[]) {
    size_t i;
    for (i=0; s[i] != '\0'; i++)
        ;
    return i;
}

(注意它必须是与size_t兼容的类型)和内置的编译器strlen的内联版本,或者对代码进行矢量化处理;但是编译器不太可能优化复杂版本.

(notice that it has to be a type compatible with size_t) with an inlined version of the compiler builtin strlen, or vectorize the code; but a compiler would be unlikely to be able to optimize the complex version.

strlen函数由 C11 7.24.6.3 为:

说明

  1. strlen函数计算s指向的字符串的长度.
  1. The strlen function computes the length of the string pointed to by s.

返回

  1. strlen函数返回终止的空字符之前的字符数.
  1. The strlen function returns the number of characters that precede the terminating null character.

现在,如果s指向的字符串位于足够长的字符数组中,足以包含该字符串和终止NUL,则行为将为未定义 >如果我们访问空终止符之后的字符串,例如

Now, if the string pointed to by s was in an array of characters just long enough to contain the string and the terminating NUL, the behaviour will be undefined if we access the string past the null terminator, for example in

char *str = "hello world";  // or
char array[] = "hello world";

因此,在完全可移植的/符合标准的C语言中,唯一方法是正确地实现此 的一种方式,它是在您的问题中编写的方式,除了微不足道的转换之外-您可以通过展开循环等来假装更快,但是仍然需要一次完成一个字节.

So really the only way in fully portable / standards compliant C to implement this correctly is the way it is written in your question, except for trivial transformations - you can pretend to be faster by unrolling the loop etc, but it still needs to be done one byte at a time.

(正如评论者所指出的那样,当严格的可移植性负担太大时,利用合理或已知安全的假设并不总是一件坏事.尤其是在属于 部分的代码中特定的C实现.但是您必须先了解规则,然后才能知道如何/何时弯曲它们.)

(As commenters have pointed out, when strict portability is too much of a burden, taking advantage of reasonable or known-safe assumptions is not always a bad thing. Especially in code that's part of one specific C implementation. But you have to understand the rules before knowing how/when you can bend them.)

链接的strlen实现首先单独检查字节,直到指针指向unsigned long的自然4或8字节对齐边界为止. C标准说,访问未正确对齐的指针具有未定义的行为,因此绝对必须这样做,以使下一个肮脏的技巧变得更加肮脏. (实际上,在x86以外的某些CPU体系结构上,未对齐的字或双字加载将出错.C不是可移植的汇编语言,但是此代码以这种方式使用了它.)这也是在对象保护在对齐的块(例如4kiB虚拟内存页)中工作的实现中没有错误风险的情况下读取对象末尾的可能性.

The linked strlen implementation first checks the bytes individually until the pointer is pointing to the natural 4 or 8 byte alignment boundary of the unsigned long. The C standard says that accessing a pointer that is not properly aligned has undefined behaviour, so this absolutely has to be done for the next dirty trick to be even dirtier. (In practice on some CPU architecture other than x86, a misaligned word or doubleword load will fault. C is not a portable assembly language, but this code is using it that way). It's also what makes it possible to read past the end of an object without risk of faulting on implementations where memory protection works in aligned blocks (e.g. 4kiB virtual memory pages).

现在是最肮脏的部分:代码打破承诺,一次读取4或8个8位字节(a long int),并使用带有无符号加法的技巧来快速找出在这4或8个字节中是否有任何个零字节-它使用特制的数字来导致进位改变由位掩码捕获的位.从本质上讲,这将找出掩码中4或8个字节中的任何一个是否比循环遍历这些字节中的任何一个更快.最后,最后有一个循环,找出哪个字节是第一个零(如果有的话),并返回结果.

Now comes the dirty part: the code breaks the promise and reads 4 or 8 8-bit bytes at a time (a long int), and uses a bit trick with unsigned addition to quickly figure out if there were any zero bytes within those 4 or 8 bytes - it uses a specially crafted number to that would cause the carry bit to change bits that are caught by a bit mask. In essence this would then figure out if any of the 4 or 8 bytes in the mask are zeroes supposedly faster than looping through each of these bytes would. Finally there is a loop at the end to figure out which byte was the first zero, if any, and to return the result.

最大的问题是,在sizeof (unsigned long)情况下,在sizeof (unsigned long) - 1情况下,它将读取字符串的末尾-仅当空字节位于 last 访问的字节中时(即,在小端顺序最重要,而大端顺序最不重要),是否不是越界访问数组!

The biggest problem is that in sizeof (unsigned long) - 1 times out of sizeof (unsigned long) cases it will read past the end of the string - only if the null byte is in the last accessed byte (i.e. in little-endian the most significant, and in big-endian the least significant), does it not access the array out of bounds!

即使用于在C标准库中实现strlen的代码也是 bad 代码.它具有几个实现定义和未定义的方面,因此不应在任何地方使用代替系统提供的strlen-我在此处将函数重命名为the_strlen并添加了以下:

The code, even though used to implement strlen in a C standard library is bad code. It has several implementation-defined and undefined aspects in it and it should not be used anywhere instead of the system-provided strlen - I renamed the function to the_strlen here and added the following main:

int main(void) {
    char buf[12];
    printf("%zu\n", the_strlen(fgets(buf, 12, stdin)));
}

缓冲区的大小经过精心设计,以使其可以精确容纳hello world字符串和终止符.但是,在我的64位处理器上,unsigned long是8个字节,因此对后半部分的访问将超出此缓冲区.

The buffer is carefully sized so that it can hold exactly the hello world string and the terminator. However on my 64-bit processor the unsigned long is 8 bytes, so the access to the latter part would exceed this buffer.

如果现在使用-fsanitize=undefined-fsanitize=address进行编译并运行生成的程序,则会得到:

If I now compile with -fsanitize=undefined and -fsanitize=address and run the resulting program, I get:

% ./a.out
hello world
=================================================================
==8355==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffffe63a3f8 at pc 0x55fbec46ab6c bp 0x7ffffe63a350 sp 0x7ffffe63a340
READ of size 8 at 0x7ffffe63a3f8 thread T0
    #0 0x55fbec46ab6b in the_strlen (.../a.out+0x1b6b)
    #1 0x55fbec46b139 in main (.../a.out+0x2139)
    #2 0x7f4f0848fb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
    #3 0x55fbec46a949 in _start (.../a.out+0x1949)

Address 0x7ffffe63a3f8 is located in stack of thread T0 at offset 40 in frame
    #0 0x55fbec46b07c in main (.../a.out+0x207c)

  This frame has 1 object(s):
    [32, 44) 'buf' <== Memory access at offset 40 partially overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (.../a.out+0x1b6b) in the_strlen
Shadow bytes around the buggy address:
  0x10007fcbf420: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf440: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf450: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf460: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x10007fcbf470: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00[04]
  0x10007fcbf480: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf4a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf4b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fcbf4c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==8355==ABORTING

即坏事发生了.

这篇关于为什么要使glibc的异常复杂以至于无法快速运行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆