Performance difference of signed and unsigned integers of non-native length


Problem description

There is this talk, CppCon 2016: Chandler Carruth "Garbage In, Garbage Out: Arguing about Undefined Behavior...", where Mr. Carruth shows an example from the bzip code. It uses uint32_t i1 as an index. On a 64-bit system the array access block[i1] will then do *(block + i1). The issue is that block is a 64-bit pointer whereas i1 is a 32-bit number. The addition might overflow, and since unsigned integers have defined (wraparound) overflow behavior, the compiler needs to add extra instructions to make sure this wraparound is honored even on a 64-bit system.
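
To illustrate the point, here is a hypothetical sketch of that access pattern (not the actual bzip source): once the 32-bit index is advanced and reused, the compiler must preserve the wrap at 2^32 instead of folding the index into a running 64-bit pointer.

#include <cstdint>

// Hypothetical sketch of the bzip-style access pattern. Because i1 must
// wrap modulo 2^32, the compiler has to re-zero-extend the index for each
// access rather than strength-reducing it into a single 64-bit pointer.
uint8_t sum_two(const uint8_t *block, uint32_t i1) {
    uint8_t a = block[i1];   // *(block + (uint64_t)i1)
    i1 += 1;                 // may wrap from 0xFFFFFFFF to 0
    uint8_t b = block[i1];   // after a wrap this must read block[0], not block + 2^32
    return a + b;
}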

I would like to also show this with a simple example. So I have tried the ++i code with various signed and unsigned integers. The following is my test code:

#include <cstdint>

void test_int8() { int8_t i = 0; ++i; }
void test_uint8() { uint8_t i = 0; ++i; }

void test_int16() { int16_t i = 0; ++i; }
void test_uint16() { uint16_t i = 0; ++i; }

void test_int32() { int32_t i = 0; ++i; }
void test_uint32() { uint32_t i = 0; ++i; }

void test_int64() { int64_t i = 0; ++i; }
void test_uint64() { uint64_t i = 0; ++i; } 

With g++ -c test.cpp and objdump -d test.o I get assembly listings like this:

000000000000004e <_Z10test_int32v>:
  4e:   55                      push   %rbp
  4f:   48 89 e5                mov    %rsp,%rbp
  52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
  5d:   90                      nop
  5e:   5d                      pop    %rbp
  5f:   c3                      retq   

To be honest: My knowledge of x86 assembly is rather limited, so my following conclusions and questions may be very naive.

The first two instructions seem to be just the function prologue, and the last three the return sequence. Removing only these lines, the following kernels are left for the various data types:

  • int8_t:

       4:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
       8:   0f b6 45 ff             movzbl -0x1(%rbp),%eax
       c:   83 c0 01                add    $0x1,%eax
       f:   88 45 ff                mov    %al,-0x1(%rbp)
    

  • uint8_t:

      19:   c6 45 ff 00             movb   $0x0,-0x1(%rbp)
      1d:   80 45 ff 01             addb   $0x1,-0x1(%rbp)
    

  • int16_t:

      28:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
      2e:   0f b7 45 fe             movzwl -0x2(%rbp),%eax
      32:   83 c0 01                add    $0x1,%eax
      35:   66 89 45 fe             mov    %ax,-0x2(%rbp)
    

  • uint16_t:

      40:   66 c7 45 fe 00 00       movw   $0x0,-0x2(%rbp)
      46:   66 83 45 fe 01          addw   $0x1,-0x2(%rbp)
    

  • int32_t:

      52:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
      59:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
    

  • uint32_t:

      64:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
      6b:   83 45 fc 01             addl   $0x1,-0x4(%rbp)
    

  • int64_t:

      76:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
      7d:   00 
      7e:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
    

  • uint64_t:

      8a:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
      91:   00 
      92:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
    

Comparing the signed with the unsigned versions, I would have expected, based on Mr. Carruth's talk, that extra masking instructions would be generated.

But for int8_t we store a zero byte (movb) to the stack slot at -0x1(%rbp), then load it zero-extended (movzbl) into the accumulator %eax. The addition (add) is performed without any size suffix because the overflow is not defined anyway. The unsigned version directly uses a byte-sized instruction (addb).

Either add in all its width variants (addb/addw/addl/addq) takes the same number of cycles (latency) because the Intel Sandy Bridge CPU has hardware adders for all sizes, or the 32-bit unit does the masking internally and therefore has a longer latency.

I have looked for a table with latencies and found the one at agner.org. There, for each CPU (Sandy Bridge in this case), there is only one entry for ADD, and I do not see separate entries for the other width variants. The Intel 64 and IA-32 Architectures Optimization Reference Manual also seems to list only a single add instruction.
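
Since the tables do not break add down by width, one rough way to probe this is a micro-benchmark: a chain of dependent increments approximates the add latency. This is only a sketch (it assumes GCC or Clang on x86-64; tools like Agner Fog's test scripts or llvm-mca measure latency far more rigorously):

#include <chrono>
#include <cstdint>
#include <cstdio>

// Each iteration's add depends on the previous one, so the loop time is
// dominated by the add latency. The empty asm statement keeps x live in
// a register so the optimizer cannot fold the loop into one addition.
template <typename T>
double ns_per_add(uint64_t iters) {
    T x = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iters; ++i) {
        x = T(x + 1);                  // dependent increment
        asm volatile("" : "+r"(x));    // GCC/Clang barrier, emits no code
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    const uint64_t n = 100000000;
    std::printf("int8_t  : %.3f ns/add\n", ns_per_add<int8_t>(n));
    std::printf("uint8_t : %.3f ns/add\n", ns_per_add<uint8_t>(n));
    std::printf("int32_t : %.3f ns/add\n", ns_per_add<int32_t>(n));
    std::printf("uint32_t: %.3f ns/add\n", ns_per_add<uint32_t>(n));
    return 0;
}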

Does this mean that on x86 the ++i of non-native length integers is actually faster for unsigned types because fewer instructions are needed?

Solution

There are two parts to this question: Chandler's point about optimizations based on overflow being undefined, and the differences you found in the assembly output.

Chandler's point is that if overflow is undefined behavior, then the compiler can assume that it cannot happen. Consider the following code:

typedef int T;
void CopyInts(int *dest, const int *src) {
    T x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}

Here, the compiler can safely change the for loop to the following:

    while (*src) {
        *dest++ = *src++;
    }

That's because the compiler does not have to worry about the case that x overflows. If the compiler had to worry about x wrapping from 2^32 - 1 back to 0, the source and destination addresses would suddenly jump back by roughly 16 GB (4 bytes times 2^32), so the simple transformation above would not work.

At the assembly level, the above is (with GCC 7.3.0 for x86-64, -O2):

_Z8CopyIntsPiPKii:
  movl (%rsi), %edx
  testl %edx, %edx
  je .L1
  xorl %eax, %eax
.L3:
  movl %edx, (%rdi,%rax)
  addq $4, %rax
  movl (%rsi,%rax), %edx
  testl %edx, %edx
  jne .L3
.L1:
  rep ret

If we change T to be unsigned int, we get this slower code instead:

_Z8CopyIntsPiPKij:
  movl (%rsi), %eax
  testl %eax, %eax
  je .L1
  xorl %edx, %edx
  xorl %ecx, %ecx
.L3:
  movl %eax, (%rdi,%rcx)
  leal 1(%rdx), %eax
  movq %rax, %rdx
  leaq 0(,%rax,4), %rcx
  movl (%rsi,%rax,4), %eax
  testl %eax, %eax
  jne .L3
.L1:
  rep ret

Here, the compiler is keeping x as a separate variable so that overflow is handled properly.
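
To make the 16 GB figure concrete, here is a small standalone sketch (the base address is made up, and the address computation is modeled as plain 64-bit integer math so that no invalid pointers are ever formed):

#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    // Model &src[x] for 4-byte ints as plain 64-bit integer arithmetic.
    uint64_t src = 0x100000000ULL;                   // hypothetical base address
    unsigned int x = UINT_MAX;                       // index just before the wrap
    uint64_t before = src + uint64_t(x) * sizeof(int);
    ++x;                                             // defined unsigned wrap: x == 0
    uint64_t after = src + uint64_t(x) * sizeof(int);
    long long delta = (long long)after - (long long)before;
    std::printf("delta = %lld bytes\n", delta);      // -17179869180, almost -16 GiB
    return 0;
}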

Instead of relying on signed overflow being undefined for performance, you can use an index type that is the same size as a pointer. Such a variable could only overflow at the same point as a pointer, which is also undefined. Hence, at least for x86-64, size_t would also work as T to get the better performance.
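
A minimal sketch of that suggestion, simply substituting size_t for T in the function above:

#include <cstddef>

// With a pointer-sized index, the index can only wrap together with the
// pointer arithmetic itself (which is undefined anyway), so the compiler
// is again free to strength-reduce the loop as in the signed-int version.
void CopyInts(int *dest, const int *src) {
    size_t x = 0;
    for (; src[x]; ++x) {
        dest[x] = src[x];
    }
}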

Now for the second part of your question: the add instruction. The suffixes on the add instruction come from the so-called "AT&T" style of x86 assembly language. In AT&T syntax, the operand order is reversed relative to the way Intel writes instructions, and instruction sizes are disambiguated by adding a suffix to the mnemonic instead of an annotation like dword ptr.

Example:

Intel: add dword ptr [eax], 1

AT&T: addl $1, (%eax)

These are the same instruction, just written differently. The l takes the place of dword ptr.

In the case where the suffix is missing from AT&T instructions, it's because it's not required: the size is implicit from the operands.

add $1, %eax

The l suffix is unnecessary because the instruction is obviously 32-bit: eax is a 32-bit register.

In short, it has nothing to do with overflow. Overflow is always defined at the processor level. On some architectures, such as when using the non-u instructions on MIPS, overflow raises an exception, but it is still defined. C and C++ are the only major languages that make overflow unpredictable behavior.
