“switch"比“if"快吗? [英] Is 'switch' faster than 'if'?

查看:36
本文介绍了“switch"比“if"快吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

switch 语句实际上是否比 if 语句快?

我在带有 /Ox 标志的 Visual Studio 2010 的 x64 C++ 编译器上运行以下代码:

#include #include #include #define MAX_COUNT (1 << 29)size_t 计数器 = 0;size_t testSwitch(){时钟_t开始=时钟();size_t i;for (i = 0; i 

并得到以下结果:

<块引用>

切换语句:5261毫秒
If 语句:5196 毫秒

据我所知,switch 语句显然使用跳转表来优化分支.

问题:

  1. 在 x86 或 x64 中,基本跳转表会是什么样子?

  2. 这段代码是否使用了跳转表?

  3. 为什么在这个例子中没有性能差异?是否存在显着性能差异的情况?


反汇编代码:

testIf:13FE81B10 sub rsp,48h13FE81B14 调用 qword ptr [__imp_clock (13FE81128h)]13FE81B1A mov dword ptr [开始],eax13FE81B1E mov qword ptr [i],013FE81B27 jmp testIf+26h (13FE81B36h)13FE81B29 mov rax,qword ptr [i]13FE81B2E 包括 rax13FE81B31 mov qword ptr [i],rax13FE81B36 cmp qword ptr [i],20000000h13FE81B3F jae testIf+0C3h (13FE81BD3h)13FE81B45 异或edx,edx13FE81B47 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81B4E mov ecx,413FE81B53 div rax,rcx13FE81B56 mov rax,rdx13FE81B59 包括 rax13FE81B5C mov qword ptr [c],rax13FE81B61 cmp qword ptr [c],113FE81B67 jne testIf+6Dh (13FE81B7Dh)13FE81B69 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81B70 添加 rax,413FE81B74 mov qword ptr [计数器 (13FE835D0h)],rax13FE81B7B jmp testIf+0BEh (13FE81BCEh)13FE81B7D cmp qword ptr [c],213FE81B83 jne testIf+89h (13FE81B99h)13FE81B85 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81B8C 加 rax,313FE81B90 mov qword ptr [计数器 (13FE835D0h)],rax13FE81B97 jmp testIf+0BEh (13FE81BCEh)13FE81B99 cmp qword ptr [c],313FE81B9F jne testIf+0A5h (13FE81BB5h)13FE81BA1 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81BA8 添加 rax,213FE81BAC mov qword ptr [计数器 (13FE835D0h)],rax13FE81BB3 jmp testIf+0BEh (13FE81BCEh)13FE81BB5 cmp qword ptr [c],413FE81BBB jne testIf+0BEh (13FE81BCEh)13FE81BBD mov rax,qword ptr [计数器 (13FE835D0h)]13FE81BC4 包括 rax13FE81BC7 mov qword ptr [计数器 (13FE835D0h)],rax13FE81BCE jmp testIf+19h (13FE81B29h)13FE81BD3 调用 qword ptr [__imp_clock (13FE81128h)]13FE81BD9 sub eax,dword ptr [开始]13FE81BDD imul eax,eax,3E8h13FE81BE3 cdq13FE81BE4 mov ecx,3E8h13FE81BE9 idiv eax,ecx13FE81BEB cdqe13FE81BED 添加 rsp,48h13FE81BF1 ret


testSwitch:13FE81C00 sub rsp,48h13FE81C04 调用 qword ptr [__imp_clock (13FE81128h)]13FE81C0A mov dword ptr [开始],eax13FE81C0E mov qword ptr [i],013FE81C17 jmp testSwitch+26h (13FE81C26h)13FE81C19 mov rax,qword ptr [i]13FE81C1E 包括 rax13FE81C21 mov qword ptr [i],rax13FE81C26 cmp qword ptr [i],20000000h13FE81C2F jae testSwitch+0C5h (13FE81CC5h)13FE81C35 异或edx,edx13FE81C37 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81C3E mov ecx,413FE81C43 div rax,rcx13FE81C46 mov rax,rdx13FE81C49 inc rax13FE81C4C mov qword ptr [rsp+30h],rax13FE81C51 cmp qword ptr [rsp+30h],113FE81C57 je testSwitch+73h (13FE81C73h)13FE81C59 cmp qword ptr [rsp+30h],213FE81C5F je testSwitch+87h (13FE81C87h)13FE81C61 cmp qword ptr [rsp+30h],313FE81C67 je testSwitch+9Bh (13FE81C9Bh)13FE81C69 cmp qword ptr [rsp+30h],413FE81C6F je testSwitch+0AFh (13FE81CAFh)13FE81C71 jmp testSwitch+0C0h (13FE81CC0h)13FE81C73 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81C7A 加 rax,413FE81C7E mov qword ptr [计数器 (13FE835D0h)],rax13FE81C85 jmp testSwitch+0C0h (13FE81CC0h)13FE81C87 mov rax,qword ptr [计数器 (13FE835D0h)]13FE81C8E 添加 rax,313FE81C92 mov qword ptr [计数器 (13FE835D0h)],rax13FE81C99 jmp testSwitch+0C0h (13FE81CC0h)13FE81C9B mov rax,qword ptr [计数器 (13FE835D0h)]13FE81CA2 加 rax,213FE81CA6 mov qword ptr [计数器 (13FE835D0h)],rax13FE81CAD jmp testSwitch+0C0h (13FE81CC0h)13FE81CAF mov rax,qword ptr [计数器 (13FE835D0h)]13FE81CB6 inc rax13FE81CB9 mov qword ptr [计数器 (13FE835D0h)],rax13FE81CC0 jmp testSwitch+19h (13FE81C19h)13FE81CC5 调用 qword ptr [__imp_clock (13FE81128h)]13FE81CCB sub eax,dword ptr [开始]13FE81CCF imul eax,eax,3E8h13FE81CD5 cdq13FE81CD6 mov ecx,3E8h13FE81CDB idiv eax,ecx13FE81CDD cdqe13FE81CDF 添加 rsp,48h13FE81CE3 ret


更新:

有趣的结果此处.不知道为什么一个更快,一个更慢.

解决方案

编译器可以对开关进行多项优化.我不认为经常提到的跳转表"是一个非常有用的,因为它只在输入可以以某种方式有界时才有效.

跳转表"的 C 伪代码类似于 this -- 注意编译器在实践中需要在表格周围插入某种形式的 if 测试以确保输入在表格中有效.另请注意,它仅适用于输入是连续数字的特定情况.

如果 switch 中的分支数量非常多,编译器可以对 switch 的值使用二分搜索,这(在我看来)将是一个更有用的优化,因为它确实显着增加在某些情况下,性能与 switch 一样普遍,并且不会导致更大的生成代码大小.但是要看到这一点,您的测试代码需要更多分支才能看到任何差异.

回答您的具体问题:

  1. Clang 生成一个看起来像 这个:

    test_switch(char): # @test_switch(char)movl %edi, %eaxcmpl 19 美元,%edijbe .LBB0_1回复.LBB0_1:jmpq *.LJTI0_0(,%rax,8)jmp void call<0u>() # TAILCALLjmp void call<1u>() # TAILCALLjmp void call<2u>() # TAILCALLjmp void call<3u>() # TAILCALLjmp void call<4u>() # TAILCALLjmp void call<5u>() # TAILCALLjmp void call<6u>() # TAILCALLjmp void call<7u>() # TAILCALLjmp void call<8u>() # TAILCALLjmp void call<9u>() # TAILCALLjmp void call<10u>() # TAILCALLjmp void call<11u>() # TAILCALLjmp void call<12u>() # TAILCALLjmp void call<13u>() # TAILCALLjmp void call<14u>() # TAILCALLjmp void call<15u>() # TAILCALLjmp void call<16u>() # TAILCALLjmp void call<17u>() # TAILCALLjmp void call<18u>() # TAILCALLjmp void call<19u>() # TAILCALL.LJTI0_0:.quad .LBB0_2.quad .LBB0_3.quad .LBB0_4.quad .LBB0_5.quad .LBB0_6.quad .LBB0_7.quad .LBB0_8.quad .LBB0_9.quad .LBB0_10.quad .LBB0_11.quad .LBB0_12.quad .LBB0_13.quad .LBB0_14.quad .LBB0_15.quad .LBB0_16.quad .LBB0_17.quad .LBB0_18.quad .LBB0_19.quad .LBB0_20.quad .LBB0_21

  2. 我可以说它没有使用跳转表——4条比较指令清晰可见:

    13FE81C51 cmp qword ptr [rsp+30h],113FE81C57 je testSwitch+73h (13FE81C73h)13FE81C59 cmp qword ptr [rsp+30h],213FE81C5F je testSwitch+87h (13FE81C87h)13FE81C61 cmp qword ptr [rsp+30h],313FE81C67 je testSwitch+9Bh (13FE81C9Bh)13FE81C69 cmp qword ptr [rsp+30h],413FE81C6F je testSwitch+0AFh (13FE81CAFh)

    基于跳转表的解决方案根本不使用比较.

  3. 要么没有足够的分支导致编译器生成跳转表,要么您的编译器根本不生成它们.我不确定是哪个.

EDIT 2014:熟悉 LLVM 优化器的人在其他地方进行了一些讨论,称跳转表优化在许多情况下都很重要;例如在存在具有许多值的枚举以及针对所述枚举中的值的许多情况的情况下.也就是说,我支持我在 2011 年所说的话——我经常看到人们在想如果我改变它,无论我有多少病例,时间都会相同"——这完全是错误的.即使使用跳转表,您也会获得间接跳转成本,并为每种情况支付表中的条目;内存带宽对现代硬件来说是个大问题.

编写代码以提高可读性.任何物有所值的编译器都会看到 if/else if 梯形图,并将其转换为等效的开关,反之亦然,如果这样做会更快.>

Is a switch statement actually faster than an if statement?

I ran the code below on Visual Studio 2010's x64 C++ compiler with the /Ox flag:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...
");
    printf("Switch statement: %u ms
", testSwitch());
    printf("If     statement: %u ms
", testIf());
}

and got these results:

Switch statement: 5261 ms
If statement: 5196 ms

From what I've learned, switch statements apparently use jump tables to optimize the branching.

Questions:

  1. What would a basic jump table look like, in x86 or x64?

  2. Is this code using a jump table?

  3. Why is there no performance difference in this example? Is there any situation in which there is a significant performance difference?


Disassembly of the code:

testIf:

13FE81B10 sub  rsp,48h 
13FE81B14 call qword ptr [__imp_clock (13FE81128h)] 
13FE81B1A mov  dword ptr [start],eax 
13FE81B1E mov  qword ptr [i],0 
13FE81B27 jmp  testIf+26h (13FE81B36h) 
13FE81B29 mov  rax,qword ptr [i] 
13FE81B2E inc  rax  
13FE81B31 mov  qword ptr [i],rax 
13FE81B36 cmp  qword ptr [i],20000000h 
13FE81B3F jae  testIf+0C3h (13FE81BD3h) 
13FE81B45 xor  edx,edx 
13FE81B47 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B4E mov  ecx,4 
13FE81B53 div  rax,rcx 
13FE81B56 mov  rax,rdx 
13FE81B59 inc  rax  
13FE81B5C mov  qword ptr [c],rax 
13FE81B61 cmp  qword ptr [c],1 
13FE81B67 jne  testIf+6Dh (13FE81B7Dh) 
13FE81B69 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B70 add  rax,4 
13FE81B74 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B7B jmp  testIf+0BEh (13FE81BCEh) 
13FE81B7D cmp  qword ptr [c],2 
13FE81B83 jne  testIf+89h (13FE81B99h) 
13FE81B85 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B8C add  rax,3 
13FE81B90 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B97 jmp  testIf+0BEh (13FE81BCEh) 
13FE81B99 cmp  qword ptr [c],3 
13FE81B9F jne  testIf+0A5h (13FE81BB5h) 
13FE81BA1 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BA8 add  rax,2 
13FE81BAC mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BB3 jmp  testIf+0BEh (13FE81BCEh) 
13FE81BB5 cmp  qword ptr [c],4 
13FE81BBB jne  testIf+0BEh (13FE81BCEh) 
13FE81BBD mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BC4 inc  rax  
13FE81BC7 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BCE jmp  testIf+19h (13FE81B29h) 
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)] 
13FE81BD9 sub  eax,dword ptr [start] 
13FE81BDD imul eax,eax,3E8h 
13FE81BE3 cdq       
13FE81BE4 mov  ecx,3E8h 
13FE81BE9 idiv eax,ecx 
13FE81BEB cdqe      
13FE81BED add  rsp,48h 
13FE81BF1 ret       


testSwitch:

13FE81C00 sub  rsp,48h 
13FE81C04 call qword ptr [__imp_clock (13FE81128h)] 
13FE81C0A mov  dword ptr [start],eax 
13FE81C0E mov  qword ptr [i],0 
13FE81C17 jmp  testSwitch+26h (13FE81C26h) 
13FE81C19 mov  rax,qword ptr [i] 
13FE81C1E inc  rax  
13FE81C21 mov  qword ptr [i],rax 
13FE81C26 cmp  qword ptr [i],20000000h 
13FE81C2F jae  testSwitch+0C5h (13FE81CC5h) 
13FE81C35 xor  edx,edx 
13FE81C37 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C3E mov  ecx,4 
13FE81C43 div  rax,rcx 
13FE81C46 mov  rax,rdx 
13FE81C49 inc  rax  
13FE81C4C mov  qword ptr [rsp+30h],rax 
13FE81C51 cmp  qword ptr [rsp+30h],1 
13FE81C57 je   testSwitch+73h (13FE81C73h) 
13FE81C59 cmp  qword ptr [rsp+30h],2 
13FE81C5F je   testSwitch+87h (13FE81C87h) 
13FE81C61 cmp  qword ptr [rsp+30h],3 
13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
13FE81C69 cmp  qword ptr [rsp+30h],4 
13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
13FE81C71 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C73 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C7A add  rax,4 
13FE81C7E mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C85 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C87 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C8E add  rax,3 
13FE81C92 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C99 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C9B mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CA2 add  rax,2 
13FE81CA6 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CAD jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81CAF mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CB6 inc  rax  
13FE81CB9 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CC0 jmp  testSwitch+19h (13FE81C19h) 
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)] 
13FE81CCB sub  eax,dword ptr [start] 
13FE81CCF imul eax,eax,3E8h 
13FE81CD5 cdq       
13FE81CD6 mov  ecx,3E8h 
13FE81CDB idiv eax,ecx 
13FE81CDD cdqe      
13FE81CDF add  rsp,48h 
13FE81CE3 ret       


Update:

Interesting results here. Not sure why one is faster and one is slower, though.

解决方案

There are several optimizations a compiler can make on a switch. I don't think the oft-mentioned "jump-table" is a very useful one though, as it only works when the input can be bounded some way.

C Pseudocode for a "jump table" would be something like this -- note that the compiler in practice would need to insert some form of if test around the table to ensure that the input was valid in the table. Note also that it only works in the specific case that the input is a run of consecutive numbers.

If the number of branches in a switch is extremely large, a compiler can do things like using binary search on the values of the switch, which (in my mind) would be a much more useful optimization, as it does significantly increase performance in some scenarios, is as general as a switch is, and does not result in greater generated code size. But to see that, your test code would need a LOT more branches to see any difference.

To answer your specific questions:

  1. Clang generates one that looks like this:

    test_switch(char):                       # @test_switch(char)
            movl    %edi, %eax
            cmpl    $19, %edi
            jbe     .LBB0_1
            retq
    .LBB0_1:
            jmpq    *.LJTI0_0(,%rax,8)
            jmp     void call<0u>()         # TAILCALL
            jmp     void call<1u>()         # TAILCALL
            jmp     void call<2u>()         # TAILCALL
            jmp     void call<3u>()         # TAILCALL
            jmp     void call<4u>()         # TAILCALL
            jmp     void call<5u>()         # TAILCALL
            jmp     void call<6u>()         # TAILCALL
            jmp     void call<7u>()         # TAILCALL
            jmp     void call<8u>()         # TAILCALL
            jmp     void call<9u>()         # TAILCALL
            jmp     void call<10u>()        # TAILCALL
            jmp     void call<11u>()        # TAILCALL
            jmp     void call<12u>()        # TAILCALL
            jmp     void call<13u>()        # TAILCALL
            jmp     void call<14u>()        # TAILCALL
            jmp     void call<15u>()        # TAILCALL
            jmp     void call<16u>()        # TAILCALL
            jmp     void call<17u>()        # TAILCALL
            jmp     void call<18u>()        # TAILCALL
            jmp     void call<19u>()        # TAILCALL
    .LJTI0_0:
            .quad   .LBB0_2
            .quad   .LBB0_3
            .quad   .LBB0_4
            .quad   .LBB0_5
            .quad   .LBB0_6
            .quad   .LBB0_7
            .quad   .LBB0_8
            .quad   .LBB0_9
            .quad   .LBB0_10
            .quad   .LBB0_11
            .quad   .LBB0_12
            .quad   .LBB0_13
            .quad   .LBB0_14
            .quad   .LBB0_15
            .quad   .LBB0_16
            .quad   .LBB0_17
            .quad   .LBB0_18
            .quad   .LBB0_19
            .quad   .LBB0_20
            .quad   .LBB0_21
    

  2. I can say that it is not using a jump table -- 4 comparison instructions are clearly visible:

    13FE81C51 cmp  qword ptr [rsp+30h],1 
    13FE81C57 je   testSwitch+73h (13FE81C73h) 
    13FE81C59 cmp  qword ptr [rsp+30h],2 
    13FE81C5F je   testSwitch+87h (13FE81C87h) 
    13FE81C61 cmp  qword ptr [rsp+30h],3 
    13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
    13FE81C69 cmp  qword ptr [rsp+30h],4 
    13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
    

    A jump table based solution does not use comparison at all.

  3. Either not enough branches to cause the compiler to generate a jump table, or your compiler simply doesn't generate them. I'm not sure which.

EDIT 2014: There has been some discussion elsewhere from people familiar with the LLVM optimizer saying that the jump table optimization can be important in many scenarios; e.g. in cases where there is an enumeration with many values and many cases against values in said enumeration. That said, I stand by what I said above in 2011 -- too often I see people thinking "if I make it a switch, it'll be the same time no matter how many cases I have" -- and that's completely false. Even with a jump table you get the indirect jump cost and you pay for entries in the table for each case; and memory bandwidth is a Big Deal on modern hardware.

Write code for readability. Any compiler worth its salt is going to see an if / else if ladder and transform it into equivalent switch or vice versa if it would be faster to do so.

这篇关于“switch"比“if"快吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆