强制使用tableswitch而不是lookupswitch [英] Force tableswitch instead of lookupswitch
问题描述
Scala 2.11将相对密集的 Int
范围内的 match
表达式编译为 lookupswitch
:
lookupswitch {//21-12:200-11:200-10:184-9:190-8:190-7:190-6:190-5:190-4:200-1:2002:1953:1954:1955:1956:1847:18412:18413:18418:18421:18425:184默认值:180}
Java 7将等效的 switch
语句编译为 tableswitch
:
tableswitch {//-12至25-12:168-11:168-10:177-9:174-8:174-7:174-6:174-5:174-4:168-3:185-2:185-1:1680:1851:1852:1713:1714:1715:1716:1777:1778:1859:18510:18511:18512:18113:18114:18515:18516:18517:18518:18119:18520:18521:18122:18523:18524:18525:181默认值:185}
是否有某种方法也可以强制Scala生成 tableswitch
?
您不必在意字节码,因为现代JVM足够聪明,可以同时编译 lookupswitch
和 tableswitch
以类似的有效方式.
直观地 tableswitch
应该更快,这也由以下建议:
组装
让我们通过查看生成的汇编代码来验证理论.
以下JVM选项将有所帮助: -XX:CompileCommand = print,bench.Switch :: *
#{方法} {0x0000000017498a48}'bench/Switch'中的'lookupSwitch''(I)I'#parm0:rdx =整数#[sp + 0x20](呼叫方的sp)0x000000000329b240:低于$ 0x18,%rsp0x000000000329b247:mov%rbp,0x10(%rsp); *同步条目;-Bench.Switch :: lookupSwitch @ -10x000000000329b24c:cmp $ 0x4,%edx0x000000000329b24f:je 0x000000000329b2a50x000000000329b251:cmp $ 0x4,%edx0x000000000329b254:jg 0x000000000329b2810x000000000329b256:cmp $ 0x2,%edx0x000000000329b259:je 0x000000000329b27a0x000000000329b25b:cmp $ 0x2,%edx0x000000000329b25e:jg 0x000000000329b2730x000000000329b260:cmp $ 0x1,%edx0x000000000329b263:jne 0x000000000329b26c; * lookupswitch;-Bench.Switch :: lookupSwitch @ 1...
我们在这里看到的是从中间值4开始的二进制搜索(这解释了为什么案例4在上图中具有最佳性能).
但是有趣的是, tableSwitch
的编译方式完全相同!
#{方法} {0x0000000017528b18}'bench/Switch'中的'tableSwitch''(I)I'#parm0:rdx =整数#[sp + 0x20](呼叫方的sp)0x000000000332c280:sub $ 0x18,%rsp0x000000000332c287:mov%rbp,0x10(%rsp); *同步条目;-bench.Switch :: tableSwitch @ -10x000000000332c28c:cmp $ 0x4,%edx0x000000000332c28f:je 0x000000000332c2e50x000000000332c291:cmp $ 0x4,%edx0x000000000332c294:jg 0x000000000332c2c10x000000000332c296:cmp $ 0x2,%edx0x000000000332c299:je 0x000000000332c2ba0x000000000332c29b:cmp $ 0x2,%edx0x000000000332c29e:jg 0x000000000332c2b30x000000000332c2a0:cmp $ 0x1,%edx0x000000000332c2a3:jne 0x000000000332c2ac; *表开关;-bench.Switch :: tableSwitch @ 1...
跳转表
但是等等……为什么要二进制搜索,而不是跳转表?
HotSpot JVM具有启发式功能,只能为10个以上案例的交换机生成跳转表.可以通过选项 -XX:MinJumpTableSize =
进行更改.
好的,让我们用更多标签扩展测试用例,然后再次检查生成的代码.
#{方法} {0x0000000017288a68}'bench/Switch'中的'lookupSwitch''(I)I'#parm0:rdx =整数#[sp + 0x20](呼叫方的sp)0x000000000307ecc0:子$ 0x18,%rsp;{no_reloc}0x000000000307ecc7:mov%rbp,0x10(%rsp); *同步条目;-Bench.Switch :: lookupSwitch @ -10x000000000307eccc:mov%edx,%r10d0x000000000307eccf:dec%r10d0x000000000307ecd2:cmp $ 0xa,%r10d0x000000000307ecd6:jb 0x000000000307ece90x000000000307ecd8:移动$ 0xffffffff,%eax0x000000000307ecdd:加$ 0x10,%rsp0x000000000307ece1:弹出%rbp0x000000000307ece2:测试%eax,-0x1faece8(%rip)#0x00000000010d0000;{poll_return}0x000000000307ece8:retq0x000000000307ece9:movslq%edx,%r100x000000000307ecec:movabs $ 0x307ec60,%r11;{section_word}0x000000000307ecf6:jmpq * -0x8(%r11,%r10,8); * lookupswitch;-Bench.Switch :: lookupSwitch @ 1^^^^^^^^^^^^^^^^^^^^^^^^^^
是的!这是我们计算出的跳转指令.请注意,这是为 lookupswitch
生成的.但是 tableswitch
将完全相同.
令人惊讶的是,HotSpot JVM甚至可以为带有间隙和异常值的交换机生成跳转表. -XX:MaxJumpTableSparseness
控制差距有多大.例如.如果标签的范围是1到10,则标签的范围是13到20,最后一个标签的值为99-JIT将为值99生成保护测试,其余标签将创建一个表.
源代码
HotSpot源代码将最终说服 tableswitch
用C2进行JIT编译后.这基本上是因为两个指令的解析都以对同一 解决方案
You should not care about the bytecode, since modern JVMs are smart enough to compile both lookupswitch
and tableswitch
in a similarly efficient way.
Intuitively tableswitch
should be faster, and this is also suggested by
JVM specification:
Thus, a tableswitch instruction is probably more efficient than a lookupswitch where space considerations permit a choice.
However, the spec was written 20+ years ago, when JVM had no JIT compiler. Is there a performance difference in a modern HotSpot JVM?
The benchmark
package bench;
import org.openjdk.jmh.annotations.*;
@State(Scope.Benchmark)
public class SwitchBench {
@Param({"1", "2", "3", "4", "5", "6", "7", "8"})
int n;
@Benchmark
public long lookupSwitch() {
return Switch.lookupSwitch(n);
}
@Benchmark
public long tableSwitch() {
return Switch.tableSwitch(n);
}
}
To have precise control over the bytecode, I build Switch
class with Jasmin.
.class public bench/Switch
.super java/lang/Object
.method public static lookupSwitch(I)I
.limit stack 1
iload_0
lookupswitch
1 : One
2 : Two
3 : Three
4 : Four
5 : Five
6 : Six
7 : Seven
default : Other
One:
bipush 11
ireturn
Two:
bipush 22
ireturn
Three:
bipush 33
ireturn
Four:
bipush 44
ireturn
Five:
bipush 55
ireturn
Six:
bipush 66
ireturn
Seven:
bipush 77
ireturn
Other:
bipush -1
ireturn
.end method
.method public static tableSwitch(I)I
.limit stack 1
iload_0
tableswitch 1
One
Two
Three
Four
Five
Six
Seven
default : Other
One:
bipush 11
ireturn
Two:
bipush 22
ireturn
Three:
bipush 33
ireturn
Four:
bipush 44
ireturn
Five:
bipush 55
ireturn
Six:
bipush 66
ireturn
Seven:
bipush 77
ireturn
Other:
bipush -1
ireturn
.end method
The results show no performance difference between lookupswitch/tableswitch benchmarks, but there is a small variation depending on switch argument:
Assembly
Let's verify the theory by looking at generated assembly code.
The following JVM option will help: -XX:CompileCommand=print,bench.Switch::*
# {method} {0x0000000017498a48} 'lookupSwitch' '(I)I' in 'bench/Switch'
# parm0: rdx = int
# [sp+0x20] (sp of caller)
0x000000000329b240: sub $0x18,%rsp
0x000000000329b247: mov %rbp,0x10(%rsp) ;*synchronization entry
; - bench.Switch::lookupSwitch@-1
0x000000000329b24c: cmp $0x4,%edx
0x000000000329b24f: je 0x000000000329b2a5
0x000000000329b251: cmp $0x4,%edx
0x000000000329b254: jg 0x000000000329b281
0x000000000329b256: cmp $0x2,%edx
0x000000000329b259: je 0x000000000329b27a
0x000000000329b25b: cmp $0x2,%edx
0x000000000329b25e: jg 0x000000000329b273
0x000000000329b260: cmp $0x1,%edx
0x000000000329b263: jne 0x000000000329b26c ;*lookupswitch
; - bench.Switch::lookupSwitch@1
...
What we see here is a binary search starting with a mid value 4 (this explains why case 4 has the best performance in the graph above).
But the interesting thing is that tableSwitch
is compiled exactly the same way!
# {method} {0x0000000017528b18} 'tableSwitch' '(I)I' in 'bench/Switch'
# parm0: rdx = int
# [sp+0x20] (sp of caller)
0x000000000332c280: sub $0x18,%rsp
0x000000000332c287: mov %rbp,0x10(%rsp) ;*synchronization entry
; - bench.Switch::tableSwitch@-1
0x000000000332c28c: cmp $0x4,%edx
0x000000000332c28f: je 0x000000000332c2e5
0x000000000332c291: cmp $0x4,%edx
0x000000000332c294: jg 0x000000000332c2c1
0x000000000332c296: cmp $0x2,%edx
0x000000000332c299: je 0x000000000332c2ba
0x000000000332c29b: cmp $0x2,%edx
0x000000000332c29e: jg 0x000000000332c2b3
0x000000000332c2a0: cmp $0x1,%edx
0x000000000332c2a3: jne 0x000000000332c2ac ;*tableswitch
; - bench.Switch::tableSwitch@1
...
Jump table
But wait... Why binary search, not a jump table?
HotSpot JVM has an heuristics to generate a jump table only for switches with 10+ cases. This can be altered by the option -XX:MinJumpTableSize=
.
OK, let's extend our test case with some more labels and check the generated code once again.
# {method} {0x0000000017288a68} 'lookupSwitch' '(I)I' in 'bench/Switch'
# parm0: rdx = int
# [sp+0x20] (sp of caller)
0x000000000307ecc0: sub $0x18,%rsp ; {no_reloc}
0x000000000307ecc7: mov %rbp,0x10(%rsp) ;*synchronization entry
; - bench.Switch::lookupSwitch@-1
0x000000000307eccc: mov %edx,%r10d
0x000000000307eccf: dec %r10d
0x000000000307ecd2: cmp $0xa,%r10d
0x000000000307ecd6: jb 0x000000000307ece9
0x000000000307ecd8: mov $0xffffffff,%eax
0x000000000307ecdd: add $0x10,%rsp
0x000000000307ece1: pop %rbp
0x000000000307ece2: test %eax,-0x1faece8(%rip) # 0x00000000010d0000
; {poll_return}
0x000000000307ece8: retq
0x000000000307ece9: movslq %edx,%r10
0x000000000307ecec: movabs $0x307ec60,%r11 ; {section_word}
0x000000000307ecf6: jmpq *-0x8(%r11,%r10,8) ;*lookupswitch
; - bench.Switch::lookupSwitch@1
^^^^^^^^^^^^^^^^^^^^^^^^^
Yes! Here is our computed jump instruction. Note, this is generated for lookupswitch
. But there will be exactly the same one for tableswitch
.
Amazingly, HotSpot JVM can generate jump tables even for switches with gaps and with outliers. -XX:MaxJumpTableSparseness
controls how large the gaps can be. E.g. if there are labels from 1 to 10, then from 13 to 20 and the last label with the value 99 - JIT will generate a guard test for the value 99, and for the rest labels it will create a table.
Source code
HotSpot source code will finally convince there should be no performance difference between lookupswitch
and tableswitch
after a method is JIT-compiled with C2. That's basically because the parsing of both instructions ends up with a call to the same jump_switch_ranges
function that works for an arbitrary set of labels.
Conclusion
As we saw, HotSpot JVM may compile tableswitch
using a binary search and lookupswitch
using a jump table, or vice versa. This depends on the number and the density of labels, but not on the bytecode itself.
So, answering your original question - you don't need to!
这篇关于强制使用tableswitch而不是lookupswitch的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!