CPU的div指令和HotSpot的JIT代码之间存在较大的性能差距 [英] Large performance gap between CPU's div instruction and HotSpot's JIT code

查看:86
本文介绍了CPU的div指令和HotSpot的JIT代码之间存在较大的性能差距的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

自CPU问世以来,众所周知整数除法指令非常昂贵.我去看看今天,在拥有数十亿个晶体管的CPU上,情况有多糟.我发现对于常数除数,硬件idiv指令的性能仍然比不包含idiv指令的JIT编译器能够发出的代码差得多.

Since the beginning of CPUs it has been general knowledge that the integer division instruction is expensive. I went to see how bad it is today, on CPUs which have the luxury of billions of transistors. I found that the hardware idiv instruction still performs significantly worse for constant divisors than the code the JIT compiler is able to emit, which doesn't contain the idiv instruction.

为了在专用的微基准测试中实现这一点,我编写了以下内容:

To bring this out in a dedicated microbenchmark I've written the following:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@OperationsPerInvocation(MeasureDiv.ARRAY_SIZE)
@Warmup(iterations = 8, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class MeasureDiv
{
  public static final int ARRAY_SIZE = 128;
  public static final long DIVIDEND_BASE = 239520948509234807L;
  static final int DIVISOR = 10;
  final long[] input = new long[ARRAY_SIZE];

  @Setup(Level.Iteration) public void setup() {
    for (int i = 0; i < input.length; i++) {
      input[i] = DIVISOR;
    }
  }

  @Benchmark public long divVar() {
    long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
      final long in = input[i];
      final long dividend = DIVIDEND_BASE + i;
      final long divisor = in;
      final long quotient = dividend / divisor;
      sum += quotient;
    }
    return sum;
  }

  @Benchmark public long divConst() {
    long sum = 0;
    for (int i = 0; i < ARRAY_SIZE; i++) {
      final long in = input[i];
      final long dividend = DIVIDEND_BASE + in;
      final int divisor = DIVISOR;
      final long quotient = dividend / divisor;
      sum += quotient;
    }
    return sum;
  }
}

简而言之,我有两种方法在各个方面都相同,不同的是,一个方法(divVar)执行从数组中读出的数字的除法,而另一个方法由编译时常数进行的除法.这些是结果:

In a nutshell, I have two methods identical in every respect except that one (divVar) performs division by a number read off an array while the other divides by a compile-time constant. These are the results:

Benchmark            Mode  Cnt  Score   Error  Units
MeasureDiv.divConst  avgt    5  1.228 ± 0.032  ns/op
MeasureDiv.divVar    avgt    5  8.913 ± 0.192  ns/op

性能比非常出色.我的期望是,现代的英特尔处理器具有足够的空间,并且其工程师也有足够的兴趣,可以在硬件中实现复杂但高效的除法算法.然而,JIT编译器通过向其发送执行相同任务的其他一些指令流而击败了英特尔,速度仅快7倍.如果有的话,专用微码应该比JIT通过汇编指令的公共API能够更好地利用CPU.

The performance ratio is quite extraordinary. My expectation would be that a modern Intel processor has enough real estate, and its engineers have enough interest, to implement a complex but performant division algorithm in hardware. Yet the JIT compiler beats Intel by sending it a stream of some other instructions which perform the same job, just seven times faster. If anything, dedicated microcode should be able to utilize the CPU even better than what JIT can do through the public API of assembly instructions.

idiv为何仍然慢很多,基本限制是什么?

How come idiv is still so much slower, what is the fundamental limitation?

我想到的一个解释是,假设存在除法算法,该算法在处理过程中很晚才第一次涉及股息.然后,JIT编译器将具有领先优势,因为它会在编译时评估仅涉及除数的第一部分,并仅将算法的第二部分作为运行时代码发出.这个假设是真的吗?

One explanation that comes to mind is the hypothetical existence of a division algorithm which involves the dividend for the first time very late into the process. The JIT compiler would then have a head start because it would evaluate the first part which involves only the divisor at compile time and emit just the second part of the algorithm as runtime code. Is that hypothesis true?

推荐答案

正如用户 pvg 通过评论所解释的那样,假设的算法确实存在,并且是目前已知的最佳算法.该算法涉及在准备步骤中由同一除数进行除法,因此从整体上讲它是不可约的.经典出版物 Hacker's Delight 的第10章对此进行了介绍.

As explained by user pvg through the comments, the hypothesized algorithm is indeed in existence and the best one currently known. The algorithm involves division by the same divisor in the preparatory step, so it is fundamentally irreducible as a whole. It is covered in Chapter 10 of the classic publication Hacker's Delight.

这篇关于CPU的div指令和HotSpot的JIT代码之间存在较大的性能差距的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆