为什么我不能在JVM 32和64位上观察到相同的性能提升? [英] Why can’t I observe the same performance improvement on both JVMs 32 and 64bits?

查看:160
本文介绍了为什么我不能在JVM 32和64位上观察到相同的性能提升?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在测试两种不同的方法( primes() primesOpt())来收集第一个N prime使用Java 8 IntStream 的数字。我从 Java 8 in Action 的第6章中采用了这些示例。你可以从这个要点 Primes.java 和这个的 pom.xml 使用Maven和JMH集成来构建它。 (您可以将 pom.xml 复制到项目文件夹,将 Primes.java 复制到 src \ main \ java'\\ _Drimes 并使用以下命令构建它: mvn clean install 。之后,您可以使用以下命令运行基准: java -jar target \benchmarks.jar )。

I was testing two different approaches (primes() and primesOpt()) to collect the first N prime numbers using the Java 8 IntStream. I took these examples from chapter 6 of Java 8 in Action. You can get the source code from this gist Primes.java and this pom.xml to build it with Maven and JMH integration. (you can copy pom.xml to project folder and Primes.java to src\main\java\primes and build it with the command: mvn clean install. After that you can run the benchmark with: java -jar target\benchmarks.jar).

第一个例子( primes() method)是一种简单的算法,用于将N个素数收集到 List< Integer> 中。第二个( primesOpt()方法)是一种增强方法,它只按先前的素数测​​试除法。

The first example (primes() method) is a straightforward algorithm to collect N prime numbers into a List<Integer>. And the second one (primesOpt() method) is an enhanced approach, which only tests divisions by previous prime numbers.

我用JMH测试两个实现来计算素数的 List< Integer> 到最大值10,000:

I test both implementations with JMH to calculate a List<Integer> of prime numbers to the maximum of 10,000:

@Benchmark
public int testPrimes() {
    return primes(10_000).size();
}

@Benchmark    
public int testPrimesOpt() {
    return primesOpt(10_000).size();
}

我根据JVM架构获得了不同的加速比。在JVM 64位中,我发现 primesOpt()比标准版 primes()加速25%,而对于JVM 32位,没有加速。

And I got different speedups depending on the JVM architectures. In JVM 64 bits I observe a speedup of 25% for primesOpt() over the standard version primes(), whereas for JVM 32 bits there is no speedup.

结果JRE 1.8.0_91-b14 64位:

Results for JRE 1.8.0_91-b14 64-Bit:

Benchmark              Mode  Cnt    Score    Error  Units
Primes.testPrimes     thrpt   50  269,278 ± 15,922  ops/s
Primes.testPrimesOpt  thrpt   50  341,861 ± 25,413  ops/s

结果JRE 1.8.0_91-b14 32位:

Results for JRE 1.8.0_91-b14 32-Bit:

Benchmark              Mode  Cnt    Score   Error  Units
Primes.testPrimes     thrpt  200  105,388 ± 2,741  ops/s
Primes.testPrimesOpt  thrpt  200  103,015 ± 2,035  ops/s

这些测试是在具有双核Intel I7 Cpu的机器上进行的,具有超线程,产生2个核心和4个硬件线程。此外,该系统有4GB的RAM。使用的JVM版本是1.8.0_91-b14,在Windows 7上运行。基准测试执行时的最小堆大小为1024MB(对应于 -Xms1024M )。在测量期间,没有其他活动正在运行。

These tests were performed on machine with a dual-core Intel I7 Cpu, with hyperthreading, resulting in 2 cores and 4 hardware threads. Furthermore, the system has 4GB of RAM. The JVM version used was the 1.8.0_91-b14, running on Windows 7. The benchmark was executed with the minimum heap size of 1024MB (corresponding to -Xms1024M). During the measurements there was no other activities running.

您是否知道为什么我不能在优化版本的JVM 32位上观察到相同的性能改进素数算法?

Do you have any idea why can’t I observe the same performance improvement on JVM 32 bits for the optimized version of the primes algorithm?

primes()方法实现:

public static boolean isPrime(int n) {
    int root = (int) Math.sqrt(n);
    return IntStream
        .rangeClosed(2, root)
        .noneMatch(div -> n%div == 0);
}
public static List<Integer> primes(int max) {
    return IntStream
        .range(2, max)
        .filter(Primes::isPrime)
        .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
}

primesOpt()方法实现:

public static boolean isPrimeOpt(List<Integer> primes, int n) {
    int root = (int) Math.sqrt(n);
    return takeWhile(primes, root)
        .stream()
        .noneMatch(div -> n%div == 0);
}

public static List<Integer> takeWhile(List<Integer> src, int max) {
    int i;
    for(i = 0; i < src.size() && src.get(i) <= max; i++) {}
    return src.subList(0, i);
}

public static List<Integer> primesOpt(int max) {
    ArrayList<Integer> res = new ArrayList<>();
    return IntStream
        .range(2, max)
        .filter(n -> Primes.isPrimeOpt(res, n))
        .collect(() -> res, ArrayList::add, (l1, l2) -> {});
}


推荐答案

我无法重现你的结果,但一般来说,性能可能会因环境因素而有很大差异。在您的代码中, takewhile 方法强制处理盒装 <$ em> c>整数值,而非优化变量 isPrime 仅处理 int 值。

I can’t reproduce your results, but generally, performance can differ significantly depending on environmental factors. In your code, the takewhile approach forces processing of boxed Integer values whereas the non-optimized variant isPrime is processing int values only.

这种权衡应该能够支付你要求的更多素数,即如果扫描第一个 10_000 数字显示矛盾的结果,请尝试 100_000 1_000_000 。拳击开销在最坏的情况下是线性的,一个不错的JVM可能会将其转换为亚线性甚至是恒定的开销,而将分割限制为实际质数的改进应该高于线性,因为素数的密度随着数字的增加而下降。

That trade-off should pay off the more primes you request, i.e. if scanning the first 10_000 numbers shows ambivalent results, try 100_000 or 1_000_000. The boxing overhead is linear at worst, a decent JVM might turn it into a sub-linear or even constant overhead, whereas the improvement of limiting the divisions to the actual primes should be higher than linear as the density of primes descends with higher numbers.

所以你使用的64Bit JVM在处理盒装值时可能会有更高的开销,但我认为,使用更高的 max进行测试还将显示优化变体的优势 - 除非JVM知道一个显着降低除法运算成本的技巧。

So perhaps the 64Bit JVM you have used has a higher overhead when processing boxed values, but I assume, that testing with a higher max will also reveal an advantage for your optimized variant—unless that JVM knows a trick to reduce the cost of division operations significantly.

但不应忽视优化变体在几个方面被打破。您正在通过供应商() - > res 收取这违反了合同,因为它没有在每次评估时返回一个新容器,并且收集器和使用的Predicate之间存在干扰在前面的过滤器步骤。

But it shouldn’t be ignored that your optimized variant is broken in several ways. You are passing the supplier () -> res to collect which violates the contract as it doesn’t return a new container on each evaluation and there’s an interference between the Collector and the Predicate used in the preceding filter step.

这表明尝试优化基于Stream的解决方案可能不会导致某处。与以下直接方法比较:

This suggest that trying to optimize your Stream based solution perhaps doesn’t lead to somewhere. Compare with the following straight-forward approach:

public static List<Integer> primesBest(int max) {
    BitSet prime=new BitSet();
    prime.set(1, max>>1);
    for(int i=3; i<max; i+=2)
        if(prime.get((i-1)>>1))
            for(int b=i*3; b<max; b+=i*2) prime.clear((b-1)>>1);
    return IntStream.concat( IntStream.of(2),
        prime.stream().map(i->i+i+1)).boxed().collect(Collectors.toList());
}

避免所有分区和装箱操作处于不使用Stream的劣势值选择的操作,但仅用于创建最终的 List< Integer> 。在我的机器上,它比 10_000 元素的优化变体快10倍, 1_000_000 元素快50倍。这表明大约10%,20%甚至因子2或3的性能差异都不值得讨论。

It avoids all division and boxing operations at the "disadvantage" of not using Stream operations for the value selection, but only for the creation of the final List<Integer>. On my machine, it is about 10 times faster than your optimized variant for 10_000 elements, 50 times faster for 1_000_000 elements. That suggests that performance differences in the order of 10%, 20% or even factor two or three are not worth any discussion.

我看不出这个算法怎么样尽管使用Stream API表达。最重要的是,不是每个操作都受益于Stream API。

I don’t see how this algorithm can be expressed using the Stream API though. The bottom line might be that not every operation benefits from the Stream API.

这篇关于为什么我不能在JVM 32和64位上观察到相同的性能提升?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆