System.arraycopy with constant length

Question

I'm playing around with JMH ( http://openjdk.java.net/projects/code-tools/jmh/ ) and I just stumbled on a strange result.

I'm benchmarking ways to make a shallow copy of an array, and I observe the expected results: looping through the array is a bad idea, and there is no significant performance difference between #clone(), System#arraycopy() and Arrays#copyOf().

Except that System#arraycopy() is about one-quarter slower when the array's length is hard-coded... Wait, what? How can this be slower?

Does anyone have an idea of what could be the cause?

Results (throughput):

# JMH 1.11 (released 17 days ago)
# VM version: JDK 1.8.0_05, VM 25.5-b02
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF-8 -Duser.country=FR -Duser.language=fr -Duser.variant
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

Benchmark                                            Mode  Cnt         Score         Error  Units
ArrayCopyBenchmark.ArraysCopyOf                     thrpt   20  67100500,319 ±  455252,537  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class               thrpt   20  65246374,290 ±  976481,330  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class_ConstantSize  thrpt   20  65068143,162 ± 1597390,531  ops/s
ArrayCopyBenchmark.ArraysCopyOf_ConstantSize        thrpt   20  64463603,462 ±  953946,811  ops/s
ArrayCopyBenchmark.Clone                            thrpt   20  64837239,393 ±  834353,404  ops/s
ArrayCopyBenchmark.Loop                             thrpt   20  21070422,097 ±  112595,764  ops/s
ArrayCopyBenchmark.Loop_ConstantSize                thrpt   20  24458867,274 ±  181486,291  ops/s
ArrayCopyBenchmark.SystemArrayCopy                  thrpt   20  66688368,490 ±  582416,954  ops/s
ArrayCopyBenchmark.SystemArrayCopy_ConstantSize     thrpt   20  48992312,357 ±  298807,039  ops/s

The benchmark class:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ArrayCopyBenchmark {

    private static final int LENGTH = 32;

    private Object[] array;

    @Setup
    public void before() {
        array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = new Object();
        }
    }

    @Benchmark
    public Object[] Clone() {
        Object[] src = this.array;
        return src.clone();
    }

    @Benchmark
    public Object[] ArraysCopyOf() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public Object[] ArraysCopyOf_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length, Object[].class);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH, Object[].class);
    }

    @Benchmark
    public Object[] SystemArrayCopy() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        System.arraycopy(src, 0, array, 0, length);
        return array;
    }

    @Benchmark
    public Object[] SystemArrayCopy_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        System.arraycopy(src, 0, array, 0, LENGTH);
        return array;
    }

    @Benchmark
    public Object[] Loop() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        for (int i = 0; i < length; i++) {
            array[i] = src[i];
        }
        return array;
    }

    @Benchmark
    public Object[] Loop_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = src[i];
        }
        return array;
    }
}


Recommended answer

As usual, these kinds of questions are quickly answered by studying the generated code. JMH provides -prof perfasm on Linux and -prof xperfasm on Windows. If you run the benchmark on JDK 8u40, you will see the following (note that I used -bm avgt -tu ns to make the scores easier to read):

Benchmark                         Mode  Cnt   Score   Error  Units
ACB.SystemArrayCopy               avgt   25  13.294 ± 0.052  ns/op
ACB.SystemArrayCopy_ConstantSize  avgt   25  16.413 ± 0.080  ns/op
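
To put these two scores next to the throughput numbers from the question, the arithmetic can be sketched as below. This is only an illustration: the class and method names are made up for this example, and the inputs are copied from the two result tables, which come from different JDK builds and are therefore not expected to agree exactly.

```java
public class SlowdownArithmetic {

    // Relative extra time per operation, given two average times in ns/op.
    static double extraTimeFraction(double baselineNs, double otherNs) {
        return otherNs / baselineNs - 1.0;
    }

    // Relative throughput loss, given two throughputs in ops/s.
    static double throughputLossFraction(double baselineOps, double otherOps) {
        return 1.0 - otherOps / baselineOps;
    }

    // Converts an average time in ns/op into a throughput in ops/s.
    static double nsPerOpToOpsPerSecond(double nsPerOp) {
        return 1_000_000_000.0 / nsPerOp;
    }

    public static void main(String[] args) {
        // Scores copied from the JDK 8u40 avgt table in the answer.
        double variableNs = 13.294;
        double constantNs = 16.413;

        // Scores copied from the throughput table in the question (JDK 1.8.0_05).
        double variableOps = 66_688_368.490;
        double constantOps = 48_992_312.357;

        System.out.printf("extra time per op (8u40):  %.1f%%%n",
                100 * extraTimeFraction(variableNs, constantNs));
        System.out.printf("throughput loss (8u05):    %.1f%%%n",
                100 * throughputLossFraction(variableOps, constantOps));
        System.out.printf("8u40 variable-length rate: %.0f ops/s%n",
                nsPerOpToOpsPerSecond(variableNs));
    }
}
```

Both views tell the same story: the constant-length variant takes a bit under a quarter more time per call on 8u40, and loses a bit over a quarter of its throughput in the question's run.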

Why do these benchmarks perform differently? Let's first use -prof perfnorm to dissect them (I dropped the lines that do not matter):

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   25   13.466 ±  0.070  ns/op
ACB.SAC:·CPI                                  avgt    5    0.602 ±  0.025   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.346 ±  0.239   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   24.756 ±  1.438   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.404 ±  0.129   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   14.929 ±  0.230   #/op
ACB.SAC:·LLC-loads                            avgt    5    2.151 ±  0.217   #/op
ACB.SAC:·branches                             avgt    5   17.795 ±  1.003   #/op
ACB.SAC:·cycles                               avgt    5   56.677 ±  3.187   #/op
ACB.SAC:·instructions                         avgt    5   94.145 ±  6.442   #/op

ACB.SAC_ConstantSize                          avgt   25   16.447 ±  0.084  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.637 ±  0.016   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.357 ±  0.206   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   25.611 ±  1.482   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.368 ±  0.123   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   25.593 ±  1.610   #/op
ACB.SAC_ConstantSize:·LLC-loads               avgt    5    1.050 ±  0.038   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   17.853 ±  0.697   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   66.680 ±  2.049   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5  104.759 ±  4.831   #/op

So ConstantSize somehow does more L1-dcache-stores, but one fewer LLC-load. That is what we are looking for: more stores in the constant case. -prof perfasm conveniently highlights the hot parts of the assembly:

default

  4.32%    6.36%   0x00007f7714bda2dc: movq   $0x1,(%rax)            ; alloc
  0.09%    0.04%   0x00007f7714bda2e3: prefetchnta 0x100(%r9)
  2.95%    1.48%   0x00007f7714bda2eb: movl   $0xf80022a9,0x8(%rax)
  0.38%    0.18%   0x00007f7714bda2f2: mov    %r11d,0xc(%rax)
  1.56%    3.02%   0x00007f7714bda2f6: prefetchnta 0x140(%r9)
  4.73%    2.71%   0x00007f7714bda2fe: prefetchnta 0x180(%r9)

ConstantSize

  0.58%    1.22%   0x00007facf921132b: movq   $0x1,(%r14)            ; alloc
  0.84%    0.72%   0x00007facf9211332: prefetchnta 0xc0(%r10)
  0.11%    0.13%   0x00007facf921133a: movl   $0xf80022a9,0x8(%r14)
  0.21%    0.68%   0x00007facf9211342: prefetchnta 0x100(%r10)
  0.50%    0.87%   0x00007facf921134a: movl   $0x20,0xc(%r14)
  0.53%    0.82%   0x00007facf9211352: mov    $0x10,%ecx
  0.04%    0.14%   0x00007facf9211357: xor    %rax,%rax
  0.34%    0.76%   0x00007facf921135a: shl    $0x3,%rcx
  0.50%    1.17%   0x00007facf921135e: rex.W rep stos %al,%es:(%rdi) ; zeroing
 29.49%   52.09%   0x00007facf9211361: prefetchnta 0x140(%r10)
  1.03%    0.53%   0x00007facf9211369: prefetchnta 0x180(%r10)  

So there is that pesky rex.W rep stos %al,%es:(%rdi) consuming a significant amount of time. It zeroes the newly allocated array. In the ConstantSize test, the JVM could not figure out that you are overwriting the entire target array, so it had to pre-zero the array before diving into the actual array copy.
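
The following sketch illustrates why that zeroing is redundant at the Java level, and how the size of the zeroed region relates to the array. It is an illustration only: the class and helper names are hypothetical, the 4-byte reference size assumes compressed references (the usual default on 64-bit HotSpot with moderate heap sizes), and reading the `mov $0x10,%ecx` / `shl $0x3,%rcx` pair as "16 eight-byte words converted to bytes" is an interpretation of the listing, not something the answer states.

```java
public class ZeroingDemo {

    static final int LENGTH = 32;

    // Counts the slots that still hold the default value (null).
    static int countNulls(Object[] a) {
        int nulls = 0;
        for (Object o : a) {
            if (o == null) {
                nulls++;
            }
        }
        return nulls;
    }

    // Builds a source array with no null elements, like the benchmark's @Setup.
    static Object[] filledSource(int length) {
        Object[] src = new Object[length];
        for (int i = 0; i < length; i++) {
            src[i] = new Object();
        }
        return src;
    }

    // Bytes cleared by the zeroing sequence: the count is shifted left by 3.
    static long zeroedBytes(int count) {
        return (long) count << 3;
    }

    // Payload size of a reference array for a given (assumed) reference size.
    static long payloadBytes(int length, int referenceSize) {
        return (long) length * referenceSize;
    }

    public static void main(String[] args) {
        Object[] src = filledSource(LENGTH);

        // Java semantics require a fresh array to read as all-null (zeroed).
        Object[] dst = new Object[LENGTH];
        System.out.println("nulls after allocation: " + countNulls(dst));

        // A full-length copy overwrites every slot, so nothing can ever
        // observe the zeroes written above.
        System.arraycopy(src, 0, dst, 0, LENGTH);
        System.out.println("nulls after full copy:  " + countNulls(dst));

        // Assumption: 4-byte compressed references.
        System.out.println("zeroed bytes:  " + zeroedBytes(0x10));
        System.out.println("payload bytes: " + payloadBytes(LENGTH, 4));
    }
}
```

Under those assumptions the zeroed region covers the whole 32-element payload, which is consistent with the extra stores showing up only in the ConstantSize variant.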

If you look at the generated code on JDK 9b82 (the latest available), you will see that it folds both patterns into a non-zeroed copy. You can see this with -prof perfasm and confirm it with -prof perfnorm:

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   50   14.156 ±  0.492  ns/op
ACB.SAC:·CPI                                  avgt    5    0.612 ±  0.144   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.363 ±  0.341   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   28.350 ±  2.181   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.287 ±  0.607   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   16.922 ±  3.402   #/op
ACB.SAC:·branches                             avgt    5   21.242 ±  5.914   #/op
ACB.SAC:·cycles                               avgt    5   67.168 ± 20.950   #/op
ACB.SAC:·instructions                         avgt    5  109.931 ± 35.905   #/op

ACB.SAC_ConstantSize                          avgt   50   13.763 ±  0.067  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.625 ±  0.024   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.376 ±  0.214   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   28.285 ±  2.127   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.335 ±  0.223   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   16.926 ±  1.467   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   19.469 ±  0.869   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   62.395 ±  3.898   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5   99.891 ±  5.435   #/op

Of course, all these nanobenchmarks for arraycopy are susceptible to weird alignment-induced performance differences in the vectorized copying stubs, but that's another (horror) story that I don't have the courage to tell.
