What can explain the huge performance penalty of writing a reference to a heap location?


The question



While investigating the subtler consequences of generational garbage collectors on application performance, I have hit a quite staggering discrepancy in the performance of a very basic operation – a simple write to a heap location – with respect to whether the value written is primitive or a reference.

The microbenchmark

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Writing
{
  static final int TARGET_SIZE = 1024;

  static final    int[] primitiveArray = new    int[TARGET_SIZE];
  static final Object[] referenceArray = new Object[TARGET_SIZE];

  int val = 1;
  @GenerateMicroBenchmark
  public void fillPrimitiveArray() {
    final int primitiveValue = val++;
    for (int i = 0; i < TARGET_SIZE; i++)
      primitiveArray[i] = primitiveValue;
  }

  @GenerateMicroBenchmark
  public void fillReferenceArray() {
    final Object referenceValue = new Object();
    for (int i = 0; i < TARGET_SIZE; i++)
      referenceArray[i] = referenceValue;
  }
}

The results

Benchmark              Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray     avgt   1      6    1       87.891        1.610  nsec/op
fillReferenceArray     avgt   1      6    1      640.287        8.368  nsec/op

Since the whole loop is almost 8 times slower, and the loop bookkeeping costs the same in both cases, the write itself is probably more than 10 times slower. What could possibly explain such a slowdown?

The speed of writing out the primitive array is more than 10 writes per nanosecond. Perhaps I should ask the flip-side of my question: what makes primitive writing so fast? (BTW I've checked, the times scale linearly with array size.)
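
For concreteness, the per-element numbers work out as follows (1024 writes per invocation):

 87.891 ns / 1024  ≈ 0.086 ns per primitive write  (≈ 11.6 writes/ns)
640.287 ns / 1024  ≈ 0.625 ns per reference write  (≈ 7.3x slower per element)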

Note that this is all single-threaded; specifying @Threads(2) will increase both measurements, but the ratio will be similar.


A bit of background: the card table and the associated write barrier

An object in the Young Generation could happen to be reachable only from an object in the Old Generation. To avoid collecting live objects, the YG collector must know about any references that were written to the Old Generation area since the last YG collection. This is achieved with a sort of "dirty flag table", called the card table, which has one flag for each block of 512 bytes of heap.
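
In code terms, the card guarding a given address is found by shifting the address right by 9 bits (2^9 = 512). Here is a minimal conceptual model in Java, with names of my own invention; HotSpot's real card table is implemented in C++ inside the VM:

class CardTableModel {
  static final int CARD_SHIFT = 9;   // one card per 2^9 = 512 bytes of heap
  static final byte DIRTY = 0;       // HotSpot happens to use 0 for "dirty"

  final byte[] cards;                // one byte per 512-byte heap block

  CardTableModel(long heapSizeBytes) {
    cards = new byte[(int) (heapSizeBytes >>> CARD_SHIFT)];
  }

  // The unconditional post-write barrier: executed after every reference store.
  void postWriteBarrier(long storedToAddress) {
    cards[(int) (storedToAddress >>> CARD_SHIFT)] = DIRTY;
  }
}

Every reference store thus carries a second, dependent store into this table; that is the extra pair of instructions in the barrier code below.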

The "ugly" part of the scheme comes when we realize that each and every write of a reference must be accompanied by a card table invariant-maintaining piece of code: the location in the card table which guards the address being written to must be marked as dirty. This piece of code is termed the write barrier.

In specific machine code, this looks as follows:

lea   edx, [edi+ebp*4+0x10]   ; calculate the heap location to write
mov   [edx], ebx              ; write the value to the heap location
shr   edx, 9                  ; calculate the offset into the card table
mov   [ecx+edx], ah           ; mark the card table entry as dirty

And this is all it takes for the same high-level operation when the value written is primitive:

mov   [edx+ebx*4+0x10], ebp

The write barrier appears to contribute "just" one more write, but my measurements show that it causes an order-of-magnitude slowdown. I can't explain this.

UseCondCardMark just makes it worse

There is a quite obscure JVM flag which is supposed to avoid the card table write if the entry is already marked dirty. This is important primarily in some degenerate cases where a lot of card table writing causes false sharing between threads via CPU caches. Anyway, I tried with that flag on:

with  -XX:+UseCondCardMark:
Benchmark              Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray     avgt   1      6    1       89.913        3.586  nsec/op
fillReferenceArray     avgt   1      6    1     1504.123       12.130  nsec/op
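
What the flag changes, conceptually: the barrier becomes a load, a compare, and a conditional store (a sketch continuing the hypothetical model above):

// Conditional card marking: skip the store when the card is already dirty.
// This trades an extra load and branch on every reference store for fewer
// cache-line invalidations when many threads keep dirtying the same cards.
void condPostWriteBarrier(long storedToAddress) {
  int card = (int) (storedToAddress >>> CARD_SHIFT);
  if (cards[card] != DIRTY)
    cards[card] = DIRTY;
}

Single-threaded there is no false sharing to avoid, so the added load and branch are pure overhead, which is consistent with the slowdown measured here.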

The solution

Quoting the authoritative answer provided by Vladimir Kozlov on the hotspot-compiler-dev mailing list:

Hi Marko,

For primitive arrays we use handwritten assembler code which use XMM registers as vectors for initialization. For object arrays we did not optimize it because it is not common case. We can improve it similar to what we did for arracopy but we decided leave it for now.

Regards,
Vladimir

I have also wondered why the optimized code is not inlined, and got that answer as well:

The code is not small, so we decided to not inline it. Look on MacroAssembler::generate_fill() in macroAssembler_x86.cpp:

http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/54f0c207dc35/src/cpu/x86/vm/macroAssembler_x86.cpp


My original answer:

I missed an important bit in the machine code, apparently because I was looking at the On-Stack Replacement version of the compiled method instead of the one used for subsequent calls. It turns out that HotSpot was able to prove that my loop amounts to what a call to Arrays.fill would have done and replaced the entire loop with a call instruction to such code. I can't see that function's code, but it probably uses every possible trick, such as MMX instructions, to fill a block of memory with the same 32-bit value.

This gave me the idea to measure the actual Arrays.fill calls (a sketch of the fill-based benchmark bodies follows the discussion below). I was in for more surprises:

Benchmark                  Mode Thr    Cnt  Sec         Mean   Mean error    Units
fillPrimitiveArray         avgt   1      5    2      155.343        1.318  nsec/op
fillReferenceArray         avgt   1      5    2      682.975       17.990  nsec/op
loopFillPrimitiveArray     avgt   1      5    2      156.114        0.523  nsec/op
loopFillReferenceArray     avgt   1      5    2      682.209        7.047  nsec/op

The results with a loop and with a call to fill are identical. If anything, this is even more confusing than the results which motivated the question. I would have at least expected fill to benefit from the same optimization ideas regardless of value type.
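
For reference, the fill-based benchmark bodies are not shown above; presumably they amounted to something like this sketch, with import java.util.Arrays added to the class (loopFillPrimitiveArray and loopFillReferenceArray are the original loop methods, renamed):

@GenerateMicroBenchmark
public void fillPrimitiveArray() {
  // Arrays.fill on an int[] dispatches to the intrinsified fill stub.
  Arrays.fill(primitiveArray, val++);
}

@GenerateMicroBenchmark
public void fillReferenceArray() {
  // Arrays.fill on an Object[] still executes a write barrier per element.
  Arrays.fill(referenceArray, new Object());
}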
