Why is reading a volatile and writing to a field member not scalable in Java?

Views: 254 | Published: 2016/7/18 20:33:17 | Tags: java assembly concurrency jvm


Problem description

Observe the following program written in Java (the complete runnable version follows; the important part of the program is in the snippet a little further below):

import java.util.ArrayList;



/** A not easy to explain benchmark.
 */
class MultiVolatileJavaExperiment {

    public static void main(String[] args) {
        (new MultiVolatileJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList<Long> measurements = new ArrayList<Long>();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList<Thread> threads = new ArrayList<Thread>();

        for (int i = 0; i < par; i++) {
            threads.add(new Reader(sz));
            threads.get(i).start();
        }
        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }

    final class Foo {
        int x = 0;
    }

    final class Reader extends Thread {
        volatile Foo vfoo = new Foo();
        Foo bar = null;
        int sz;

        public Reader(int _sz) {
            sz = _sz;
        }

        public void run() {
            int i = 0;
            while (i < sz) {
                vfoo.x = 1;
                // with the following line commented
                // the scalability is almost linear
                bar = vfoo; // <- makes benchmark 2x slower for 2 processors - why?
                i++;
            }
        }
    }

}

Explanation: The program is actually very simple. It loads the integers size and par from the system properties (passed to the JVM with the -D flag); these are the input length and the number of threads to use later. It then parses the first command-line argument, which says how many times to repeat the program (we want to be sure that the JIT has done its work and to have more reliable measurements).
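As a side note, the program above fails with a NumberFormatException when -Dsize or -Dpar is missing, because System.getProperty then returns null. A minimal sketch of a more forgiving way to read the same two properties with Integer.getInteger follows. The class name BenchConfig and the default values are assumptions made for illustration; they are not part of the original benchmark.

```java
/** Hypothetical helper; the name and the defaults are illustrative only. */
final class BenchConfig {
    final int size; // total number of loop iterations (the -Dsize property)
    final int par;  // number of threads (the -Dpar property)

    BenchConfig(int size, int par) {
        if (size <= 0 || par <= 0) {
            throw new IllegalArgumentException("size and par must be positive");
        }
        this.size = size;
        this.par = par;
    }

    /** Reads -Dsize and -Dpar, falling back to assumed defaults when unset. */
    static BenchConfig fromSystemProperties() {
        int size = Integer.getInteger("size", 1000000); // assumed default
        int par = Integer.getInteger("par", 1);         // assumed default
        return new BenchConfig(size, par);
    }

    /** Iterations each thread performs, matching size / par in the benchmark. */
    int iterationsPerThread() {
        return size / par;
    }

    public static void main(String[] args) {
        BenchConfig config = fromSystemProperties();
        System.out.println("size=" + config.size + ", par=" + config.par
                + ", per thread=" + config.iterationsPerThread());
    }
}
```

Note that Integer.getInteger also falls back to the default when the property is set to something that is not a number, which may or may not be what you want in a benchmark.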

The run method is called in each repetition. This method simply starts par threads, each of which runs a loop with size / par iterations. The thread body is defined in the Reader class. Each iteration of the loop reads the volatile member vfoo and assigns 1 to its field x. After that, vfoo is read once again and assigned to the non-volatile field bar.

Notice that the program spends most of its time executing the loop body, so the run method of the thread is the focus of this benchmark:

    final class Reader extends Thread {
        volatile Foo vfoo = new Foo();
        Foo bar = null;
        int sz;

        public Reader(int _sz) {
            sz = _sz;
        }

        public void run() {
            int i = 0;
            while (i < sz) {
                vfoo.x = 1;
                // with the following line commented
                // the scalability is almost linear
                bar = vfoo; // <- makes benchmark 2x slower for 2 processors - why?
                i++;
            }
        }
    }

Observations: Running java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 MultiVolatileJavaExperiment 10 on the following machine:

Ubuntu Server 10.04.3 LTS
8 core Intel(R) Xeon(R) CPU  X5355  @2.66GHz
~20GB ram
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

I get the following times:

>>> All running times: [821, 750, 1011, 750, 758, 755, 1219, 751, 751, 1012]

Now, setting -Dpar=2, I get:

>>> All running times: [1618, 380, 1476, 1245, 1390, 1391, 1445, 1393, 1511, 1508]

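Clearly, this doesn't scale for some reason - I would have expected the second output to be twice as fast (although it does seem to be in one of the early iterations - 380 ms).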

Interestingly, commenting out the line bar = vfoo (which isn't even supposed to be a volatile write) yields the following times for -Dpar set to 1, 2, 4 and 8:

>>> All running times: [762, 563, 563, 563, 563, 563, 570, 566, 563, 563]
>>> All running times: [387, 287, 285, 284, 283, 281, 282, 282, 281, 282]
>>> All running times: [204, 146, 143, 142, 141, 141, 141, 141, 141, 141]
>>> All running times: [120, 78, 74, 74, 81, 75, 73, 73, 72, 71]

These scale perfectly.
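To put a number on "scales" versus "doesn't scale", the measurements quoted above can be reduced to speedup factors. The sketch below is an illustration only: the choice of the median as the summary statistic and the class name SpeedupSketch are assumptions, and the arrays are copied from the outputs listed in the question.

```java
import java.util.Arrays;

/** Hypothetical helper that turns the quoted timings into speedup factors. */
final class SpeedupSketch {

    /** Median of the measurements; averages the two middle values for even counts. */
    static double median(long[] times) {
        long[] sorted = times.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Speedup of a parallel run relative to the single-threaded baseline. */
    static double speedup(long[] baseline, long[] parallel) {
        return median(baseline) / median(parallel);
    }

    public static void main(String[] args) {
        // Timings with bar = vfoo present (the version that does not scale).
        long[] slowPar1 = {821, 750, 1011, 750, 758, 755, 1219, 751, 751, 1012};
        long[] slowPar2 = {1618, 380, 1476, 1245, 1390, 1391, 1445, 1393, 1511, 1508};

        // Timings with bar = vfoo commented out (the version that scales).
        long[] fastPar1 = {762, 563, 563, 563, 563, 563, 570, 566, 563, 563};
        long[] fastPar2 = {387, 287, 285, 284, 283, 281, 282, 282, 281, 282};
        long[] fastPar4 = {204, 146, 143, 142, 141, 141, 141, 141, 141, 141};
        long[] fastPar8 = {120, 78, 74, 74, 81, 75, 73, 73, 72, 71};

        System.out.printf("with bar = vfoo,    par=2: %.2fx%n", speedup(slowPar1, slowPar2));
        System.out.printf("without bar = vfoo, par=2: %.2fx%n", speedup(fastPar1, fastPar2));
        System.out.printf("without bar = vfoo, par=4: %.2fx%n", speedup(fastPar1, fastPar4));
        System.out.printf("without bar = vfoo, par=8: %.2fx%n", speedup(fastPar1, fastPar8));
    }
}
```

With these inputs the version without bar = vfoo comes out at roughly 2x, 4x and 7.6x for 2, 4 and 8 threads, while the version with bar = vfoo comes out at roughly 0.53x for 2 threads, that is, a slowdown.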

Analysis: First of all, there are no garbage collection cycles occurring here (I added -verbose:gc as well to check this).

I get similar results on my iMac.

Each thread is writing to its own field, and the different Foo object instances belonging to different threads don't appear to end up in the same cache lines: adding more members to Foo to increase its size doesn't change the measurements. Each thread object instance has more than enough fields to fill up an L1 cache line. So this probably isn't a memory issue.
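The padding experiment mentioned above could look roughly like the following sketch. PaddedFoo and the number of padding fields are assumptions, since the question does not show which members were actually added. The JVM is also free to choose its own field layout, so this illustrates the idea rather than guaranteeing that x sits alone on a cache line.

```java
/**
 * Hypothetical padded variant of Foo. The extra long fields are never read;
 * they only enlarge the object so that the x fields of two instances are
 * less likely to share a cache line (assuming 64-byte lines).
 */
final class PaddedFoo {
    // 7 * 8 = 56 bytes of padding on each side of x (illustrative choice)
    long p1, p2, p3, p4, p5, p6, p7;
    int x = 0;
    long q1, q2, q3, q4, q5, q6, q7;

    public static void main(String[] args) {
        PaddedFoo foo = new PaddedFoo();
        foo.x = 1; // the same store the benchmark loop performs on Foo
        System.out.println("x = " + foo.x);
    }
}
```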

My next thought was that the JIT might be doing something weird, because the early iterations usually do scale as expected in the uncommented version, so I checked this by printing the assembly (see this post on how to do that).

java -Xmx512m -Xms512m -server -XX:CompileCommand=print,*Reader.run -Dsize=500000000 -Dpar=1 MultiVolatileJavaExperiment 10

and I get these two outputs for the two versions of the JIT-compiled run method in Reader. The commented (properly scalable) version:

[Verified Entry Point]
  0xf36c9fac: mov    %eax,-0x3000(%esp)
  0xf36c9fb3: push   %ebp
  0xf36c9fb4: sub    $0x8,%esp
  0xf36c9fba: mov    0x68(%ecx),%ebx
  0xf36c9fbd: test   %ebx,%ebx
  0xf36c9fbf: jle    0xf36c9fec
  0xf36c9fc1: xor    %ebx,%ebx
  0xf36c9fc3: nopw   0x0(%eax,%eax,1)
  0xf36c9fcc: xchg   %ax,%ax
  0xf36c9fd0: mov    0x6c(%ecx),%ebp
  0xf36c9fd3: test   %ebp,%ebp
  0xf36c9fd5: je     0xf36c9ff7
  0xf36c9fd7: movl   $0x1,0x8(%ebp)

---------------------------------------------

  0xf36c9fde: mov    0x68(%ecx),%ebp
  0xf36c9fe1: inc    %ebx               ; OopMap{ecx=Oop off=66}
                                        ;*goto
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@21 (line 83)

---------------------------------------------

  0xf36c9fe2: test   %edi,0xf7725000    ;   {poll}
  0xf36c9fe8: cmp    %ebp,%ebx
  0xf36c9fea: jl     0xf36c9fd0
  0xf36c9fec: add    $0x8,%esp
  0xf36c9fef: pop    %ebp
  0xf36c9ff0: test   %eax,0xf7725000    ;   {poll_return}
  0xf36c9ff6: ret    
  0xf36c9ff7: mov    $0xfffffff6,%ecx
  0xf36c9ffc: xchg   %ax,%ax
  0xf36c9fff: call   0xf36a56a0         ; OopMap{off=100}
                                        ;*putfield x
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79)
                                        ;   {runtime_call}
  0xf36ca004: call   0xf6f877a0         ;   {runtime_call}

The uncommented bar = vfoo (non-scalable, slower) version:

[Verified Entry Point]
  0xf3771aac: mov    %eax,-0x3000(%esp)
  0xf3771ab3: push   %ebp
  0xf3771ab4: sub    $0x8,%esp
  0xf3771aba: mov    0x68(%ecx),%ebx
  0xf3771abd: test   %ebx,%ebx
  0xf3771abf: jle    0xf3771afe
  0xf3771ac1: xor    %ebx,%ebx
  0xf3771ac3: nopw   0x0(%eax,%eax,1)
  0xf3771acc: xchg   %ax,%ax
  0xf3771ad0: mov    0x6c(%ecx),%ebp
  0xf3771ad3: test   %ebp,%ebp
  0xf3771ad5: je     0xf3771b09
  0xf3771ad7: movl   $0x1,0x8(%ebp)

-------------------------------------------------

  0xf3771ade: mov    0x6c(%ecx),%ebp
  0xf3771ae1: mov    %ebp,0x70(%ecx)
  0xf3771ae4: mov    0x68(%ecx),%edi
  0xf3771ae7: inc    %ebx
  0xf3771ae8: mov    %ecx,%eax
  0xf3771aea: shr    $0x9,%eax
  0xf3771aed: movb   $0x0,-0x3113c300(%eax)  ; OopMap{ecx=Oop off=84}
                                        ;*goto
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@29 (line 83)

-----------------------------------------------

  0xf3771af4: test   %edi,0xf77ce000    ;   {poll}
  0xf3771afa: cmp    %edi,%ebx
  0xf3771afc: jl     0xf3771ad0
  0xf3771afe: add    $0x8,%esp
  0xf3771b01: pop    %ebp
  0xf3771b02: test   %eax,0xf77ce000    ;   {poll_return}
  0xf3771b08: ret    
  0xf3771b09: mov    $0xfffffff6,%ecx
  0xf3771b0e: nop    
  0xf3771b0f: call   0xf374e6a0         ; OopMap{off=116}
                                        ;*putfield x
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79)
                                        ;   {runtime_call}
  0xf3771b14: call   0xf70307a0         ;   {runtime_call}

The differences between the two versions are within the --------- markers. I expected to find synchronization instructions in the assembly that might account for the performance issue. While a few extra shift, mov and inc instructions might affect absolute performance numbers, I don't see how they could affect scalability.
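One possible reading of the extra shr $0x9 / movb $0x0 pair in the slower listing, offered here as an assumption rather than something the question establishes, is a card-marking write barrier: HotSpot's generational collectors record reference stores by writing a byte into a card table indexed by the object address shifted right by 9 (512-byte cards). Under that reading, Reader objects allocated close together in memory would write to neighboring bytes of the card table. The sketch below only does the index arithmetic. The addresses, the 64-byte cache-line size, and the assumption that the card table starts on a cache-line boundary are all hypothetical.

```java
/** Hypothetical arithmetic sketch; it does not inspect a real JVM heap. */
final class CardMarkSketch {
    static final int CARD_SHIFT = 9;       // matches shr $0x9 in the listing
    static final int CACHE_LINE_BYTES = 64; // assumed cache-line size

    /** Index of the card-table byte covering the given address. */
    static long cardIndex(long objectAddress) {
        return objectAddress >>> CARD_SHIFT;
    }

    /**
     * True if the card bytes for both addresses fall into the same cache line,
     * assuming the card table itself begins on a cache-line boundary.
     */
    static boolean cardsShareCacheLine(long addressA, long addressB) {
        return cardIndex(addressA) / CACHE_LINE_BYTES
                == cardIndex(addressB) / CACHE_LINE_BYTES;
    }

    public static void main(String[] args) {
        long readerA = 0x10000000L;           // made-up address of one Reader
        long readerB = readerA + 0x200;       // made-up neighbor, 512 bytes away
        long holder = readerA + 64L * 512L;   // made-up object 32 KB away

        System.out.println("neighboring objects share a card-table line: "
                + cardsShareCacheLine(readerA, readerB));
        System.out.println("distant objects share a card-table line: "
                + cardsShareCacheLine(readerA, holder));
    }
}
```

With one card byte per 512 bytes of heap and 64-byte cache lines, a single cache line of the card table would cover 32 KB of heap, so objects would need to be at least that far apart before their card bytes are certain to land on different lines.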

So, I suspect that this is some sort of memory issue related to storing to a field in the class. On the other hand, I'm also inclined to believe that the JIT does something funny, because in one iteration the measured time is twice as fast, as it should be.

Can anyone explain what is going on here? Please be precise and include references that support your claims.

Thank you!

EDIT:

Here's the bytecode for the fast (scalable) version:

public void run();
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 83: 18
   line 85: 24



  Code:
   Stack=2, Locals=2, Args_size=1
   0:   iconst_0
   1:   istore_1
   2:   iload_1
   3:   aload_0
   4:   getfield    #7; //Field sz:I
   7:   if_icmpge   24
   10:  aload_0
   11:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   14:  iconst_1
   15:  putfield    #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I
   18:  iinc    1, 1
   21:  goto    2
   24:  return
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 83: 18
   line 85: 24

  StackMapTable: number_of_entries = 2
   frame_type = 252 /* append */
     offset_delta = 2
     locals = [ int ]
   frame_type = 21 /* same */

The slow (non-scalable) version with bar = vfoo:

public void run();
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 82: 18
   line 83: 26
   line 85: 32



  Code:
   Stack=2, Locals=2, Args_size=1
   0:   iconst_0
   1:   istore_1
   2:   iload_1
   3:   aload_0
   4:   getfield    #7; //Field sz:I
   7:   if_icmpge   32
   10:  aload_0
   11:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   14:  iconst_1
   15:  putfield    #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I
   18:  aload_0
   19:  aload_0
   20:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   23:  putfield    #6; //Field bar:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   26:  iinc    1, 1
   29:  goto    2
   32:  return
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 82: 18
   line 83: 26
   line 85: 32

  StackMapTable: number_of_entries = 2
   frame_type = 252 /* append */
     offset_delta = 2
     locals = [ int ]
   frame_type = 29 /* same */

The more I experiment with this, the more it seems to me that this has nothing to do with volatiles at all - it has something to do with writing to object fields. My hunch is that this is somehow a memory contention issue - something to do with caches and false sharing, although there is no explicit synchronization at all.

EDIT 2:

Interestingly, changing the program like this:

final class Holder {
    public Foo bar = null;
}

final class Reader extends Thread {
    volatile Foo vfoo = new Foo();
    Holder holder = null;
    int sz;

    public Reader(int _sz) {
        sz = _sz;
    }

    public void run() {
        int i = 0;
        holder = new Holder();
        while (i < sz) {
            vfoo.x = 1;
            holder.bar = vfoo;
            i++;
        }
    }
}

resolves the scaling issue. Apparently, the Holder object above gets created after the thread is started and is probably allocated in a different segment of memory, which is then modified concurrently, as opposed to modifying the field bar in the thread object, which is somehow "close" in memory between different thread instances.
