How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?


Problem Description

When playing around with multithreading, I could observe some unexpected but serious performance issues related to AtomicLong (and classes using it, such as java.util.Random), for which I currently have no explanation. However, I created a minimalistic example, which basically consists of two classes: a class "Container", which holds a volatile long, and a class "DemoThread", which operates on an instance of "Container" during thread execution. Note that the reference to "Container" and the volatile long are private and never shared between threads (I know there's no need to use volatile here; it's just for demonstration purposes) - thus, multiple instances of "DemoThread" should run perfectly in parallel on a multiprocessor machine, but for some reason they do not (the complete example is at the bottom of this post).

private static class Container  {

    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {

    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    public void run() {
        for(int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
    }
}

During my test, I repeatedly create 4 DemoThreads, which are then started and joined. The only difference between the loops is when prepare() gets called (which is obviously required for the thread to run, as it would otherwise result in a NullPointerException):

DemoThread[] threads = new DemoThread[numberOfThreads];
for(int j = 0; j < 100; j++) {
    boolean prepareAfterConstructor = j % 2 == 0;
    for(int i = 0; i < threads.length; i++) {
        threads[i] = new DemoThread();
        if(prepareAfterConstructor) threads[i].prepare();
    }

    for(int i = 0; i < threads.length; i++) {
        if(!prepareAfterConstructor) threads[i].prepare();
        threads[i].start();
    }
    joinThreads(threads);
}

For some reason, if prepare() is executed immediately before starting the thread, it takes twice as long to finish, and even without the volatile keyword the performance difference was significant on at least two of the machines and operating systems I tested the code on. Here's a short summary:

Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02-334
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Mac OS X
OS Arch: x86_64
OS Version: 10.6.5
Processors/Cores: 8

With volatile keyword:
Final results:
31979 ms. when prepare() was called after instantiation.
96482 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
26009 ms. when prepare() was called after instantiation.
35196 ms. when prepare() was called before execution.

Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Windows 7
OS Arch: amd64
OS Version: 6.1
Processors/Cores: 4

With volatile keyword:
Final results:
18120 ms. when prepare() was called after instantiation.
36089 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
10115 ms. when prepare() was called after instantiation.
10039 ms. when prepare() was called before execution.

Java Version: 1.6.0_20
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.0-b09
VM Name: OpenJDK 64-Bit Server VM
OS Name: Linux
OS Arch: amd64
OS Version: 2.6.32-28-generic
Processors/Cores: 4

With volatile keyword:
Final results:
45848 ms. when prepare() was called after instantiation.
110754 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
37862 ms. when prepare() was called after instantiation.
39357 ms. when prepare() was called before execution.

Test 1, 4 threads, setting variable in creation loop
Thread-2 completed after 653 ms.
Thread-3 completed after 653 ms.
Thread-4 completed after 653 ms.
Thread-5 completed after 653 ms.
Overall time: 654 ms.

Test 2, 4 threads, setting variable in start loop
Thread-7 completed after 1588 ms.
Thread-6 completed after 1589 ms.
Thread-8 completed after 1593 ms.
Thread-9 completed after 1593 ms.
Overall time: 1594 ms.

Test 3, 4 threads, setting variable in creation loop
Thread-10 completed after 648 ms.
Thread-12 completed after 648 ms.
Thread-13 completed after 648 ms.
Thread-11 completed after 648 ms.
Overall time: 648 ms.

Test 4, 4 threads, setting variable in start loop
Thread-17 completed after 1353 ms.
Thread-16 completed after 1957 ms.
Thread-14 completed after 2170 ms.
Thread-15 completed after 2169 ms.
Overall time: 2172 ms.

(and so on; sometimes one or two of the threads in the 'slow' loop finish as expected, but most of the time they don't).

The given example looks theoretical, as it is of no practical use and volatile isn't needed here - however, if you use a java.util.Random instance instead of the Container class and call, for instance, nextInt() multiple times, the same effect occurs: the thread executes fast if you create the object in the thread's constructor, but slowly if you create it within the run() method. I believe the performance issues described in Java Random Slowdowns on Mac OS more than a year ago are related to this effect, but I have no idea why it is the way it is - besides, I'm sure it shouldn't be like that, as it would mean that it's always dangerous to create a new object within the run method of a thread unless you know that no volatile variables are involved in the object graph. Profiling doesn't help, as the problem disappears in that case (same observation as in Java Random Slowdowns on Mac OS cont'd), and it also does not happen on a single-core PC - so I'd guess it's some kind of thread synchronization problem... however, the strange thing is that there's actually nothing to synchronize, as all variables are thread-local.
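For reference, a minimal sketch of the Random-based variant described above might look like the following (the class name, iteration count and printed dummy value are illustrative choices; java.util.Random keeps its seed in an AtomicLong, i.e. a volatile long updated via compare-and-set, which is where the coupling comes in):

private static class RandomDemoThread extends Thread {

    private java.util.Random random;

    public void prepare() {
        // java.util.Random wraps an AtomicLong seed internally
        this.random = new java.util.Random();
    }

    @Override
    public void run() {
        int dummy = 0;
        for(int j = 0; j < 10000000; j++) {
            dummy += random.nextInt();
        }
        System.out.println(this.getName() + " finished (dummy=" + dummy + ")");
    }
}

As with Container, the only difference between the fast and the slow runs is whether prepare() is called right after construction or right before start().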

Really looking forward to any hints - and just in case you want to confirm or falsify the problem, see the test case below.

Thanks,

Stephan

public class UnexpectedPerformanceIssue {

private static class Container  {

    // Remove the volatile keyword, and the problem disappears (on windows)
    // or gets smaller (on mac os)
    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {

    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    @Override
    public void run() {
        long start = System.nanoTime();
        for(int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
        long end = System.nanoTime();
        System.out.println(this.getName() + " completed after "
                +  ((end - start)/1000000) + " ms.");
    }
}

public static void main(String[] args) {
    System.out.println("Java Version: " + System.getProperty("java.version"));
    System.out.println("Java Class Version: " + System.getProperty("java.class.version"));

    System.out.println("VM Vendor: " + System.getProperty("java.vm.specification.vendor"));
    System.out.println("VM Version: " + System.getProperty("java.vm.version"));
    System.out.println("VM Name: " + System.getProperty("java.vm.name"));

    System.out.println("OS Name: " + System.getProperty("os.name"));
    System.out.println("OS Arch: " + System.getProperty("os.arch"));
    System.out.println("OS Version: " + System.getProperty("os.version"));
    System.out.println("Processors/Cores: " + Runtime.getRuntime().availableProcessors());

    System.out.println();
    int numberOfThreads = 4;

    System.out.println("\nReference Test (single thread):");
    DemoThread t = new DemoThread();
    t.prepare();
    t.run();

    DemoThread[] threads = new DemoThread[numberOfThreads];
    long createTime = 0, startTime = 0;
    for(int j = 0; j < 100; j++) {
        boolean prepareAfterConstructor = j % 2 == 0;
        long overallStart = System.nanoTime();
        if(prepareAfterConstructor) {
            System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in creation loop");             
        } else {
            System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in start loop");
        }

        for(int i = 0; i < threads.length; i++) {
            threads[i] = new DemoThread();
            // Either call DemoThread.prepare() here (in odd loops)...
            if(prepareAfterConstructor) threads[i].prepare();
        }

        for(int i = 0; i < threads.length; i++) {
            // or here (in even loops). Should make no difference, but does!
            if(!prepareAfterConstructor) threads[i].prepare();
            threads[i].start();
        }
        joinThreads(threads);
        long overallEnd = System.nanoTime();
        long overallTime = (overallEnd - overallStart);
        if(prepareAfterConstructor) {
            createTime += overallTime;
        } else {
            startTime += overallTime;
        }
        System.out.println("Overall time: " + (overallTime)/1000000 + " ms.");
    }
    System.out.println("Final results:");
    System.out.println(createTime/1000000 + " ms. when prepare() was called after instantiation.");
    System.out.println(startTime/1000000 + " ms. when prepare() was called before execution.");
}

private static void joinThreads(Thread[] threads) {
    for(int i = 0; i < threads.length; i++) {
        try {
            threads[i].join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

}

Answer

It's likely that the two volatile variables a and b are too close to each other and fall into the same cache line; although CPU A only reads/writes variable a and CPU B only reads/writes variable b, they are still coupled to each other through that shared cache line. Such problems are called false sharing.

In your example, we have two allocation schemes:

new Thread                               new Thread
new Container               vs           new Thread
new Thread                               ....
new Container                            new Container
....                                     new Container

In the first scheme, it's very unlikely that the two volatile variables end up close to each other. In the second scheme, it's almost certainly the case.

CPU caches don't work with individual words; instead, they deal with cache lines. A cache line is a contiguous chunk of memory, say 64 neighboring bytes. Usually this is a good thing - if a CPU accesses a cell, it's very likely to access the neighboring cells too. In your example, however, that assumption is not only invalid but detrimental.

Suppose a and b fall in the same cache line L. When CPU A updates a, it notifies the other CPUs that L is dirty. Since B caches L too (because it's working on b), B must drop its cached copy of L. So the next time B needs to read b, it has to reload L, which is costly.

If B has to go all the way to main memory for that reload, it is extremely costly - usually around 100x slower than a cache hit.

Fortunately, A and B can exchange the new values directly without going through main memory. Nevertheless, it takes extra time.

To verify this theory, you can stuff an extra 128 bytes into Container, so that the volatile variables of two Container instances will not fall into the same cache line; you should then observe that the two schemes take about the same time to execute.
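A sketch of that suggestion might look like the following (the field names and the exact amount of padding are illustrative, and field layout is ultimately up to the JVM, so this is a heuristic rather than a guarantee):

private static class PaddedContainer {

    // 8 longs (64 bytes) on each side of the hot field, so that the volatile
    // 'value' fields of two instances should not share a cache line
    private long p1, p2, p3, p4, p5, p6, p7, p8;
    private volatile long value;
    private long q1, q2, q3, q4, q5, q6, q7, q8;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

With this in place of Container, the 'creation loop' and 'start loop' variants should finish in roughly the same time.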

Lesson learned: CPUs usually assume that adjacent variables are related. If we want independent variables, we had better place them far away from each other.
