Array allocation and access on the Java Virtual Machine and memory contention


Question



Observe the following definition of a thread subclass (the entire runnable Java source file is included at the end of the question for your convenience):

final class Worker extends Thread {
    Foo[] array = new Foo[1024];
    int sz;

    public Worker(int _sz) {
        sz = _sz;
    }

    public void run() {
        //Foo[] arr = new Foo[1024];
        Foo[] arr = array;
        loop(arr);
    }

    public void loop(Foo[] arr) {
        int i = 0;
        int pos = 512;
        Foo v = new Foo();
        while (i < sz) {
            if (i % 2 == 0) {
                arr[pos] = v;
                pos += 1;
            } else {
                pos -= 1;
                v = arr[pos];
            }
            i++;
        }
    }
}

Explanation: The program starts -Dpar such threads, and sets the sz of each thread to -Dsize / -Dpar, where -Dsize and -Dpar are set through the command line when running the program. Each thread object has a field array which is initialized with a fresh 1024-element array. The reasoning is that we want to divide an equal amount of work between a different number of threads - we expect the program to scale.
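
As a worked instance of the question's own numbers: with -Dsize=500000000 and -Dpar=8, each of the 8 Worker threads gets sz = 500000000 / 8 = 62,500,000 loop iterations.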

Each thread is then started and the time needed for all the threads to complete is measured. We do multiple measurements to counter any JIT-related effects, as shown below. Each thread does a loop. Within the loop, the thread writes an element at position 512 of the array in even iterations, and reads it back from position 512 in odd iterations. Only local variables are modified otherwise.

Full program is below.

Analysis:

Tested with -verbose:gc - there is no garbage collection occurring during the run of this program.

Run command:

java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 org.scalapool.bench.MultiStackJavaExperiment 7

CASE 1: Running times for 1,2,4,8 threads, in that order (7 repetitions):

>>> All running times: [2149, 2227, 1974, 1948, 1803, 2283, 1878]
>>> All running times: [1140, 1124, 2022, 1141, 2028, 2004, 2136]
>>> All running times: [867, 1022, 1457, 1342, 1436, 966, 1531]
>>> All running times: [915, 864, 1245, 1243, 948, 790, 1007]

My thought was that the nonlinear scaling is due to memory contention. Incidentally, early iterations actually do better - this might be due to the fact that in different iterations the arrays are allocated in different memory areas.

CASE 2: Next, I comment out the Foo[] arr = array line in the run method of the thread and allocate a new array in the run method itself: Foo[] arr = new Foo[1024]. Measurements:

>>> All running times: [2053, 1966, 2089, 1937, 2046, 1909, 2011]
>>> All running times: [1048, 1178, 1100, 1194, 1367, 1271, 1207]
>>> All running times: [578, 508, 589, 571, 617, 643, 645]
>>> All running times: [330, 299, 300, 322, 331, 324, 575]

This time, everything scales pretty much as expected. I wouldn't have imagined that the location where the array was allocated plays any role whatsoever, but obviously it does somehow. My thought was that the arrays were previously allocated so close to each other that some memory contention started happening.

CASE 3: To verify this assumption, I've uncommented the line Foo[] arr = array again, but this time initialized the array field to new Foo[32000] to ensure that the locations in memory being written to are sufficiently far from each other. So, here we're again using the array allocated during the creation of the thread object; the only difference from CASE 1 is that the array is bigger.

>>> All running times: [2113, 1983, 2430, 2485, 2333, 2359, 2463]
>>> All running times: [1172, 1106, 1163, 1181, 1142, 1169, 1188]
>>> All running times: [578, 677, 614, 604, 583, 637, 597]
>>> All running times: [343, 327, 320, 330, 353, 320, 320]

So, memory contention seems to be the cause of this.

The platform information:

Ubuntu Server 10.04.3 LTS
8 core Intel(R) Xeon(R) CPU  X5355  @2.66GHz
~20GB ram
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

Question: This is obviously a memory-contention issue. But why is this happening?

  1. Is the escape analysis kicking in? If so, does it mean that the entire array is allocated on the stack when created in the run method in CASE2? What are the exact conditions for this runtime optimization? Surely the array is not allocated on the stack for 1 million elements?

  2. Even if the array is being allocated on the stack as opposed to being allocated on the heap, two array accesses by different threads should be separated by at least 512 * 4 bytes = 2 KB even in CASE 1, wherever the arrays are! That's definitely larger than any L1 cache line. If these effects are due to false sharing, how can writes to several totally independent cache lines affect performance this much? (One assumption here is that each array occupies a contiguous block of memory on the JVM, which is allocated when the array is created. I'm not sure this is valid. Another assumption is that array writes don't go all the way to memory, but to the L1 cache instead, as Intel Xeon does have a ccNUMA architecture - correct me if I'm wrong)

  3. Is it possible that each thread has its own local heap part where it independently allocates new objects, and this is the cause for lower contention when the array is allocated in the thread? If so, how is that area of heap garbage collected if references get shared?

  4. Why has increasing the array size to ~32000 elements improved scalability (decreased memory contention)? What exactly in the memory hierarchy is the cause of this?

Please be precise and support your claims with references.

Thank you!


The entire runnable Java program:

import java.util.ArrayList;

class MultiStackJavaExperiment {

    final class Foo {
        int x = 0;
    }

    final class Worker extends Thread {
        Foo[] array = new Foo[1024];
        int sz;

        public Worker(int _sz) {
            sz = _sz;
        }

        public void run() {
            Foo[] arr = new Foo[1024];
            //Foo[] arr = array;
            loop(arr);
        }

        public void loop(Foo[] arr) {
            int i = 0;
            int pos = 512;
            Foo v = new Foo();
            while (i < sz) {
                if (i % 2 == 0) {
                    arr[pos] = v;
                    pos += 1;
                } else {
                    pos -= 1;
                    v = arr[pos];
                }
                i++;
            }
        }
    }

    public static void main(String[] args) {
        (new MultiStackJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList<Long> measurements = new ArrayList<Long>();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList<Thread> threads = new ArrayList<Thread>();

        for (int i = 0; i < par; i++) {
            threads.add(new Worker(sz));
            threads.get(i).start();
        }
        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }

}

Solution

Run the JVM with the -XX:+UseCondCardMark flag, available only in JDK7. This solves the problem.
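
For example, the benchmark command from the question becomes the following (only the flag is new; the heap settings, class name and system properties are copied verbatim from the question, here with 8 threads):

java -XX:+UseCondCardMark -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=8 org.scalapool.bench.MultiStackJavaExperiment 7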

Explanation

Essentially, most managed-heap environments use card tables to mark the areas of memory into which writes have occurred. Such a memory area is marked as dirty in the card table once the write occurs. This information is needed for garbage collection - memory areas that are not dirty don't have to be scanned for references. A card is a contiguous block of memory, typically 512 bytes. A card table typically has 1 byte for each card - if this byte is set, the card is dirty. This means that 64 bytes of the card table cover 64 * 512 bytes of memory. And typically, the cache-line size today is 64 bytes.
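
A quick consequence of these numbers (using the typical sizes stated above - 512-byte cards, one card-table byte per card, 64-byte cache lines):

    1 cache line of the card table = 64 card bytes = 64 * 512 bytes = 32 KB of heap

So the dirty bytes of any two objects lying within the same 32 KB stretch of the heap land on the same cache line of the card table, no matter how far apart those objects are within that stretch.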

So each time a write to an object field occurs, the byte of the corresponding card in the card table must be marked as dirty. A useful optimization in single-threaded programs is to do this by simply writing the relevant byte - one blind write per store. The alternative - first checking whether the byte is already set and writing only if it isn't - requires an additional read and a conditional jump, which is slightly slower.
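
As an illustrative sketch of the two barrier variants (an assumption about their shape, written as plain Java for readability - the real HotSpot barrier is machine code emitted by the JIT, and the cardTable, DIRTY and CARD_SHIFT names here are hypothetical):

final class CardMarkSketch {
    static final int CARD_SHIFT = 9;        // 512-byte cards, as described above
    static final byte DIRTY = 0;            // hypothetical "dirty" marker value
    final byte[] cardTable;                 // one byte per 512-byte card of the heap

    CardMarkSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)];
        java.util.Arrays.fill(cardTable, (byte) 1);  // start with every card "clean"
    }

    // Unconditional card mark (the default): a blind write on every reference store.
    void markUnconditional(long fieldAddress) {
        cardTable[(int) (fieldAddress >>> CARD_SHIFT)] = DIRTY;
    }

    // Conditional card mark (-XX:+UseCondCardMark): read first, write only if needed,
    // so the card-table cache line is never written again once the card is dirty.
    void markConditional(long fieldAddress) {
        int card = (int) (fieldAddress >>> CARD_SHIFT);
        if (cardTable[card] != DIRTY) {
            cardTable[card] = DIRTY;
        }
    }
}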

However, this optimization can be catastrophic when multiple processors are writing to memory, because writes to neighbouring cards require writes to neighbouring bytes in the card table. Here the memory areas actually being written to (the array entries above) are not in the same cache line - that would be the usual cause of memory contention. The real reason is that the dirty bytes being written are in the same cache line of the card table.
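
To put hedged numbers on this for the benchmark above (assuming 4-byte compressed references and that the per-thread arrays happen to be laid out next to each other in the young generation - neither of which the program guarantees):

    CASE 1: Foo[1024]  -> 1024 * 4 bytes = 4096 bytes per array -> the slots written by two neighbouring threads are roughly 4096 / 512 = 8 cards apart, so their dirty bytes are ~8 bytes apart in the card table: the same 64-byte cache line.
    CASE 3: Foo[32000] -> 32000 * 4 bytes = 128000 bytes per array -> the written slots are roughly 128000 / 512 = 250 cards apart, so their dirty bytes are ~250 bytes apart: different cache lines.

That would account for both the contention in CASE 1 and why merely enlarging the array in CASE 3 restored scalability (question 4).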

What the above flag does is implement the card-table dirty-byte write by first checking whether the byte is already set, and setting it only if it isn't. This way the memory contention happens only during the first write to a given card - after that, only reads of that cache line occur. Since the cache line is only read, it can be replicated across multiple processors and they don't have to synchronize to read it.

I've observed that this flag increases the running time by some 15-20% in the 1-thread case.

The -XX:+UseCondCardMark flag is explained in this blog post and this bug report.

The relevant concurrency mailing list discussion: Array allocation and access on the JVM.
