Out of memory: Multithreading using HashSet


Question

I have implemented a Java program. It is basically a multithreaded service with a fixed number of threads. Each thread takes one task at a time and creates a HashSet whose size can vary from 10 to more than 20,000 items. At the end of each task, the result is added to a shared List inside a synchronized block.

The problem is that at some point I start getting an OutOfMemoryError. After a bit of research, I found that this error occurs when the GC is busy reclaiming memory, at which point it stops the world and nothing else can execute.

Please give me suggestions for how to deal with such a large amount of data. Is HashSet the right data structure to use? How should I deal with the OutOfMemoryError? One way is to call System.gc(), which is not good either because it would slow down the whole process. Or is it possible to dispose of the "HashSet hsN" after I add it to the shared List?

Please let me know your thoughts and point out wherever I am going wrong. This service is going to do a huge amount of data processing.

Thanks

import java.util.*;
import java.util.concurrent.*;

// Business object that holds the result of one task
public class Location {
    int taskIndex;
    HashSet<Integer> hsN;

    Location(int taskIndex, HashSet<Integer> hsN) {
        this.taskIndex = taskIndex;
        this.hsN = hsN;
    }
}

// Task performed by each worker thread
public class MyTask implements Runnable {
    private final int task;

    MyTask(int task) {
        this.task = task;
    }

    @Override
    public void run() {
        // GiveMeResult (not shown) returns a set of integers whose size varies from 10 to 20,000+
        HashSet<Integer> hsN = GiveMeResult(task);

        // All tasks lock on the same shared list
        synchronized (Main.locations) {
            Main.locations.add(new Location(task, hsN));
        }
    }
}

public class Main {

    private static final int NTHREDS = 8;
    static final List<Location> locations = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        for (int i = 0; i < 216000; i++) {
            executor.execute(new MyTask(i));
        }
        // Accept no new tasks; tasks already in the queue still run
        executor.shutdown();
        // Block until all tasks have finished instead of spinning in an empty loop
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        System.out.println("Finished all threads");
    }
}

Is Java the best choice for such an implementation, or would C# (.NET 4) be better?

Recommended answer

I can see a few issues:


  • You synchronize on the MyTask object, which is created separately for each execution. You should be synchronizing on a shared object, preferably the one that you are modifying, i.e. the locations object.

  • 216,000 runs, multiplied by, say, 10,000 returned objects each, multiplied by a minimum of 12 bytes per Integer object, is about 24 GB of memory. Do you even have that much physical memory available on your computer, let alone available to the JVM?

32-bit JVMs have a heap size limit of less than 2 GB. On a 64-bit JVM, on the other hand, an Integer object takes about 16 bytes, which raises the memory requirements to over 30 GB.

With these numbers it's hardly surprising that you get an OutOfMemoryError...
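The arithmetic behind these figures can be reproduced with a short sketch. The class and method names are made up for illustration, and the 10,000 items per task is the answer's assumed average, not a measured value.

```java
public class MemoryEstimate {
    // Lower bound: bytes needed for the Integer objects alone,
    // ignoring references, HashSet entries and other overhead.
    static long totalBytes(long tasks, long itemsPerTask, long bytesPerItem) {
        return tasks * itemsPerTask * bytesPerItem;
    }

    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long tasks = 216_000;
        long itemsPerTask = 10_000; // assumed average per task
        System.out.printf("12 bytes per Integer: %.1f GiB%n",
                toGiB(totalBytes(tasks, itemsPerTask, 12)));
        System.out.printf("16 bytes per Integer: %.1f GiB%n",
                toGiB(totalBytes(tasks, itemsPerTask, 16)));
    }
}
```

Both results are lower bounds, since they count only the Integer objects themselves.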

PS: If you do have that much physical memory available and you still think that you are doing the right thing, you might want to have a look at tuning the JVM heap size.
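As a rough illustration, the maximum heap is usually raised with the -Xmx launcher option (for example `java -Xmx4g Main`, where the 4g value is only a placeholder). The sketch below, with a made-up class name, prints the limit the running JVM actually reports, which is a quick way to check whether the setting took effect.

```java
public class HeapInfo {
    // Maximum amount of memory the JVM will attempt to use, in bytes.
    // May be Long.MAX_VALUE if there is no inherent limit.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.printf("Max heap: %d MiB%n", maxHeapBytes() / (1024 * 1024));
    }
}
```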

EDIT:

Even with 25 GB of memory available to the JVM, you could still be pushing it:


  • Each Integer object requires 16 bytes on modern 64-bit JVMs.

  • You also need an 8-byte reference that points to it, regardless of which List implementation you are using.

  • If you are using a linked list implementation, each entry also has an overhead of at least 24 bytes for the list entry object.

At best you could hope to store about 1,000,000,000 Integer objects in 25 GB, or half that if you are using a linked list. That means that each task could not produce more than 5,000 (2,500 respectively) objects on average without causing an error.

I am unsure of your exact requirements, but have you considered returning a more compact object? For example, an int[] array produced from each HashSet would need only the minimum of 4 bytes per result, without the object container overhead.
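A minimal sketch of such a conversion is shown here, assuming the task can hand over a primitive array instead of the set. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CompactResult {
    // Copies the set's values into a primitive array:
    // 4 bytes per value plus a single array header.
    static int[] toIntArray(Set<Integer> set) {
        int[] out = new int[set.size()];
        int i = 0;
        for (int value : set) { // auto-unboxing; the set must not contain null
            out[i++] = value;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Integer> hsN = new HashSet<>(Arrays.asList(42, 7, 19));
        int[] compact = toIntArray(hsN);
        System.out.println(compact.length + " values stored as int[]");
    }
}
```

Once the array has been stored, the original HashSet is no longer referenced and becomes eligible for garbage collection.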

EDIT 2:

I just realized that you are storing the HashSet objects themselves in the list. HashSet objects use a HashMap internally, which in turn uses a HashMap.Entry object for each entry. On a 64-bit JVM the entry object consumes about 40 bytes of memory in addition to the stored object:


  • The key reference, which points to the Integer object - 8 bytes.

  • The value reference (in a HashSet every entry points to the same shared dummy object) - 8 bytes.

  • The next entry reference - 8 bytes.

  • The hash value - 4 bytes.

  • The object overhead - 8 bytes.

  • Object padding - 4 bytes.

I.e. for each Integer object you need 56 bytes for storage in a HashSet. With the typical HashMap load factor of 0.75, you should add another 10 or so bytes for the HashMap array references. With 66 bytes per Integer you can only store about 400,000,000 such objects in 25 GB, without taking into account the rest of your application and any other overhead. That's less than 2,000 objects per task...
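The per-task budgets quoted in this answer can be checked with a small sketch. The 25 GB heap and the per-element sizes are the answer's own estimates, and the class and method names are invented for illustration.

```java
public class CapacityEstimate {
    // How many elements fit into the budget at a given cost per element.
    static long maxElements(long budgetBytes, long bytesPerElement) {
        return budgetBytes / bytesPerElement;
    }

    // Average number of elements each task may produce.
    static long perTask(long totalElements, long tasks) {
        return totalElements / tasks;
    }

    public static void main(String[] args) {
        long budget = 25L * 1024 * 1024 * 1024; // assumed 25 GB heap
        long tasks = 216_000;
        // Integer + reference, linked-list entry, HashSet entry, plain int
        long[] bytesPerElement = {24, 48, 66, 4};
        for (long size : bytesPerElement) {
            long total = maxElements(budget, size);
            System.out.printf("%d bytes/element: %d elements, %d per task%n",
                    size, total, perTask(total, tasks));
        }
    }
}
```

The last row shows the difference made by storing plain int values at 4 bytes each.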

EDIT 3:

You would be better off storing a sorted int[] array instead of a HashSet. That array is searchable in logarithmic time for any arbitrary integer and minimizes the memory consumption to 4 bytes per number. Considering the memory I/O, it would also be as fast as (or faster than) the HashSet implementation.
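A sorted array can be queried with the standard Arrays.binarySearch method. The sketch below uses hypothetical class and method names and assumes the values are already available as an int[].

```java
import java.util.Arrays;

public class SortedIntSet {
    // Returns a sorted copy; the input array is left untouched.
    static int[] sortedCopy(int[] values) {
        int[] out = Arrays.copyOf(values, values.length);
        Arrays.sort(out);
        return out;
    }

    // Membership test in O(log n); the array must be sorted ascending.
    static boolean contains(int[] sorted, int value) {
        return Arrays.binarySearch(sorted, value) >= 0;
    }

    public static void main(String[] args) {
        int[] sorted = sortedCopy(new int[] {42, 7, 19, 3});
        System.out.println(contains(sorted, 19)); // present in the array
        System.out.println(contains(sorted, 20)); // absent from the array
    }
}
```

If the values come from a HashSet they are already unique, so the sorted array behaves like a read-only set.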
