Comparing 2 very large ArrayLists in Java

Question

What would be the correct approach when you need to compare two very large ArrayLists with each other?

Both ArrayLists contain 100,000 items, and the program definitely crashes when simply comparing them item by item.

for (CItem c : cItems) {
    for (CItem r : rItems) {
        if (c.getID().equals(r.getID())) {
            Mismatch m = compareItems(c, r);
            if (m != null) {
                mismatches.add(m);
            }
        }
    }
}

Now I'm not 100% sure how the garbage collection works in this situation but the errors we get are:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664) ~[na:1.8.0_73]
at java.lang.String.<init>(String.java:207) ~[na:1.8.0_73]
at java.lang.StringBuilder.toString(StringBuilder.java:407) ~[na:1.8.0_73]

java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181) ~[na:1.8.0_73]
at java.util.ArrayList.grow(ArrayList.java:261) ~[na:1.8.0_73]
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) ~[na:1.8.0_73]
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) ~[na:1.8.0_73]
at java.util.ArrayList.add(ArrayList.java:458) ~[na:1.8.0_73]

So far the possible solutions are

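  • Split each list into chunks of at most x items and compare those multiple lists (somewhat convoluted)
  • Create a new database and query it for each item (this would be very slow and is not feasible right now)
  • Buy 200 GB of RAM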

Any input on this matter would be appreciated.

Recommended answer

If the IDs within each item list are unique, you can put your rItems into a Map with the ID as the key.

Map<Long, CItem> rItemMap = new HashMap<>(rItems.size());
for (CItem r : rItems) {
    rItemMap.put(r.getID(), r);
}

Now you can look up the rItem with the same ID directly:

for (CItem c : cItems) {
    CItem r = rItemMap.get(c.getID());
    if (r != null) {
        Mismatch m = compareItems(c, r);
        if (m != null) {
            mismatches.add(m);
        }
    }
}
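
To make the two snippets above easier to try out, here is a minimal self-contained sketch that combines them. The `Item` class and the `compareItems` helper are hypothetical stand-ins for the question's `CItem`, `Mismatch` and `compareItems` (which are not shown in the question), and the `Long` ID type is assumed from the Map declaration above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapLookupSketch {

    // Hypothetical stand-in for CItem: an ID plus one field to compare.
    static final class Item {
        final Long id;
        final String value;

        Item(Long id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    // Hypothetical stand-in for compareItems: returns a description of the
    // difference, or null when the two items match.
    static String compareItems(Item c, Item r) {
        return c.value.equals(r.value)
                ? null
                : "ID " + c.id + ": " + c.value + " != " + r.value;
    }

    static List<String> findMismatches(List<Item> cItems, List<Item> rItems) {
        // Index rItems by ID in a single pass (assumes IDs are unique).
        Map<Long, Item> rItemMap = new HashMap<>(rItems.size());
        for (Item r : rItems) {
            rItemMap.put(r.id, r);
        }

        // One hash lookup per cItem instead of a scan over all rItems.
        List<String> mismatches = new ArrayList<>();
        for (Item c : cItems) {
            Item r = rItemMap.get(c.id);
            if (r != null) {
                String m = compareItems(c, r);
                if (m != null) {
                    mismatches.add(m);
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<Item> cItems = Arrays.asList(
                new Item(1L, "a"), new Item(2L, "b"), new Item(3L, "c"));
        List<Item> rItems = Arrays.asList(
                new Item(2L, "b"), new Item(3L, "x"), new Item(4L, "d"));
        System.out.println(findMismatches(cItems, rItems)); // [ID 3: c != x]
    }
}
```

With two lists of 100,000 items each, the nested loop from the question performs up to 10^10 ID comparisons, while this version does roughly 100,000 insertions plus 100,000 lookups.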

Even if the IDs are not unique, you can still work with a Map: the value of each Map.Entry is then a List of all items with that ID, so you only have to iterate over those few items instead of the whole list.
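
The non-unique case could look roughly like the following sketch. As before, `Item` and `compareItems` are hypothetical stand-ins for the question's classes, and the `Long` ID type is an assumption; `computeIfAbsent` is available from Java 8 on, which matches the 1.8.0_73 runtime in the stack traces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupedLookupSketch {

    // Hypothetical stand-in for CItem: an ID plus one field to compare.
    static final class Item {
        final Long id;
        final String value;

        Item(Long id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    // Hypothetical stand-in for compareItems: null means "no mismatch".
    static String compareItems(Item c, Item r) {
        return c.value.equals(r.value)
                ? null
                : "ID " + c.id + ": " + c.value + " != " + r.value;
    }

    static List<String> findMismatches(List<Item> cItems, List<Item> rItems) {
        // Group rItems by ID; each map value holds all items sharing that ID.
        Map<Long, List<Item>> rItemsById = new HashMap<>();
        for (Item r : rItems) {
            rItemsById.computeIfAbsent(r.id, k -> new ArrayList<>()).add(r);
        }

        List<String> mismatches = new ArrayList<>();
        for (Item c : cItems) {
            List<Item> candidates = rItemsById.get(c.id);
            if (candidates == null) {
                continue; // no rItem carries this ID
            }
            // Only the few items with the same ID are compared.
            for (Item r : candidates) {
                String m = compareItems(c, r);
                if (m != null) {
                    mismatches.add(m);
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<Item> cItems = Arrays.asList(new Item(1L, "a"), new Item(2L, "b"));
        List<Item> rItems = Arrays.asList(
                new Item(1L, "a"), new Item(1L, "z"), new Item(3L, "q"));
        System.out.println(findMismatches(cItems, rItems)); // [ID 1: a != z]
    }
}
```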

Edit: OutOfMemory

I just saw from your exception that you're using ArrayList. Using LinkedList instead might help, because an ArrayList is backed by a fixed-size array: when that array fills up, a new, larger array is allocated, the data is copied from the old array to the new one, and the old array is then released.

So if you have an array of size 1000 and it is full, a new array of e.g. size 2000 is allocated. At that moment, memory for 3000 slots is required (although the old 1000 are freed shortly after).
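
The 2000 above is only an illustrative figure. As a rough sketch of the same arithmetic with the growth rule that OpenJDK 8's `ArrayList.grow` uses when a single element is added to a full list (new capacity = old capacity + old capacity / 2), the numbers could be worked out as below. The class and method names are made up for this example, and the slots counted here hold references to the items, not the items themselves.

```java
class GrowthSketch {

    // Growth rule from OpenJDK 8's ArrayList.grow for a single add on a
    // full list: the capacity increases by half of the old capacity.
    static int nextCapacity(int oldCapacity) {
        return oldCapacity + (oldCapacity >> 1);
    }

    // Array slots that exist at the same time while the old array is being
    // copied into the new one (old array + new array).
    static int peakSlots(int oldCapacity) {
        return oldCapacity + nextCapacity(oldCapacity);
    }

    public static void main(String[] args) {
        System.out.println(nextCapacity(1000)); // 1500
        System.out.println(peakSlots(1000));    // 2500
    }
}
```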

A LinkedList just allocates memory for every item you add to it (plus memory to point to the next and previous element).
