Comparing 2 very large ArrayLists in Java

Question

What would be the correct approach when you need to compare 2 very large ArrayLists with each other?

Both ArrayLists hold 100,000 items, and simply comparing every item against every other item definitely crashes.

for (CItem c : cItems) {
    for (CItem r : rItems) {
        if (c.getID().equals(r.getID())) {
            Mismatch m = compareItems(c, r);
            if (m != null) {
                mismatches.add(m);
            }
        }
    }
}

I'm not 100% sure how garbage collection works in this situation, but these are the errors we get:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664) ~[na:1.8.0_73]
at java.lang.String.<init>(String.java:207) ~[na:1.8.0_73]
at java.lang.StringBuilder.toString(StringBuilder.java:407) ~[na:1.8.0_73]

java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181) ~[na:1.8.0_73]
at java.util.ArrayList.grow(ArrayList.java:261) ~[na:1.8.0_73]
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) ~[na:1.8.0_73]
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) ~[na:1.8.0_73]
at java.util.ArrayList.add(ArrayList.java:458) ~[na:1.8.0_73]

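Possible solutions at the moment are:

  • Split each list into sublists of at most x items and compare those sublists (rather elaborate)
  • Create a new database and query it for each item (this would be slow and is not feasible right now)
  • Buy 200 GB of RAM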

Any input on this matter would be appreciated.

Recommended answer

If the IDs within each item list are unique, you can put your rItems into a Map with the ID as key.

Map<Long, CItem> rItemMap = new HashMap<>(rItems.size());
for (CItem r : rItems) {
    rItemMap.put(r.getID(), r);
}

Now you can look up the rItem with the same ID directly:

for (CItem c : cItems) {
    CItem r = rItemMap.get(c.getID());
    if (r != null) {
        Mismatch m = compareItems(c, r);
        if (m != null) {
            mismatches.add(m);
        }
    }
}

Even if the IDs are not unique, you can still work with a Map: the value of each Map.Entry is then a List of all items with that ID, so you only have to iterate over those few items instead of the whole list.
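The non-unique case could be sketched roughly as follows. The `CItem` class here is a hypothetical stand-in with only an ID and a value, and `groupById` and `countMismatches` are illustrative helpers that do not appear in the question; the value comparison merely stands in for the asker's `compareItems`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupByIdSketch {

    // Hypothetical stand-in for the question's CItem: just an ID and a value.
    static final class CItem {
        private final Long id;
        private final String value;

        CItem(Long id, String value) {
            this.id = id;
            this.value = value;
        }

        Long getID() { return id; }
        String getValue() { return value; }
    }

    // Groups items by ID so that each ID maps to all items sharing it.
    static Map<Long, List<CItem>> groupById(List<CItem> items) {
        Map<Long, List<CItem>> byId = new HashMap<>();
        for (CItem item : items) {
            byId.computeIfAbsent(item.getID(), k -> new ArrayList<>()).add(item);
        }
        return byId;
    }

    // Counts pairs with equal IDs whose values differ (placeholder for compareItems).
    static int countMismatches(List<CItem> cItems, List<CItem> rItems) {
        Map<Long, List<CItem>> rById = groupById(rItems);
        int mismatches = 0;
        for (CItem c : cItems) {
            // Only the few items sharing c's ID are visited, not the whole list.
            List<CItem> candidates =
                    rById.getOrDefault(c.getID(), Collections.<CItem>emptyList());
            for (CItem r : candidates) {
                if (!c.getValue().equals(r.getValue())) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<CItem> cItems = Arrays.asList(new CItem(1L, "a"), new CItem(2L, "b"));
        List<CItem> rItems = Arrays.asList(
                new CItem(1L, "a"), new CItem(1L, "x"), new CItem(3L, "b"));
        System.out.println(countMismatches(cItems, rItems)); // prints 1
    }
}
```

Building the map is one pass over rItems and each lookup is a hash access, so the work grows roughly with the combined list sizes rather than with their product.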

Edit regarding OutOfMemory

I just saw from your exception that you're using ArrayList. Using LinkedList instead might help, because ArrayList is backed by a fixed-size array: when that array fills up, a new, larger array is allocated, the data is copied from the old array into the new one, and the old array is then released for garbage collection.

So if you have an array of size 1000 and it is full, a new array of, e.g., size 2000 is allocated. At that moment, memory for 3000 items is required (although the 1000 are freed shortly afterwards).
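The arithmetic above can be checked with a small simulation. This is only a sketch: `peakSlots` is a hypothetical helper, and the growth factor is a parameter because the doubling in the example is illustrative (OpenJDK's ArrayList typically grows by about half its current capacity instead).

```java
class GrowthSketch {

    // Returns the largest number of element slots alive at once while
    // appending `items` elements to an array that starts at `initialCapacity`
    // and is enlarged by `factor` whenever it is full.
    static long peakSlots(long initialCapacity, long items, double factor) {
        long capacity = initialCapacity;
        long peak = capacity;
        for (long size = 0; size < items; size++) {
            if (size == capacity) {
                long grown = (long) Math.ceil(capacity * factor);
                // The old and the new array coexist while elements are copied.
                peak = Math.max(peak, capacity + grown);
                capacity = grown;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Adding the 1001st element to a full array of 1000, doubling on growth.
        System.out.println(peakSlots(1000, 1001, 2.0)); // prints 3000
    }
}
```

The peak only counts reference slots in the backing arrays; the items themselves are not copied, only the references to them.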

A LinkedList just allocates memory for each item you add to it (plus references to the next and previous elements), so it never has to copy its contents into one larger block.
