Find duplicates in large file
Question
I have a really large file with approximately 15 million entries. Each line in the file contains a single string (call it a key).
I need to find the duplicate entries in the file using Java. I tried to use a HashMap to detect the duplicate entries, but that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.
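For reference, the in-memory approach described above looks roughly like this (a sketch using a `HashSet` rather than a `HashMap`, since only key membership matters; the class and method names are illustrative). Holding all 15 million keys in memory at once is what exhausts the heap:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class NaiveDedup {
    // Works for small files, but keeps every distinct key in memory,
    // which is what triggers the OutOfMemoryError on a large input.
    static Set<String> findDuplicates(Path input) throws IOException {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Set.add returns false if the key was already present.
                if (!seen.add(line)) duplicates.add(line);
            }
        }
        return duplicates;
    }
}
```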
How can I solve this problem?
I think I could increase the heap space and try it, but I wanted to know whether there is a more efficient solution that does not require tweaking the heap space.
Answer
The key point is that your data will not fit into memory. You can use an external merge sort for this:
Partition the file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).
Merge the chunks, eliminating duplicates again as you merge. Since this is an n-way merge, you can keep just the next k elements from each chunk in memory; once the buffered items for a chunk are depleted (they have already been merged), fetch more from disk.
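The two phases above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the chunk size, file handling, and class names (`ExternalDedup`, `Head`) are all assumptions, and for simplicity the merge buffers one line per chunk (k = 1) in a min-heap rather than k elements. Because each spilled chunk is sorted and internally duplicate-free, any key that appears twice in a row in the merged stream is a duplicate in the original file:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalDedup {

    // Phase 1: read the input in fixed-size chunks, sort each chunk in memory,
    // drop duplicates inside the chunk, and spill it to a sorted temp file.
    static List<Path> sortChunks(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(chunkSize);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == chunkSize) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        return chunks;
    }

    static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(chunk)) {
            String prev = null;
            for (String s : buffer) {
                // After sorting, equal keys are adjacent, so skipping
                // repeats of the previous line removes in-chunk duplicates.
                if (!s.equals(prev)) {
                    w.write(s);
                    w.newLine();
                }
                prev = s;
            }
        }
        return chunk;
    }

    // One open chunk file plus its current (smallest unconsumed) line.
    static class Head {
        String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    // Phase 2: n-way merge via a min-heap of chunk heads; equal consecutive
    // keys in the merged stream are the duplicates.
    static List<String> mergeAndFindDuplicates(List<Path> chunks) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        for (Path p : chunks) {
            BufferedReader r = Files.newBufferedReader(p);
            String first = r.readLine();
            if (first != null) heap.add(new Head(first, r)); else r.close();
        }
        List<String> duplicates = new ArrayList<>();
        String prev = null;
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            if (h.line.equals(prev)
                    && (duplicates.isEmpty() || !duplicates.get(duplicates.size() - 1).equals(h.line))) {
                duplicates.add(h.line);         // record each duplicated key once
            }
            prev = h.line;
            String next = h.reader.readLine();  // refill this chunk's head from disk
            if (next != null) { h.line = next; heap.add(h); } else h.reader.close();
        }
        return duplicates;
    }

    public static void main(String[] args) throws IOException {
        Path input = Files.createTempFile("input", ".txt");
        Files.write(input, List.of("banana", "apple", "cherry", "apple", "date", "banana"));
        List<Path> chunks = sortChunks(input, 3);  // tiny chunk size to force several chunks
        System.out.println(mergeAndFindDuplicates(chunks)); // prints "[apple, banana]"
    }
}
```

Memory use is bounded by the chunk size plus one buffered line per chunk, so the heap never has to hold all 15 million keys at once.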