Find duplicates in large file


Problem Description

I have a really large file with approximately 15 million entries. Each line in the file contains a single string (call it a key).

I need to find the duplicate entries in the file using Java. I tried to use a HashMap to detect duplicate entries. Apparently that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.

How can I solve this problem?

I think I could increase the heap space and try again, but I wanted to know whether there are more efficient solutions that don't require tweaking the heap space.

Recommended Answer

The key point is that your data will not fit into memory. You can use an external merge sort for this:

Partition your file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).
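
A minimal sketch of this first phase in Java (the class ChunkSorter, its method names and the chunkSize parameter are illustrative, not part of the original answer): read the input line by line, buffer up to a fixed number of keys, sort the buffer, drop in-chunk duplicates, and write each chunk to a temporary file.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkSorter {

    // Read the input line by line, buffer up to chunkSize keys, sort the buffer,
    // drop duplicates inside the chunk and write it to its own temporary file.
    static List<Path> splitIntoSortedChunks(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>(chunkSize);
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= chunkSize) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer));
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);                      // duplicates are now adjacent
        Path chunk = Files.createTempFile("chunk-", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
            String previous = null;
            for (String key : buffer) {
                if (!key.equals(previous)) {           // keep only the first occurrence
                    writer.write(key);
                    writer.newLine();
                }
                previous = key;
            }
        }
        return chunk;
    }
}

A chunk size of a few hundred thousand to a couple of million keys is usually comfortable within a default heap, but the right value depends on key length and available memory.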

Merge the chunks and again eliminate duplicates while merging. Since you will have an n-way merge here, you can keep just the next k elements from each chunk in memory; once the buffered items of a chunk are depleted (they have already been merged), grab more from disk.
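
The merge phase could look roughly like the sketch below (again illustrative; ChunkCursor and mergeChunks are made-up names). A PriorityQueue keyed on each chunk's current line always yields the globally smallest key, so equal keys from different chunks come out back to back and can be skipped; the BufferedReader on each chunk plays the role of the "next k elements" buffer, refilling from disk as lines are consumed.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ChunkMerger {

    // Wraps one sorted chunk file; only the current line is held here,
    // the BufferedReader buffers the next block of the file for us.
    private static final class ChunkCursor {
        final BufferedReader reader;
        String current;

        ChunkCursor(Path chunk) throws IOException {
            reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
            current = reader.readLine();
        }

        void advance() throws IOException {
            current = reader.readLine();
            if (current == null) {
                reader.close();                        // this chunk is exhausted
            }
        }
    }

    // N-way merge of the sorted chunk files into one deduplicated output file.
    static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> queue =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        for (Path chunk : chunks) {
            ChunkCursor cursor = new ChunkCursor(chunk);
            if (cursor.current != null) {
                queue.add(cursor);
            }
        }

        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String previous = null;
            while (!queue.isEmpty()) {
                ChunkCursor cursor = queue.poll();     // chunk with the smallest current key
                String key = cursor.current;
                if (!key.equals(previous)) {           // first time this key appears overall
                    out.write(key);
                    out.newLine();
                    previous = key;
                }                                      // equal keys from other chunks are skipped
                cursor.advance();
                if (cursor.current != null) {
                    queue.add(cursor);
                }
            }
        }
    }
}

If you need the duplicate keys themselves rather than a deduplicated file, write out each key that is skipped here, and have the chunk phase report in-chunk duplicates instead of silently dropping them.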
