How to remove duplicate words using Java when words are more than 200 million?


Problem description

I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words/strings. They contain duplicates: roughly 1 duplicate word for every 100 words.

In my second program, I want to read the file. I successfully read the file line by line using a BufferedReader.

Now, to remove duplicates, we can use a Set (and its implementations), but a Set has problems, as described in the following 3 scenarios:

  1. With the default JVM heap size, the Set can hold up to 0.7–0.8 million words before throwing an OutOfMemoryError.
  2. With a 512M JVM heap, the Set can hold up to 5–6 million words before the OOM error.
  3. With a 1024M JVM heap, the Set can hold up to 12–13 million words before the OOM error. After 10 million records have been added to the Set, operations become extremely slow; for example, adding the next ~4,000 records took 60 seconds.
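For reference, the in-memory approach that hits these heap limits can be sketched as follows (one word per line is assumed; the class and method names are illustrative, not from the original post):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Naive in-memory dedup: fine for small inputs, but with ~220 million
// words the Set itself exhausts the heap, as the scenarios above show.
public class InMemoryDedup {
    public static Set<String> uniqueWords(Path file) throws IOException {
        Set<String> unique = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                unique.add(line);  // a duplicate word is silently ignored
            }
        }
        return unique;
    }
}
```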

I have restrictions that prevent me from increasing the JVM heap size further, and I want to remove the duplicate words from the file.

Please let me know if you have any ideas about other ways/approaches to remove duplicate words from such a gigantic file using Java. Many thanks :)

Additional information: my words are basically alphanumeric, and they are IDs that are unique in our system. Hence they are not plain English words.

Recommended answer

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging: just keep the most recent word added to the output in RAM and compare each candidate word against it.
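The approach above can be sketched as an external merge sort with dedup folded into the merge pass. This is a minimal sketch, assuming one word per line; the class name, chunk size, and temp-file handling are illustrative choices, not from the original answer:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalDedup {
    // Max words sorted in memory per chunk; tune to the available heap.
    static final int CHUNK_SIZE = 1_000_000;

    public static void dedup(Path input, Path output) throws IOException {
        List<Path> chunks = new ArrayList<>();
        // Pass 1: read the file in chunks, sort each chunk, spill to a temp file.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= CHUNK_SIZE) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        // Pass 2: k-way merge of the sorted chunks; because the merged stream
        // is globally sorted, duplicates are adjacent, so keeping only the
        // last word written is enough to drop them.
        PriorityQueue<ChunkReader> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkReader r) -> r.head));
        for (Path chunk : chunks) {
            ChunkReader r = new ChunkReader(chunk);
            if (r.head != null) heap.add(r);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            String last = null;  // the only merge state kept in RAM
            while (!heap.isEmpty()) {
                ChunkReader r = heap.poll();
                if (!r.head.equals(last)) {  // emit only if new
                    out.write(r.head);
                    out.newLine();
                    last = r.head;
                }
                if (r.advance()) heap.add(r);
                else r.close();
            }
        }
        for (Path chunk : chunks) Files.deleteIfExists(chunk);
    }

    private static Path writeSortedChunk(List<String> words) throws IOException {
        Collections.sort(words);
        Path tmp = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
            for (String s : words) { w.write(s); w.newLine(); }
        }
        return tmp;
    }

    // Wraps one sorted chunk file; `head` is the next unread word.
    private static final class ChunkReader implements Closeable {
        final BufferedReader reader;
        String head;
        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p);
            head = reader.readLine();
        }
        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }
        public void close() throws IOException { reader.close(); }
    }
}
```

Only one chunk's worth of words plus one buffered line per chunk is ever held in memory, so the heap ceiling no longer depends on the total word count. Note the output is sorted; if the original order matters, a second bookkeeping pass would be needed.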
