How to remove duplicate words using Java when words are more than 200 million?


Problem description

I have a file (size = ~1.9 GB) which contains ~220,000,000 (~220 million) words/strings. They contain duplicates: roughly 1 duplicate word for every 100 words.

In my second program, I want to read the file. I successfully read the file line by line using a BufferedReader.

Now, to remove duplicates, we can use a Set (and its implementations), but a Set has problems, as described in the following 3 scenarios:

  1. With the default JVM heap size, the Set can hold up to 0.7–0.8 million words before throwing an OutOfMemoryError.
  2. With a 512M JVM heap, the Set can hold up to 5–6 million words before the OOM error.
  3. With a 1024M JVM heap, the Set can hold up to 12–13 million words before the OOM error. After 10 million records have been added to the Set, operations become extremely slow; for example, adding the next ~4,000 records took 60 seconds.
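For reference, the in-memory approach that hits these heap limits can be sketched as follows (one word per line is assumed; the class and method names are illustrative, not from the original post):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Naive in-memory dedup: fine for small inputs, but with ~220 million
// words the Set itself exhausts the heap, as the scenarios above show.
public class InMemoryDedup {
    public static Set<String> uniqueWords(Path file) throws IOException {
        Set<String> unique = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                unique.add(line);  // a duplicate word is silently ignored
            }
        }
        return unique;
    }
}
```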

I have restrictions that prevent me from increasing the JVM heap size further, and I want to remove the duplicate words from the file.

Please let me know if you have any ideas about other ways/approaches to remove duplicate words from such a gigantic file using Java. Many thanks :)

Additional information: my words are basically alphanumeric, and they are IDs that are unique in our system. Hence they are not plain English words.

Recommended answer

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging: just keep the most recent word added to the output in RAM and compare each candidate word against it.
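The approach above can be sketched as an external merge sort with dedup folded into the merge pass. This is a minimal sketch, assuming one word per line; the class name, chunk size, and temp-file handling are illustrative choices, not from the original answer:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalDedup {
    // Max words sorted in memory per chunk; tune to the available heap.
    static final int CHUNK_SIZE = 1_000_000;

    public static void dedup(Path input, Path output) throws IOException {
        List<Path> chunks = new ArrayList<>();
        // Pass 1: read the file in chunks, sort each chunk, spill to a temp file.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= CHUNK_SIZE) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }
        // Pass 2: k-way merge of the sorted chunks; because the merged stream
        // is globally sorted, duplicates are adjacent, so keeping only the
        // last word written is enough to drop them.
        PriorityQueue<ChunkReader> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkReader r) -> r.head));
        for (Path chunk : chunks) {
            ChunkReader r = new ChunkReader(chunk);
            if (r.head != null) heap.add(r);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            String last = null;  // the only merge state kept in RAM
            while (!heap.isEmpty()) {
                ChunkReader r = heap.poll();
                if (!r.head.equals(last)) {  // emit only if new
                    out.write(r.head);
                    out.newLine();
                    last = r.head;
                }
                if (r.advance()) heap.add(r);
                else r.close();
            }
        }
        for (Path chunk : chunks) Files.deleteIfExists(chunk);
    }

    private static Path writeSortedChunk(List<String> words) throws IOException {
        Collections.sort(words);
        Path tmp = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
            for (String s : words) { w.write(s); w.newLine(); }
        }
        return tmp;
    }

    // Wraps one sorted chunk file; `head` is the next unread word.
    private static final class ChunkReader implements Closeable {
        final BufferedReader reader;
        String head;
        ChunkReader(Path p) throws IOException {
            reader = Files.newBufferedReader(p);
            head = reader.readLine();
        }
        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }
        public void close() throws IOException { reader.close(); }
    }
}
```

Only one chunk's worth of words plus one buffered line per chunk is ever held in memory, so the heap ceiling no longer depends on the total word count. Note the output is sorted; if the original order matters, a second bookkeeping pass would be needed.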
