Delete duplicate words in a big text file - Java
Question
I have a text file over 50 GB in size, and I want to delete the duplicate words in it. But I have heard that I would need a lot of RAM to load every word from the text file into a HashSet. Can you tell me a good way to delete every duplicate word from the text file? The words are separated by whitespace, like this:
word1 word2 word3 ... ...
Recommended answer
The H2 answer is good, but maybe overkill. All the words in the English language add up to no more than a few MB, so just use a set. You could use this in RAnders00's program:
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {
        // Scanner.next() tokenizes on whitespace, so each distinct word
        // ends up in the set exactly once.
        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
As an estimate of the number of common words, I ran this on War and Peace (from Project Gutenberg):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());
        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
It completed in 0 seconds. Note that you can't use Files.lines unless your huge source file actually has line breaks. With line breaks, it processes the file line by line, so it won't use too much memory.