Delete duplicate words in a big text file - Java


Problem Description

I have a text file with a size of over 50 GB. Now I want to delete the duplicate words. But I have heard that I need a lot of RAM to load every word from the text file into a HashSet. Can you tell me a good way to delete every duplicate word from the text file? The words are separated by whitespace, like this:

word1 word2 word3 ... ... 


Recommended Answer

The H2 answer is good, but maybe overkill. All the words in the English language won't take more than a few MB. Just use a set. You could use this in RAnders00's program.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {

        // Scanner tokenizes on whitespace, so the input is streamed word
        // by word; only the set of distinct words is held in memory.
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        System.out.println("words size " + words.size());
        // Writes each unique word on its own line.
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
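For reference, a minimal way to invoke this (the file paths here are hypothetical):

public static void main(String[] args) {
    // Hypothetical paths; point these at your actual input and output files.
    read50Gigs("words-50gb.txt", "words-deduped.txt");
}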

As an estimate of common words, I added this for War and Peace (from Project Gutenberg):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        // Paths are hardcoded here for the War and Peace test run.
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", "")) // strip punctuation
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());

        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e); // don't swallow the exception silently
    }
}

It completed in 0 seconds. Note that you can't use Files.lines unless your huge source file has line breaks. With line breaks, it processes the file line by line, so it won't use too much memory.
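If the source file has no line breaks at all, one alternative is to stream whitespace-separated tokens with Scanner instead, which reads the input incrementally. A minimal sketch, assuming Java 9+ (for Scanner.tokens()) and hypothetical file paths:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.Set;
import java.util.stream.Collectors;

public class DedupWithoutLineBreaks {
    public static void main(String[] args) throws IOException {
        // Hypothetical paths. Scanner splits on whitespace, so the input
        // needs no line breaks; only the distinct words stay in memory.
        try (Scanner scanner = new Scanner(Paths.get("input.txt"))) {
            Set<String> words = scanner.tokens() // Java 9+: lazy stream of tokens
                    .collect(Collectors.toSet());
            Files.write(Paths.get("deduped.txt"), words); // one word per line
        }
    }
}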

