Deleting duplicate lines in a file using Java

Question

As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-statement to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text-heavy (about 225k lines of text, and around 40 megs). I estimate my current process would take 63 hours! This is definitely not acceptable.

I need an integrated solution for this, however. Preferably in Java. Any ideas? Thanks!

Answer

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

import java.io.*;
import java.util.HashSet;
import java.util.Set;

public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line); // the Set silently ignores duplicates
    }
    reader.close();

    // Rewrite the file with only the unique lines.
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.
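
For illustration, a minimal sketch of that one-line change, assuming the rest of stripDuplicatesFromFile stays exactly as above:

// In stripDuplicatesFromFile, build the Set like this instead, so that
// iteration (and therefore the rewritten file) follows first-occurrence order:
Set<String> lines = new LinkedHashSet<String>(10000);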

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O-bound operation like this one.
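
A sketch of that streaming variant, again ignoring exceptions; the method name and the ".tmp" suffix here are illustrative, not from the original answer:

import java.io.*;
import java.util.HashSet;
import java.util.Set;

public void stripDuplicatesViaTempFile(String filename) throws IOException {
    Set<String> seen = new HashSet<String>(10000);
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename + ".tmp"));
    String line;
    while ((line = reader.readLine()) != null) {
        // add() returns false if the line was already in the set,
        // so each distinct line is written exactly once, in file order.
        if (seen.add(line)) {
            writer.write(line);
            writer.newLine();
        }
    }
    reader.close();
    writer.close();

    // Swap the temp file into place. File.renameTo is platform-dependent
    // when the target already exists, so delete the original first.
    File original = new File(filename);
    original.delete();
    new File(filename + ".tmp").renameTo(original);
}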
