Check Duplicate File Content Using Java


Problem description

We have a 150 GB data folder containing files of every format (doc, jpg, png, txt, etc.). We need to compare the content of all files against each other to find duplicates and, where duplicates exist, print the list of file paths. My first attempt stored all files in an ArrayList<File> and then compared them with FileUtils.contentEquals(file1, file2). This works for a small folder, but for the 150 GB folder it does not show any result. I suspect that storing all files in an ArrayList first is the problem, possibly a JVM heap issue, but I am not sure.

Does anyone have better advice or sample code for handling this amount of data? Please help me.

Recommended answer

Calculate the MD5 hash of each file and store it in a HashMap with the MD5 hash as the key and the file path as the value. Before adding a new file to the HashMap, you can easily check whether a file with that MD5 hash is already present.
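The hashing step can be sketched with the standard library alone. The class name Md5Sketch is hypothetical, and getFileMD5 here is only an assumed stand-in for the helper the answer refers to, not its actual implementation. The file is read through a fixed-size buffer so that memory use stays constant regardless of file size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
    // Hypothetical stand-in for the getFileMD5 helper mentioned in the answer.
    // Returns the MD5 digest of the file's bytes as a lowercase hex string.
    static String getFileMD5(String filepath) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        // Stream the file in chunks instead of loading it whole.
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(Paths.get(filepath))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        // Convert the 16 digest bytes to 32 hex characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Because only one small buffer is held at a time, this approach should avoid the heap pressure that loading or pairwise-comparing many large files can cause.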

The chance of a false match is very small, but if you want, you can use FileUtils.contentEquals to confirm the match.
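For readers without Apache Commons IO on the classpath, the confirmation step might look like the following sketch. It is not FileUtils.contentEquals itself, only a standard-library stand-in written for illustration; the names ContentCompare and sameContent are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ContentCompare {
    // Hypothetical helper: true only if both files hold identical bytes.
    static boolean sameContent(String first, String second) throws IOException {
        Path a = Paths.get(first);
        Path b = Paths.get(second);
        // Files of different length cannot be identical, so skip reading them.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false; // first differing byte found
                }
            }
            // Both streams must be exhausted at the same point.
            return inB.read() == -1;
        }
    }
}
```

Running this check only on pairs whose hashes already match keeps the number of full byte-by-byte comparisons small.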

For example:

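import java.util.HashMap;
import java.util.List;
import java.util.Map;

void findMatchingFiles(List<String> filepaths)
{
    // Maps each MD5 hash to the first file path seen with that hash.
    Map<String, String> hashmap = new HashMap<String, String>();
    for(String filepath : filepaths)
    {
        String md5 = getFileMD5(filepath); // see linked answer
        if(hashmap.containsKey(md5))
        {
             String original = hashmap.get(md5);
             String duplicate = filepath;

             // found a match between original and duplicate
             System.out.println(duplicate + " duplicates " + original);
        }
        else
        {
             // first file with this hash: remember it
             hashmap.put(md5, filepath);
        }
    }
}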

If there are multiple identical files, this will match each of them with the first one, but not all of them with each other. If you want the latter, you can store a map from the MD5 string to a list of file paths instead of just to the first one.
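The list-per-hash variant could be sketched as follows. The names DuplicateGroups and groupByHash are hypothetical, and the hasher parameter is an assumption introduced so the sketch stays self-contained: it stands for any function that maps a file path to its content hash, such as an MD5 helper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class DuplicateGroups {
    // Groups file paths by content hash and keeps only groups with duplicates.
    // "hasher" is a hypothetical parameter: path in, hash string out.
    static Map<String, List<String>> groupByHash(List<String> filepaths,
                                                 Function<String, String> hasher) {
        Map<String, List<String>> byHash = new LinkedHashMap<>();
        for (String filepath : filepaths) {
            String hash = hasher.apply(filepath);
            // Create the list on first sight of a hash, then append the path.
            byHash.computeIfAbsent(hash, k -> new ArrayList<>()).add(filepath);
        }
        // A hash seen only once has no duplicate, so drop it.
        byHash.values().removeIf(paths -> paths.size() < 2);
        return byHash;
    }
}
```

Each remaining entry then lists every path that shares the same hash, so all copies of a file are reported together rather than only pairwise against the first one.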
