Find duplicate files in a directory using LINQ

Question

I'm currently writing a program that mass downloads images from various sources with given parameters from the user.

My issue is that I don't want duplicates to happen. I should point out that I'm dealing with mass downloads of 100 max at a time (not so massive), and that each file has a different name, so simply searching by file name wouldn't work; I need to compare hashes.

Anyways, here's what I've already found:

Directory.GetFiles(FullPath)
    .Select(f => new
        {
            FileName = f,
            FileHash = Encoding.UTF8.GetString(new SHA1Managed().ComputeHash(new FileStream(f, FileMode.Open, FileAccess.Read)))
        })
    .GroupBy(f => f.FileHash)
    .Select(g => new { FileHash = g.Key, Files = g.Select(z => z.FileName).ToList() })
    .SelectMany(f => f.Files.Skip(1))
    .ToList()
    .ForEach(File.Delete);

My issue is that on the "File.Delete" line, I get the oh so famous error that the file is already in use by another process. I think this is because the code above lacks a way to close the FileStream it's using to get the FileHash before deleting the file, but I don't know how to resolve that. Any ideas?

I should also point out that I've tried other solutions, such as the non-LINQ one at https://www.bhalash.com/archives/13544802709, replacing the print function with a delete one: no errors, but it doesn't work.

Thanks in advance, I stay available for any additional information required! :)

Akitake

Recommended answer

You forgot to dispose the FileStream, so the file is still open until the GC collects the object.

You can replace the Select clause with:

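.Select(f =>
{
    // Dispose the stream (and the hash algorithm) so the file is closed before File.Delete runs
    using (var fs = new FileStream(f, FileMode.Open, FileAccess.Read))
    using (var sha1 = SHA1.Create())
    {
        return new
        {
            FileName = f,
            FileHash = BitConverter.ToString(sha1.ComputeHash(fs))
        };
    }
})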

Do NOT use Encoding.UTF8 to encode arbitrary bytes (which a hash is), as the result could be an invalid UTF8 sequence. Use BitConverter.ToString if you must, or better yet: find a different way which does not involve strings.
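
To illustrate the point, here is a minimal sketch (the class and method names are hypothetical, chosen only for this example). Bytes that are not valid UTF-8 are replaced by the U+FFFD replacement character when decoded, so two different byte arrays can end up as the same string, whereas the hex form keeps them distinct:

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class HashTextDemo
{
    // Decodes arbitrary bytes as UTF-8; invalid sequences are replaced, so this is lossy.
    public static string AsUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    // Formats bytes as hyphen-separated hexadecimal, which is lossless.
    public static string AsHex(byte[] bytes) => BitConverter.ToString(bytes);
}
```

For instance, the single bytes 0xFF and 0xFE are both invalid in UTF-8, so AsUtf8 returns the same string for both, while AsHex returns "FF" and "FE". With a real hash this would show up as distinct files being grouped together as duplicates.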

For example, to avoid strings entirely, you could group on the raw hash bytes:

.Select(f => {
    // Same as above, but with:
    // FileHash = SHA1.Create().ComputeHash(fs)
})
.GroupBy(f => f.FileHash, StructuralComparisons.StructuralEqualityComparer)
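
One caveat about the GroupBy call above: as far as I can tell, StructuralComparisons.StructuralEqualityComparer is typed as the non-generic IEqualityComparer, while GroupBy expects an IEqualityComparer<TKey>, so it may not compile as written. A small content-based comparer is one way around that. The sketch below is an assumption on my part, not part of the original answer, and ByteArrayComparer is a hypothetical name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: compares byte arrays by content instead of by reference.
public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public static readonly ByteArrayComparer Instance = new ByteArrayComparer();

    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(byte[] bytes)
    {
        if (bytes == null) return 0;
        unchecked
        {
            // Simple multiplicative combination; equal contents give equal hash codes.
            int hash = 17;
            foreach (byte b in bytes) hash = hash * 31 + b;
            return hash;
        }
    }
}
```

It would then be passed as .GroupBy(f => f.FileHash, ByteArrayComparer.Instance).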

You may use a better approach though: you may group the files by size first, and calculate the hash only if there are multiple files with the same size. That should perform better when there are not many duplicates.
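
A rough sketch of that size-first approach might look like the following. DuplicateFinder, FindDuplicates and HashFile are hypothetical names introduced for this example; it uses the hex-string form of the hash for simplicity and returns the duplicate paths instead of deleting them, so the caller decides what to do with them:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, for illustration only.
public static class DuplicateFinder
{
    // Returns every file in the directory whose content matches another file
    // that is kept; one file per group of identical files is left out of the result.
    public static List<string> FindDuplicates(string directory)
    {
        return Directory.GetFiles(directory)
            // Cheap first pass: files of different sizes cannot be identical.
            .GroupBy(path => new FileInfo(path).Length)
            .Where(sizeGroup => sizeGroup.Count() > 1)
            // Expensive second pass: hash only files that share a size.
            .SelectMany(sizeGroup => sizeGroup
                .GroupBy(path => HashFile(path))
                .SelectMany(hashGroup => hashGroup.Skip(1)))
            .ToList();
    }

    private static string HashFile(string path)
    {
        // Both the stream and the hash algorithm are disposed, so the file is closed on return.
        using (var stream = File.OpenRead(path))
        using (var sha1 = SHA1.Create())
        {
            return BitConverter.ToString(sha1.ComputeHash(stream));
        }
    }
}
```

Calling DuplicateFinder.FindDuplicates(FullPath).ForEach(File.Delete) would then remove the extra copies. Which copy in each group survives depends on the order in which Directory.GetFiles returns the files, which is not guaranteed.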
