Compare adjacent list items

Question

I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:

    public void GetCRCs(List<DupInfo> dupInfos)
    {
        var crc = new Crc32();
        for (int i = 0; i < dupInfos.Count(); i++)
        {
            if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
            {
                dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
            }
        }
    }

My questions are:

  1. How can I compare each entry to its neighbors without the out of bounds error?
  2. Should I be using a loop for this, or is there a better LINQ or other function?

Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.

Recommended answer

"I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it."

The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.
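As for the out-of-range error itself, here is a minimal sketch of a guarded neighbor check, in case you want to keep the original loop. The `DupInfo` class below is a hypothetical stand-in for the one in the question (only `Size` matters here), and `HasSameSizedNeighbor` is a name I made up. The index tests come first, so `&&` short-circuits before the list is ever read out of range.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's DupInfo; only Size is needed here.
public class DupInfo
{
    public string FullName { get; set; } = "";
    public long Size { get; set; }
}

public static class NeighborCheck
{
    // True when the entry at index i has the same size as the previous or next entry.
    // Each index test runs before the element access it protects.
    public static bool HasSameSizedNeighbor(IReadOnlyList<DupInfo> sorted, int i)
    {
        bool matchesPrevious = i > 0 && sorted[i].Size == sorted[i - 1].Size;
        bool matchesNext = i < sorted.Count - 1 && sorted[i].Size == sorted[i + 1].Size;
        return matchesPrevious || matchesNext;
    }

    public static void Main()
    {
        // Already sorted by size, as in the question.
        var sorted = new List<DupInfo>
        {
            new DupInfo { FullName = "a.txt", Size = 10 },
            new DupInfo { FullName = "b.txt", Size = 20 },
            new DupInfo { FullName = "c.txt", Size = 20 },
            new DupInfo { FullName = "d.txt", Size = 30 },
        };

        for (int i = 0; i < sorted.Count; i++)
        {
            Console.WriteLine($"{sorted[i].FullName}: {HasSameSizedNeighbor(sorted, i)}");
        }
    }
}
```

This only decides which files are worth a checksum. It still does not tell you which files match each other, which is why grouping is the better fit.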

I'd suggest this approach:

  1. Use LINQ's .GroupBy to group the files by size, then .Where to keep only the groups with more than one file.
  2. Within those groups, calculate the CRC32 checksum of each file and compare it with the checksums already calculated for that group, adding it to the collection of known checksums if it is new. If you need to know which specific files are duplicates, you could use a dictionary keyed by the checksum (you can achieve this with another GroupBy). Otherwise, a simple list will suffice to detect any duplicates.

The code might look something like this:

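    // Needs System.Collections.Generic, System.IO and System.Linq.
    // Assumes files is an IEnumerable<FileInfo> and Crc32 is the class from the question.
    var crc = new Crc32();

    // Only files that share a size with at least one other file can be duplicates
    var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
                                          .Where(group => group.Count() > 1);

    foreach (var grp in filesSetsWithPossibleDupes)
    {
        var checksums = new List<CRC32CheckSum>(); //or whatever type ComputeChecksum returns
        foreach (var file in grp)
        {
            var currentCheckSum = crc.ComputeChecksum(File.ReadAllBytes(file.FullName));
            if (checksums.Contains(currentCheckSum))
            {
                //Found a duplicate
            }
            else
            {
                checksums.Add(currentCheckSum);
            }
        }
    }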
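If a size group can get large, `List.Contains` scans the whole list on every file. A rough sketch of the same loop with a `HashSet<uint>` instead, run on in-memory byte arrays so it is self-contained. The bitwise `Crc32` method here is my own stand-in for the question's `Crc32` class (whose implementation isn't shown), and `FindRepeats` and the tuple shape are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashSetDupes
{
    // Plain bitwise CRC-32 (reflected polynomial 0xEDB88320), standing in for
    // the Crc32 class from the question, whose implementation is not shown.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int k = 0; k < 8; k++)
            {
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
            }
        }
        return ~crc;
    }

    // Returns the names of files whose checksum was already seen in their size group.
    public static List<string> FindRepeats(IEnumerable<(string Name, byte[] Content)> files)
    {
        var repeats = new List<string>();
        var sizeGroups = files.GroupBy(f => f.Content.Length)
                              .Where(g => g.Count() > 1);

        foreach (var grp in sizeGroups)
        {
            var seen = new HashSet<uint>();
            foreach (var file in grp)
            {
                // HashSet.Add returns false when the value is already present.
                if (!seen.Add(Crc32(file.Content)))
                {
                    repeats.Add(file.Name);
                }
            }
        }
        return repeats;
    }

    public static void Main()
    {
        var files = new List<(string Name, byte[] Content)>
        {
            ("a.txt", new byte[] { 1, 2, 3 }),
            ("b.txt", new byte[] { 1, 2, 3 }),
            ("c.txt", new byte[] { 3, 2, 1 }),
            ("d.txt", new byte[] { 9 }),
        };

        Console.WriteLine(string.Join(", ", FindRepeats(files))); // prints "b.txt"
    }
}
```

Keep in mind that a matching CRC32 only makes two files likely duplicates. If a false positive would be costly, compare the bytes of the matching files before acting on them.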

Or, if you need the specific objects that could be duplicates, the code might look like this:

    // Assumes dupInfos is the question's List<DupInfo> with CheckSum already filled in.
    // DupStats is a hypothetical key type holding a file size and a checksum.
    var filesSetsWithPossibleDupes = dupInfos.GroupBy(f => f.Size)
                                             .Where(grp => grp.Count() > 1);

    var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
    //A dictionary keyed by the basic duplicate stats,
    //and whose value is a collection of the possible duplicates.
    //For lookups by value, DupStats needs to override Equals and GetHashCode.

    foreach (var grp in filesSetsWithPossibleDupes)
    {
        var likelyDuplicates = grp.GroupBy(dup => dup.CheckSum)
                                  .Where(g => g.Count() > 1);
        //Same GroupBy logic, but applied to the checksum (instead of file size)

        foreach (var dupGrp in likelyDuplicates)
        {
            //Create the key for the dictionary (your code is likely different)
            var sample = dupGrp.First();
            var key = new DupStats() { FileSize = sample.Size, Checksum = sample.CheckSum };
            masterDuplicateDict.Add(key, dupGrp);
        }
    }
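If you'd rather compute the checksum as part of the grouping instead of filling it in beforehand, the two GroupBy steps can be chained into a single query. This is only a sketch: `FileEntry` is a hypothetical stand-in for `DupInfo`, `checksumOf` is a delegate you would point at your real CRC32 call, and the checksums in `Main` are invented values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's DupInfo.
public class FileEntry
{
    public string FullName { get; set; } = "";
    public long Size { get; set; }
}

public static class DuplicateGroups
{
    // Groups by size first, so checksumOf only runs for files whose size is
    // shared with at least one other file. Then groups those by checksum.
    public static List<List<FileEntry>> Find(
        IEnumerable<FileEntry> files,
        Func<FileEntry, uint> checksumOf)
    {
        return files
            .GroupBy(f => f.Size)
            .Where(sizeGroup => sizeGroup.Count() > 1)
            .SelectMany(sizeGroup => sizeGroup.GroupBy(checksumOf))
            .Where(sumGroup => sumGroup.Count() > 1)
            .Select(sumGroup => sumGroup.ToList())
            .ToList();
    }

    public static void Main()
    {
        var files = new List<FileEntry>
        {
            new FileEntry { FullName = "a.txt", Size = 10 },
            new FileEntry { FullName = "b.txt", Size = 20 },
            new FileEntry { FullName = "c.txt", Size = 20 },
            new FileEntry { FullName = "d.txt", Size = 20 },
        };

        // Made-up checksums for illustration; b.txt and d.txt pretend to match.
        // a.txt has no entry on purpose: its size is unique, so it is never looked up.
        var fakeSums = new Dictionary<string, uint>
        {
            ["b.txt"] = 111,
            ["c.txt"] = 222,
            ["d.txt"] = 111,
        };

        foreach (var group in Find(files, f => fakeSums[f.FullName]))
        {
            Console.WriteLine(string.Join(", ", group.Select(f => f.FullName))); // prints "b.txt, d.txt"
        }
    }
}
```

With real files, `checksumOf` could be something like `f => crc.ComputeChecksum(File.ReadAllBytes(f.FullName))`, using the `Crc32` class from the question.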

This demonstrates the idea.
