Please help me with these 2 loops, to remove duplicates


Problem Description

arrfiles = Directory.GetFiles(@"C:\", "*.txt", SearchOption.AllDirectories); // now I have an array of strings with all the paths

// Error here: I have the arrfiles string array with all the file paths, and I want to add to a list only the files with the same name. Something is wrong in these 2 loops:

for (int i = 0; i < arrfiles.Length; i++) {
    for (int j = 1; j < arrfiles.Length; j++) {
        ...
    }
}

Recommended Answer

Your problem is that your loops don't work the way you think they do.
Your inner loop always starts at one, so it always checks the same list of files.
What happens when i == 1? The first item you check it against is j == 1, which is the file itself. So naturally, it matches - and every file except the first will find at least one "duplicate".

Why not use a better structure?
string[] files = Directory.GetFiles(@"C:/example", "*.txt", SearchOption.AllDirectories);
Dictionary<string, string> unique = new Dictionary<string, string>();
foreach (string path in files)
{
    string file = Path.GetFileName(path);
    if (!unique.ContainsKey(file))
    {
        unique.Add(file, path);
    }
}
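For what it's worth, here is a minimal self-contained sketch (my own illustration, not part of the answer above) of how the same dictionary idea can be flipped around to collect the duplicate paths the OP is after. The class name and sample paths are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DuplicateSketch
{
    // Map each file name to the first path seen with it; every later
    // occurrence of an already-seen name is recorded as a duplicate.
    public static List<string> FindDuplicates(string[] files)
    {
        var firstSeen = new Dictionary<string, string>();
        var duplicates = new List<string>();
        foreach (string path in files)
        {
            string name = Path.GetFileName(path);
            if (firstSeen.ContainsKey(name))
                duplicates.Add(path);      // later occurrence: a duplicate
            else
                firstSeen.Add(name, path); // first occurrence: remember it
        }
        return duplicates;
    }

    public static void Main()
    {
        string[] sample =
        {
            "D:/Temp/myFile.txt",
            "D:/Temp/other.txt",
            "D:/Temp/New folder/myFile.txt",
        };
        // Only the second myFile.txt is reported
        Console.WriteLine(string.Join(", ", FindDuplicates(sample)));
    }
}
```

Like the answers in this thread, this treats two files as duplicates purely by name, not by content.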





Updated code from OP:

for (int i = 0; i < arrfiles.Length; i++) {
    for (int j = 0; j < arrfiles.Length - 2; j++) {
        fi = new FileInfo(arrfiles[i]);  // Console.WriteLine("arrfiles[i]" + fi.Length.ToString());
        fi2 = new FileInfo(arrfiles[j]); // Console.WriteLine("arrfiles[j]" + fi2.Length.ToString());
        Console.ReadLine();
        if (Path.GetFileName(arrfiles[i]) == Path.GetFileName(arrfiles[j]) && i != j)
        {
            arrfilenames.Add(arrfiles[j]); // here this list must include all duplicates
        }
    }
}



The first thing that springs to mind is that it doesn't find all duplicates:

Inputs:
   D:\Temp\ContactsShort.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
   D:\Temp\Tip - enum in combobox.txt
   D:\Temp\New folder\ListSQL Server databases.txt
   D:\Temp\New folder\myFile.txt
   D:\Temp\New folder\myFileCopied.txt
   D:\Temp\New folder\MyLargeTextFile.txt
Outputs:
   D:\Temp\New folder\ListSQL Server databases.txt
   D:\Temp\New folder\myFile.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt

It has missed the "New folder" version of two of the files. There is also the problem that if you give it three files with the same name, it reports four...

Secondly, it's not very efficient - you keep making the same comparisons over and over again. Why start with j at zero? You have already checked all the combinations before the current value of i anyway.
So start the inner loop later, and throw away the test for i and j being the same. Then report both files when you find a match. (I have thrown out the FileInfo stuff, just for clarity.)

for (int i = 0; i < files.Length; i++)
{
    for (int j = i + 1; j < files.Length; j++)
    {
        if (Path.GetFileName(files[i]) == Path.GetFileName(files[j]))
        {
            arrfilenames.Add(files[i]);
            arrfilenames.Add(files[j]);
        }
    }
}

You can also end the outer loop one iteration earlier without losing data, if you want to.
As it is, this assumes a single duplication, and will always report pairs of files together (so it has the same second problem as yours). It also has some inefficiency built in. The solution is to remove entries once you have reported them:

for (int i = 0; i < files.Length - 1; i++)
{
    bool reportNeeded = false;
    string checkFile = Path.GetFileName(files[i]);
    for (int j = i + 1; j < files.Length; j++)
    {
        if (files[j] != null && Path.GetFileName(files[j]) == checkFile)
        {
            reportNeeded = true;
            arrfilenames.Add(files[j]);
            files[j] = null;
        }
    }
    if (reportNeeded)
    {
        arrfilenames.Add(files[i]);
    }
}

(I renamed the folders to make it a bit more readable.)

Inputs:
   D:\Temp\ContactsShort.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\myFile.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\MyLargeTextFile.txt
   D:\Temp\Tip - enum in combobox.txt
   D:\Temp\F 1\ListSQL Server databases.txt
   D:\Temp\F 1\myFile.txt
   D:\Temp\F 1\myFileCopied.txt
   D:\Temp\F 1\MyLargeTextFile.txt
   D:\Temp\F 2\ListSQL Server databases.txt
   D:\Temp\F 2\myFile.txt
   D:\Temp\F 2\myFileCopied.txt
   D:\Temp\F 2\MyLargeTextFile.txt
Outputs:
   D:\Temp\F 1\ListSQL Server databases.txt
   D:\Temp\F 2\ListSQL Server databases.txt
   D:\Temp\ListSQL Server databases.txt
   D:\Temp\F 1\myFile.txt
   D:\Temp\F 2\myFile.txt
   D:\Temp\myFile.txt
   D:\Temp\F 1\myFileCopied.txt
   D:\Temp\F 2\myFileCopied.txt
   D:\Temp\myFileCopied.txt
   D:\Temp\F 1\MyLargeTextFile.txt
   D:\Temp\F 2\MyLargeTextFile.txt
   D:\Temp\MyLargeTextFile.txt


With the following code you have everything in hand to handle duplicates, uniques, removal, etc.

private static Dictionary<string, List<string>> GetFiles(string rootPath, string filePattern)
{
    var rawdata = Directory.GetFiles(rootPath, filePattern, SearchOption.AllDirectories);
    Dictionary<string, List<string>> files = new Dictionary<string, List<string>>();
    foreach (var path in rawdata)
    {
        string name = Path.GetFileName(path).ToLower();
        List<string> entries;
        if (files.TryGetValue(name, out entries)) entries.Add(path);
        else files.Add(name, new List<string>() { path });
    }
    return files;
}



E.g. you could run this main program to test it:

static void Main(string[] args)
{
    var files = GetFiles(@"c:\examples", "*.txt");
    var duplicates = from e in files.Values where e.Count > 1 select e;
    var uniques = from e in files.Values where e.Count == 1 select e;

    Func<int, List<string>, int> print = (a, e) =>
      { Console.WriteLine("{0}.\t{1}", ++a, string.Join("\n\t", e)); return a; };

    int pos = 0;
    Console.WriteLine("Duplicates:");
    pos = duplicates.Aggregate(pos, print);
    Console.WriteLine("Unique entries:");
    pos = uniques.Aggregate(pos, print);
}



Cheers,
Andi


I hope the code is self-explanatory and that it solves the problem.
string[] searchLocations = new string[] { @"C:\" };
string searchFilter = "*.txt";

// Since the number of files can range from a few hundred to hundreds of thousands,
// a sorted list offers fast key lookups (binary search over the sorted keys)
SortedList<string, List<FileInfo>> searchedList = new SortedList<string, List<FileInfo>>();
// Since there will be fewer duplicates, a sorted list is not used here;
// use SortedSet<string> if required
List<string> duplicatesList = new List<string>();

foreach (string location in searchLocations)
{
  if (Directory.Exists(location))
  {
    try
    {
      string[] matchedFiles = Directory.GetFiles(location, searchFilter, SearchOption.AllDirectories);
      if ((matchedFiles != null) && (matchedFiles.Length > 0))
      {
        foreach (string currentFile in matchedFiles)
        {
          if (File.Exists(currentFile))
          {
            // Core logic: group all the files by file name while keeping track of duplicates.
            // Solution: Check whether a key already exists with the current file name.
            //           If the key exists, this is a duplicate, so mark the file name as such.
            //           If not, create a List<FileInfo> under the file name.
            //           Then add the FileInfo object of the current file to the list under its
            //           file name, regardless of whether it is a duplicate.
            //           ----------------------------------------------
            //           When the loop is complete, you have a list of duplicate file names and
            //           all files grouped by file name. Deal with them as you please :)

            FileInfo currentFileInfo = new FileInfo(currentFile);
            if (searchedList.ContainsKey(currentFileInfo.Name))
            {
              if (!duplicatesList.Contains(currentFileInfo.Name))
                duplicatesList.Add(currentFileInfo.Name);
            }
            else
              searchedList.Add(currentFileInfo.Name, new List<FileInfo>());

            searchedList[currentFileInfo.Name].Add(currentFileInfo);
          }
        }

        // Drop the reference so the GC can reclaim the array; we no longer need it
        matchedFiles = null;
      }
    }
    catch
    {
      // Directory.GetFiles() method can throw loads of exceptions
      // Do the necessary handling as fit :)
    }
  }
}

foreach (string fileName in duplicatesList)
{
  // The list of duplicates for that file name
  List<FileInfo> duplicates = searchedList[fileName];
}

