Removing duplicates from a notepad file


Problem Description


Let us say you have a notepad file containing the following lines. I have to find the duplicates, and I have achieved this partially: my program below works and prints the result to the console. If you notice, "user1, user2" is repeated twice and should be removed, which it does. However, I also have to handle another scenario: it has to remove "user2, user1" as well, which it does not do.

user1, user2
user3, user1
user1, user2
user5, user6
user2, user1


Below is the program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;

namespace ex
{
    class Program
    {
        static void Main(string[] args)
        {
            string path = @"C:\Users\Documents\Visual Studio 2010\Friends.txt";

            List<string> lines = new List<string>();
            string line;

            using (StreamReader sr = new StreamReader(path))
            {
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }

            // Distinct() removes exact duplicates only, so "user2, user1"
            // is not treated as a duplicate of "user1, user2".
            List<string> removingduplicates = lines.Distinct().ToList();

            foreach (string item in removingduplicates)
            {
                Console.WriteLine(item);
            }
        }
    }
}

Solution

If you have to handle "user1, user2" as matching "user2, user1", then you will have to be a bit more constructive.

But...this is your homework, so no code!

Start by reading your lines, and using Split to "break" them into a left-of-the-comma and a right-of-the-comma part. Use Trim to remove any miscellaneous spaces.
Sort the parts so they are always in the same order.
Rebuild your strings, using string.Join to add the comma and space back in.
Now you can remove your duplicates.
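The steps above can be sketched as follows. This is only a minimal illustration, not the asker's finished homework: the `Normalize` helper name is an assumption, and the question's sample lines stand in for the file.

```csharp
using System;
using System.Linq;

class NormalizeDemo
{
    // Normalize "user2, user1" and "user1, user2" to the same key:
    // split on the comma, Trim each part, sort, then rejoin with ", ".
    public static string Normalize(string line)
    {
        var parts = line.Split(',')
                        .Select(p => p.Trim())
                        .OrderBy(p => p, StringComparer.Ordinal)
                        .ToArray();
        return string.Join(", ", parts);
    }

    static void Main()
    {
        // Sample lines from the question, in place of reading the file.
        string[] lines =
        {
            "user1, user2",
            "user3, user1",
            "user1, user2",
            "user5, user6",
            "user2, user1"
        };

        // Once every line is normalized, plain Distinct() is enough.
        foreach (var item in lines.Select(Normalize).Distinct())
            Console.WriteLine(item);
        // prints:
        // user1, user2
        // user1, user3
        // user5, user6
    }
}
```

Note that this rewrites each line into its sorted form, so "user3, user1" comes out as "user1, user3"; if the original spelling must be preserved, the comparer approach below the next answer section is the better fit.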


Another option, similar to Griff's, would be to process line by line without loading the whole file into memory*:
Use File.ReadLines(path) to get an IEnumerable<string> for the input
Pass that through .Distinct(IEqualityComparer<string>), which gives another IEnumerable<string> for the output.
Then you can use File.WriteAllLines(path, IEnumerable<string>) to make an output file, or use a foreach loop to write all the lines to the Console.
So now the exercise is to write a small class that implements IEqualityComparer<string>. This can split the string into the parts and use whatever a priori information you may have about them to check if they match (and ensure matching inputs have the same HashCode).
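One way that comparer class might look is sketched below. The class name `PairComparer` is an assumption, and an in-memory array stands in for `File.ReadLines(path)` so the sketch is self-contained; the key point is that `Equals` and `GetHashCode` both work on the sorted, trimmed parts, so matching inputs hash identically.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Treats "a, b" and "b, a" as equal by comparing the
// order-insensitive set of comma-separated parts.
class PairComparer : IEqualityComparer<string>
{
    static string[] Parts(string line) =>
        line.Split(',')
            .Select(p => p.Trim())
            .OrderBy(p => p, StringComparer.Ordinal)
            .ToArray();

    public bool Equals(string x, string y) =>
        Parts(x).SequenceEqual(Parts(y));

    // Matching inputs must produce the same hash code, so hash the
    // sorted parts rather than the raw line.
    public int GetHashCode(string line)
    {
        int hash = 17;
        foreach (var part in Parts(line))
            hash = hash * 31 + part.GetHashCode();
        return hash;
    }
}

class ComparerDemo
{
    static void Main()
    {
        // File.ReadLines(path) would stream the real file lazily;
        // an array stands in here.
        var lines = new[]
        {
            "user1, user2", "user3, user1", "user1, user2",
            "user5, user6", "user2, user1"
        };

        // Distinct keeps the first spelling it sees of each pair.
        foreach (var line in lines.Distinct(new PairComparer()))
            Console.WriteLine(line);
        // prints:
        // user1, user2
        // user3, user1
        // user5, user6
    }
}
```

Unlike the normalize-and-rewrite approach, this preserves each line exactly as it first appeared in the input.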

There are a couple of other optimizations I can think of, but I'll leave those as "exercises".

* The .Distinct() does internally build a representation that collects one entry for each unique string, but this is (potentially) much smaller than the whole file, and definitely smaller than holding both the whole-file collection and the .Distinct() internal representation in memory at once.

