What is a fast way to find duplicate rows in a CSV file?

Question

I have a csv file containing 1,500,000 records and need to find the duplicate rows in it. I am trying the code below:

DataTable dtUniqueDataView = dvDataView.ToTable(true, Utility.GetHeadersFromCsv(csvfilePath).Select(c => c.Trim()).ToArray());

But this does not give me the duplicate records, and the operation takes nearly 4 minutes. Can anyone suggest a process that reduces the time and returns the duplicate result set?
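For reference, one way to pull out the duplicate rows from the loaded DataTable (rather than the distinct rows that ToTable(true, ...) returns) is to group the rows by their joined field values and keep the groups with more than one member. A minimal sketch, assuming the usual System.Linq and System.Data.DataSetExtensions namespaces are available; the string.Join key is my own choice, not part of the question's code:

// Sketch: group rows on all of their field values; groups with more
// than one row are the duplicates.
DataTable dtAll = dvDataView.ToTable();
List<DataRow> duplicateRows = dtAll.AsEnumerable()
    .GroupBy(row => string.Join(";", row.ItemArray)) // key = whole row
    .Where(g => g.Count() > 1)                       // repeated rows only
    .SelectMany(g => g)                              // flatten the groups
    .ToList();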

Answer

I wrote an example with HashSet:

Output (15,000,000 entries in a csv file):

Reading File
File distinct read in 1600,6632 ms

Output (30,000,000 entries in a csv file):

Reading File
File distinct read in 3192,1997 ms

Output (45,000,000 entries in a csv file):

Reading File
File distinct read in 4906,0755 ms

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        string csvFile = "test.csv";
        if (!File.Exists(csvFile)) // Create a test CSV file
            CreateCSVFile(csvFile, 45000000, 15000);
        List<string> distinct = GetDistinct(csvFile); // Returns every line once

        Console.ReadKey();
    }
    static List<string> GetDistinct(string filename)
    {
        Stopwatch sw = new Stopwatch(); // just a timer
        List<HashSet<string>> lines = new List<HashSet<string>>(); // a HashSet offers fast duplicate lookups
        HashSet<string> current = new HashSet<string>(); // the hashset currently being filled
        lines.Add(current); // add the current hashset to the list of hashsets
        sw.Restart(); // just a timer
        Console.WriteLine("Reading File"); // just an output message
        foreach (string line in File.ReadLines(filename))
        {
            try
            {
                if (lines.TrueForAll(hSet => !hSet.Contains(line))) // look for an existing entry in any of the hashsets
                    current.Add(line); // if the line was not found, add it to the current hashset
            }
            catch (OutOfMemoryException) // HashSet throws an exception at ca. 12,000,000 elements
            {
                current = new HashSet<string>() { line }; // the line could not be added above, so add it to a fresh hashset
                lines.Add(current); // add the new hashset to the list of hashsets
            }
        }
        sw.Stop(); // just a timer
        Console.WriteLine("File distinct read in " + sw.Elapsed.TotalMilliseconds + " ms"); // just an output message
        List<string> concatenated = new List<string>(); // create one list of strings out of all the hashsets
        lines.ForEach(set => concatenated.AddRange(set)); // fill the list of strings
        return concatenated; // return the list
    }
    static void CreateCSVFile(string filename, int entries, int duplicateRow)
    {
        using (FileStream fs = File.OpenWrite(filename))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            Random r = new Random();
            string duplicateLine = null;
            string line = "";
            for (int i = 0; i < entries; i++)
            {
                line = r.Next(1, 10) + ";" + r.Next(11, 45) + ";" + r.Next(20, 500) + ";" + r.Next(2, 11) + ";" + r.Next(12, 46) + ";" + r.Next(21, 501);
                sw.WriteLine(line);
                if (i % duplicateRow == 0)
                {
                    if (duplicateLine != null && i < entries - 1)
                    {
                        sw.WriteLine(duplicateLine); // write the remembered line again so the file actually contains duplicates
                        i++;
                    }
                    duplicateLine = line;
                }
            }
        }
    }
}
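Note that GetDistinct returns every line exactly once; it does not report which lines were duplicated. If the duplicate rows themselves are what you need, HashSet<string>.Add returns false when the element is already in the set, so a single pass can collect them. A minimal sketch along the same lines (the FindDuplicates method is mine, not part of the answer above):

static List<string> FindDuplicates(string filename)
{
    HashSet<string> seen = new HashSet<string>();   // every line seen so far
    List<string> duplicates = new List<string>();   // lines seen more than once
    foreach (string line in File.ReadLines(filename))
    {
        if (!seen.Add(line)) // Add returns false for an already-present line
            duplicates.Add(line);
    }
    return duplicates;
}

Like the answer's code, a single HashSet can run out of memory at very large element counts, which is what the list-of-hashsets workaround above is there to absorb.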
