What is a fast way to find the duplicate rows in a CSV file?
Problem description
I have a CSV file containing 1,500,000 records and need to find the duplicate rows in it. I am trying the code below:
DataTable dtUniqueDataView = dvDataView.ToTable(true, Utility.GetHeadersFromCsv(csvfilePath).Select(c => c.Trim()).ToArray());
But this does not give me the duplicate records, and the operation takes nearly 4 minutes. Can anyone suggest a process that reduces the time and returns the duplicate result set?
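For reference, a much lighter approach than building a DataView is to group the raw lines with LINQ. This is a minimal sketch, not the accepted answer's code; it assumes a duplicate row means a byte-identical line, reuses the question's csvfilePath variable, and requires using System.IO; and using System.Linq;:

// Sketch: collect lines that occur more than once, by grouping identical lines.
var duplicates = File.ReadLines(csvfilePath)
                     .GroupBy(line => line)       // group byte-identical lines together
                     .Where(g => g.Count() > 1)   // keep only groups with repeats
                     .Select(g => g.Key)          // one representative per duplicate row
                     .ToList();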
Recommended answer
I wrote an example with a HashSet:
Output (15,000,000 entries in a csv file):
Reading File
File distinct read in 1600,6632 ms
Output (30,000,000 entries in a csv file):
Reading File
File distinct read in 3192,1997 ms
Output (45,000,000 entries in a csv file):
Reading File
File distinct read in 4906,0755 ms
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        string csvFile = "test.csv";
        if (!File.Exists(csvFile)) // Create a test CSV file if none exists
            CreateCSVFile(csvFile, 45000000, 15000);

        List<string> distinct = GetDistinct(csvFile); // Returns every line once
        Console.ReadKey();
    }

    static List<string> GetDistinct(string filename)
    {
        Stopwatch sw = new Stopwatch(); // just a timer
        List<HashSet<string>> lines = new List<HashSet<string>>(); // a HashSet is very fast at detecting duplicates
        HashSet<string> current = new HashSet<string>(); // the hashset currently being filled
        lines.Add(current); // add the current hashset to the list of hashsets
        sw.Restart(); // just a timer
        Console.WriteLine("Reading File"); // just an output message
        foreach (string line in File.ReadLines(filename))
        {
            try
            {
                if (lines.TrueForAll(hSet => !hSet.Contains(line))) // look for an existing entry in any of the hashsets
                    current.Add(line); // if the line was not found, add it to the current hashset
            }
            catch (OutOfMemoryException) // a HashSet throws at roughly 12,000,000 elements
            {
                current = new HashSet<string>() { line }; // the line could not be added above, so put it in a fresh hashset
                lines.Add(current); // add the new hashset to the list of hashsets
            }
        }
        sw.Stop(); // just a timer
        Console.WriteLine("File distinct read in " + sw.Elapsed.TotalMilliseconds + " ms"); // just an output message
        List<string> concatenated = new List<string>(); // create one list of strings out of the hashsets
        lines.ForEach(set => concatenated.AddRange(set)); // fill the list of strings
        return concatenated; // return the list
    }

    static void CreateCSVFile(string filename, int entries, int duplicateRow)
    {
        using (FileStream fs = File.OpenWrite(filename))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            Random r = new Random();
            string duplicateLine = null;
            string line = "";
            for (int i = 0; i < entries; i++)
            {
                line = r.Next(1, 10) + ";" + r.Next(11, 45) + ";" + r.Next(20, 500) + ";" +
                       r.Next(2, 11) + ";" + r.Next(12, 46) + ";" + r.Next(21, 501);
                sw.WriteLine(line);
                if (i % duplicateRow == 0)
                {
                    if (duplicateLine != null && i < entries - 1)
                    {
                        sw.WriteLine(duplicateLine); // write the remembered line again so the file really contains duplicates
                        i++;
                    }
                    duplicateLine = line;
                }
            }
        }
    }
}
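A note on the design: the list of hashsets exists only to work around the OutOfMemoryException that a single HashSet can throw at very large element counts; each new hashset picks up where the full one stopped, and TrueForAll checks all of them. Also, since GetDistinct returns each line once rather than the duplicates the question asked for, here is a hedged one-pass variant (a sketch, not part of the original answer) that exploits the fact that HashSet<string>.Add returns false when the item is already present:

// Sketch: collect the duplicate rows in a single pass over the file.
static List<string> GetDuplicates(string filename)
{
    HashSet<string> seen = new HashSet<string>();
    List<string> duplicates = new List<string>();
    foreach (string line in File.ReadLines(filename))
    {
        if (!seen.Add(line))        // Add returns false => line was seen before
            duplicates.Add(line);   // record the duplicate occurrence
    }
    return duplicates;
}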