Merge CSV files up to 50GB in size


Question

I have to merge two CSV files of 50GB each using .NET. Please help me find a fast approach that takes less than 5 minutes.






What I have tried:

using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Note: despite its name, sourceFolder holds the path of the source file.
        string sourceFolder = @"D:\SingleBlockDataDump_June.csv";
        string destinationFile = @"D:\SingleBlockDataDump_July.csv";
        string logFilePath = @"D:\log.txt";
        // string[] filePaths = Directory.GetFiles(sourceFolder, "CSV_File_Number?.csv");
        StreamWriter fileDest = new StreamWriter(destinationFile, true); // true = append

        //int i=1;
        //for (i = 0; i < filePaths.Length; i++)
        {
            //string file = filePaths[i];

            // Loads the entire source file into memory at once.
            string[] lines = File.ReadAllLines(sourceFolder); //File.ReadAllLines(file);

            //if (i > 0)
            //{
            //lines = lines.Skip(1).ToArray(); // Skip header row for all but first file
            lines = lines.ToArray();
            //}
            TimeSpan startTime = DateTime.Now.TimeOfDay;
            string logText = "Started to merge: " + startTime + Environment.NewLine;

            foreach (string line in lines)
            {
                fileDest.WriteLine(line);
            }
            TimeSpan endTime = DateTime.Now.TimeOfDay;
            logText += "Finished merging: " + endTime + Environment.NewLine;
            //TimeSpan duration = DateTime.Parse(endTime).Subtract(DateTime.Parse(startTime));
            logText += "Elapsed Time: " + (endTime - startTime);
            using (StreamWriter writetext = new StreamWriter(logFilePath))
            {
                writetext.WriteLine(logText);
            }
            Console.ReadLine();
        }

        fileDest.Close();
    }
}

Recommended answer

CSV files are text files with a header line, so if both files have the same structure, just append the second file to the first while skipping the second file's header (its first line).


Don't read line by line (and don't use ReadAllLines either) if speed matters.

Allocate a large byte array to be used for copying. Its size should be well below the available free memory to avoid swapping to disk.

For each file, get its size. For every file except the first, read the first line to skip the header and use its length as the offset to the second line, then subtract that offset from the size to get the number of bytes left to copy.
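As a rough illustration of the header-offset step, here is a minimal sketch. The names `CsvHeader` and `FindDataOffset` are hypothetical, and it assumes an ASCII-compatible encoding such as UTF-8 and a header line that contains no quoted line breaks.

```csharp
using System;
using System.IO;

static class CsvHeader
{
    // Returns the byte offset of the first data row, i.e. the number of bytes
    // taken up by the header line including its line terminator.
    // Assumption: the header ends at the first '\n' byte (ASCII-compatible encoding).
    public static long FindDataOffset(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long offset = 0;
            int b;
            // FileStream buffers internally, so ReadByte does not hit the disk per byte.
            while ((b = fs.ReadByte()) != -1)
            {
                offset++;
                if (b == '\n')
                    return offset;
            }
            // No line break found: the whole file is a single line.
            return offset;
        }
    }
}
```

The number of bytes left to copy from that file would then be `new FileInfo(path).Length - CsvHeader.FindDataOffset(path)`.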

Now use a loop for block-wise processing:

  • Determine the block size (the minimum of the buffer size and the remaining size)
  • Read into the buffer
  • Write to the output file
  • Decrement the remaining size by the block size
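The loop above could be sketched roughly as follows. `BlockCopier` and `CopyBytes` are hypothetical names, and the sketch decrements by the number of bytes actually read, since `Stream.Read` may return fewer bytes than requested.

```csharp
using System;
using System.IO;

static class BlockCopier
{
    // Copies exactly `count` bytes from the current position of `source`
    // to `destination`, reusing one caller-supplied buffer.
    public static void CopyBytes(Stream source, Stream destination, long count, byte[] buffer)
    {
        long remaining = count;
        while (remaining > 0)
        {
            // Block size: minimum of buffer size and remaining size.
            int blockSize = (int)Math.Min(buffer.Length, remaining);

            // Read into the buffer; Read may deliver fewer bytes than asked for.
            int read = source.Read(buffer, 0, blockSize);
            if (read == 0)
                throw new EndOfStreamException("Source ended before all bytes were copied.");

            // Write to the output and decrement the remaining size.
            destination.Write(buffer, 0, read);
            remaining -= read;
        }
    }
}
```

To append a second file, one would open the destination with `FileMode.Append`, open the source, set `source.Position` to the header offset, and call `CopyBytes` with the remaining byte count and one buffer shared across all files. The buffer size is a tuning choice to measure on the target machine rather than a fixed recommendation.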

