.NET Performance: Large CSV Read, Remap, Write Remapped


Question

I've done some research and found that the most efficient way for me to read and write multi-gig (+5GB) files is to use something like the following code:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs, 256 * 1024))
using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, 256 * 1024))
using (StreamWriter sw = new StreamWriter(outputFile, true, Encoding.Unicode, 256 * 1024))
{
    string line;

    while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
    {
        //Try to clean csv then split
        line = Regex.Replace(line, "[\\s\\dA-Za-z][\"][\\s\\dA-Za-z]", "");
        string[] fields = Regex.Split(line, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
        //I know there are libraries for this that I will switch out
        //when I have time to create the classes as it seems they all
        //require a mapping class

        //Remap 90-250 properties
        object myObj = ObjectMapper(fields);

        //Write line
        bool success = ObjectWriter(myObj);
    }
}

CPU is averaging around 33% for each of 3 instances on an Intel Xeon 2.67 GHz. I was able to output 2 files in ~26 hrs that were just under 7GB while the process was running 3 instances using:

Parallel.Invoke(
    () => new Worker().DoWork(args[0]),
    () => new Worker().DoWork(args[1]),
    () => new Worker().DoWork(args[2])
);

The third instance is generating a MUCH larger file, +34GB so far, and I'm coming up on day 3, ~67 hrs in.

From what I've read, I think performance may be increased slightly by tuning the buffer size down to a sweet spot.

My questions are:


  1. Based on what is stated, is this typical performance?
  2. Besides what I mentioned above, are there any other improvements you can see?
  3. Are the CSV mapping and reading libraries much faster than regex?


Answer

So, first of all, you should profile your code to identify bottlenecks.

Visual Studio comes with a built-in profiler for this purpose, which can clearly identify hot-spots in your code.

Given that your process is CPU bound, this is likely to prove very effective.
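If a full profiler run isn't convenient, a coarser alternative is to wrap each stage of the loop in a `Stopwatch` and see where the time accumulates. A minimal sketch, with hypothetical sample lines and a trivial stub standing in for `ObjectMapper`:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class StageTimer
{
    static readonly Regex Clean = new Regex("[\\s\\dA-Za-z][\"][\\s\\dA-Za-z]", RegexOptions.Compiled);
    static readonly Regex Fields = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", RegexOptions.Compiled);

    static void Main()
    {
        // Hypothetical sample lines; in the real run this would be sr.ReadLine().
        string[] sample = { "a,b,\"c,d\",e", "1,2,3,4" };

        var cleanTime = new Stopwatch();
        var splitTime = new Stopwatch();
        var mapTime = new Stopwatch();

        foreach (string line in sample)
        {
            cleanTime.Start();
            string cleaned = Clean.Replace(line, "");
            cleanTime.Stop();

            splitTime.Start();
            string[] fields = Fields.Split(cleaned);
            splitTime.Stop();

            mapTime.Start();
            object obj = string.Join("|", fields); // stub for ObjectMapper(fields)
            mapTime.Stop();
        }

        Console.WriteLine($"clean: {cleanTime.Elapsed}, split: {splitTime.Elapsed}, map: {mapTime.Elapsed}");
    }
}
```

Whichever stage dominates after a few million lines is where tuning effort will pay off.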

However, if I had to guess at why it's slow, I would imagine it's because you are not re-using your regexes. A regex is relatively expensive to construct, so re-using it can see massive performance improvements.

var regex1 = new Regex("[\\s\\dA-Za-z][\"][\\s\\dA-Za-z]", RegexOptions.Compiled);
var regex2 = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", RegexOptions.Compiled);
while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
{
    //Try to clean csv then split
    line = regex1.Replace(line, ""); 
    string[] fields = regex2.Split(line);
    //I know there are libraries for this that I will switch out 
    //when I have time to create the classes as it seems they all
    //require a mapping class

    //Remap 90-250 properties
    object myObj = ObjectMapper(fields);

    //Write line
    bool success = ObjectWriter(myObj);
}
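To put a number on the construction cost on your own hardware, a small micro-benchmark can compare a fresh `Regex` per call against one reused compiled instance. (The static `Regex.Split(input, pattern)` helpers the original code calls do keep a small internal cache, so its cost sits somewhere between these two extremes; timings vary by machine, so none are claimed here.)

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class RegexReuseBench
{
    const string Pattern = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    const string Line = "a,b,\"c,d\",e";
    const int Iterations = 100_000;

    static void Main()
    {
        // Fresh Regex object per line: the cost of `new Regex(...)` inside the loop.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            new Regex(Pattern).Split(Line);
        sw.Stop();
        Console.WriteLine($"new per call : {sw.ElapsedMilliseconds} ms");

        // One compiled instance reused across all lines.
        var reused = new Regex(Pattern, RegexOptions.Compiled);
        sw.Restart();
        for (int i = 0; i < Iterations; i++)
            reused.Split(Line);
        sw.Stop();
        Console.WriteLine($"reused       : {sw.ElapsedMilliseconds} ms");
    }
}
```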

However, I would strongly encourage you to use a library like Linq2Csv - it will likely be more performant, as it will have had several rounds of performance tuning, and it will handle edge-cases that your code doesn't.
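As a sketch of what the library route looks like: LINQtoCSV (Linq2Csv) wants one mapping class with a property per column, and `CsvContext` then streams the rows with deferred execution. The `InputRow` class and file names below are hypothetical, and the snippet assumes the LINQtoCSV NuGet package is referenced:

```csharp
using System.Collections.Generic;
using LINQtoCSV; // NuGet: LINQtoCSV

// Hypothetical mapping class: one property per CSV column.
class InputRow
{
    [CsvColumn(FieldIndex = 1)] public string Id { get; set; }
    [CsvColumn(FieldIndex = 2)] public string Name { get; set; }
    // ...remaining 90-250 properties
}

class CsvRemap
{
    static void Main()
    {
        var inDesc = new CsvFileDescription
        {
            SeparatorChar = ',',
            FirstLineHasColumnNames = false
        };

        var cc = new CsvContext();

        // Read<T> is deferred, so rows stream rather than loading the 5GB file at once.
        IEnumerable<InputRow> rows = cc.Read<InputRow>("input.csv", inDesc);

        var outDesc = new CsvFileDescription { SeparatorChar = ',' };
        cc.Write(rows, "output.csv", outDesc);
    }
}
```

The 90-250 property remap would then become a LINQ `Select` from `InputRow` to an output class before the `Write` call.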
