Performance of Writing to File C#


Question

Overview of my situation:

My task is to read strings from a file and reformat them into a more useful format. After reformatting the input, I have to write it to an output file.

Here is an example of what has to be done. Example of a file line:

ANO=2010;CPF=17834368168;YEARS=2010;2009;2008;2007;2006 <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

Each line of this input file carries two important pieces of information: the CPF, which is the document number I will use, and the XML (which represents the result of a database query for that document).

What I must achieve:

Each document in this old format has one XML containing the query results for all the years (2006 to 2010). After reformatting, each input line is converted into 5 output lines:

CPF=17834368168;YEARS=2010; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2009; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2008; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2007; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2006; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

That is one line per year, each containing that year's information about the document. So basically, the output files have 5 times as many lines as the input files.

Performance problem:

Each file has 400,000 lines, and I have 133 files to process.

At the moment, here is the flow of my app:


  1. Open a file
  2. Read a line
  3. Parse it into the new format
  4. Write the line to the output file
  5. Go to 2 until there are no lines left
  6. Go to 1 until there are no files left
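
The steps above can be sketched as follows. This is only an illustration of the described flow, not the actual code (which was not posted): the names `Converter`, `ConvertFile`, `ConvertDirectory` and the `reformat` callback are hypothetical, and the sketch assumes the output directory is different from the input directory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class Converter
{
    // Steps 1-5: streams one input file to one output file, line by line.
    // `reformat` is a hypothetical callback that turns one input line
    // into zero or more output lines.
    public static void ConvertFile(string inputPath, string outputPath,
                                   Func<string, IEnumerable<string>> reformat)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                foreach (string outputLine in reformat(line))
                {
                    writer.WriteLine(outputLine);
                }
            }
        }
    }

    // Step 6: repeats the conversion for every file in inputDir.
    // Assumes outputDir is a different directory from inputDir.
    public static void ConvertDirectory(string inputDir, string outputDir,
                                        Func<string, IEnumerable<string>> reformat)
    {
        Directory.CreateDirectory(outputDir);
        foreach (string inputPath in Directory.EnumerateFiles(inputDir))
        {
            string outputPath = Path.Combine(outputDir, Path.GetFileName(inputPath));
            ConvertFile(inputPath, outputPath, reformat);
        }
    }
}
```

Both streams are opened once per file and disposed by the `using` statements, so the reader and writer buffers are reused for every line rather than recreated.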

Each input file is about 700MB, and it is taking forever to read the files and write the converted versions to new ones. A 400KB file takes ~30 seconds to process.

Additional information:

My machine runs on an Intel i5 processor, with 8GB RAM.

I am not instantiating tons of objects, to avoid memory leaks, and I'm using the using clause when opening the input file.

What can I do to make it run faster?

Recommended answer

I don't know what your code looks like, but here's an example which on my box (admittedly with an SSD and an i7, but...) processes a 400K file in about 50ms.

I haven't even thought about optimizing it - I've written it in the cleanest way I could. (Note that it's all lazily evaluated; File.ReadLines and File.WriteAllLines take care of opening and closing the files.)

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Test
{
    public static void Main()
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        var lines = from line in File.ReadLines("input.txt")
                    let cpf = ParseCpf(line)
                    let xml = ParseXml(line)
                    from year in ParseYears(line)
                    select cpf + year + xml;

        File.WriteAllLines("output.txt", lines);
        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line)
    {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line)
    {
        // The year list runs from just after "YEARS=" to the space before the XML.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';'))
        {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line)
    {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }
}
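
Note that the example above writes the complete XML on every output line, whereas the sample output in the question keeps only the matching year's RESTITUICAO element. If that is required, one possible sketch (not part of the original answer) is to cut the XML into one piece per element using the same string-based approach. The names `YearSplitter` and `SplitXmlPerYear` are hypothetical, and the sketch assumes the elements are not nested and that everything outside them (the XML declaration, the QUERY tags and the STATUS element) should be repeated on every line.

```csharp
using System;
using System.Collections.Generic;

static class YearSplitter
{
    const string OpenTag = "<RESTITUICAO>";
    const string CloseTag = "</RESTITUICAO>";

    // Yields one XML string per <RESTITUICAO> element: the text before the
    // first element, that single element, then the text after the last one.
    public static IEnumerable<string> SplitXmlPerYear(string xml)
    {
        int first = xml.IndexOf(OpenTag, StringComparison.Ordinal);
        int last = xml.LastIndexOf(CloseTag, StringComparison.Ordinal);
        if (first < 0 || last < 0)
        {
            yield break; // no records on this line
        }

        string header = xml.Substring(0, first);
        string footer = xml.Substring(last + CloseTag.Length);

        int start = first;
        while (start >= 0 && start < last)
        {
            // End of the current element, including its closing tag.
            int end = xml.IndexOf(CloseTag, start, StringComparison.Ordinal)
                      + CloseTag.Length;
            yield return header + xml.Substring(start, end - start) + footer;
            start = xml.IndexOf(OpenTag, end, StringComparison.Ordinal);
        }
    }
}
```

If the elements appear in the same order as the YEARS list, as they do in the sample line, the pieces could be paired with the year values using Enumerable.Zip instead of the nested `from year in ParseYears(line)` clause.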
