StreamReader with custom LineBreak - Performance optimisation


Problem description

I had the following problem to solve: we receive files (mostly address information) from different sources. These can be in Windows format with CR/LF ('\r''\n') as the line break, or in UNIX format with LF ('\n').

When reading the text with the StreamReader.ReadLine() method this is no problem, because it handles both cases equally.

The problem occurs when there is a CR or an LF somewhere in the file that is not supposed to be there. This happens, for example, if you export an Excel file whose cells contain line breaks to .CSV or another flat-file format.

Now you have a file that, for example, has the following structure:

FirstName;LastName;Street;HouseNumber;PostalCode;City;Country'\r''\n'
Jane;Doe;co James Doe'\n'TestStreet;5;TestCity;TestCountry'\r''\n'
John;Hancock;Teststreet;1;4586;TestCity;TestCounty'\r''\n'

The StreamReader.ReadLine() method now reads the first line as:

FirstName;LastName;Street;HouseNumber;PostalCode;City;Country

This is fine, but the second line will be:

Jane;Doe;co James Doe

This will either break your code or give you wrong results, as the following line will be:

TestStreet;5;TestCity;TestCountry
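The split described above can be reproduced with a few lines. This is only a sketch: it assumes C# 9 or later (top-level statements) and uses a StringReader over an in-memory copy of the sample instead of a StreamReader over a real file. Both derive from TextReader, and their ReadLine() methods are documented to end a line at '\n', at '\r', or at "\r\n".

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// In-memory copy of the sample file: the second record contains a stray '\n'.
string data =
    "FirstName;LastName;Street;HouseNumber;PostalCode;City;Country\r\n" +
    "Jane;Doe;co James Doe\nTestStreet;5;TestCity;TestCountry\r\n" +
    "John;Hancock;Teststreet;1;4586;TestCity;TestCounty\r\n";

var lines = new List<string>();
using (var reader = new StringReader(data))
{
    // ReadLine() returns null at the end of the input.
    while (reader.ReadLine() is string line)
    {
        lines.Add(line);
    }
}

// Three records went in, but four lines come out.
Console.WriteLine(lines.Count);
foreach (string entry in lines)
{
    Console.WriteLine(entry);
}
```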

So we usually ran the file through a tool that checks for loose '\n' or '\r' characters and deletes them.
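The tool itself is not shown here, so the following is only a guess at what such a clean-up step might look like. RemoveLooseLineBreaks is a hypothetical helper name, and the sketch assumes that the real record terminator is always "\r\n"; on a UNIX file with plain '\n' terminators it would strip every line break.

```csharp
using System;
using System.Text;

// Hypothetical helper: drops every '\r' or '\n' that is not part of a "\r\n" pair.
static string RemoveLooseLineBreaks(string text)
{
    var sb = new StringBuilder(text.Length);
    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n')
        {
            // A genuine record terminator: keep both characters.
            sb.Append("\r\n");
            i++;
        }
        else if (c != '\r' && c != '\n')
        {
            sb.Append(c);
        }
        // Otherwise c is a loose '\r' or '\n' and is skipped.
    }
    return sb.ToString();
}

string dirty = "Jane;Doe;co James Doe\nTestStreet;5;TestCity;TestCountry\r\n";
Console.Write(RemoveLooseLineBreaks(dirty));
```

Note that the loose character is deleted, not replaced, so "Doe" and "TestStreet" end up directly next to each other. Appending a space instead would be an equally plausible choice.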

But this step is easy to forget, so I tried to implement a ReadLine() method of my own. The requirement was that it can use one or two line-break characters, and that those characters can be defined freely by the consuming logic.

This is the class I came up with:

 public class ReadFile
{
    private FileStream file;
    private StreamReader reader;

    private string fileLocation;
    private Encoding fileEncoding;
    private char lineBreak1;
    private char lineBreak2;
    private bool useSeccondLineBreak;

    private bool streamCreated = false;

    private bool endOfStream;

    public bool EndOfStream
    {
        get { return endOfStream; }
        set { endOfStream = value; }
    }

    public ReadFile(string FileLocation, Encoding FileEncoding, char LineBreak1, char LineBreak2, bool UseSeccondLineBreak)
    {
        fileLocation = FileLocation;
        fileEncoding = FileEncoding;
        lineBreak1 = LineBreak1;
        lineBreak2 = LineBreak2;
        useSeccondLineBreak = UseSeccondLineBreak;
    }

    public string ReadLine()
    {
        if (streamCreated == false)
        {
            file = new FileStream(fileLocation, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            reader = new StreamReader(file, fileEncoding);

            streamCreated = true;
        }

        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[1];
        char lastChar = new char();
        char currentChar = new char();

        bool first = true;
        while (reader.EndOfStream != true)
        {
            if (useSeccondLineBreak == true)
            {
                reader.Read(buffer, 0, 1);
                lastChar = currentChar;

                if (currentChar == lineBreak1 && buffer[0] == lineBreak2)
                {
                    break;
                }
                else
                {
                    currentChar = buffer[0];
                }

                if (first == false)
                {
                    builder.Append(lastChar);
                }
                else
                {
                    first = false;
                }
            }
            else
            {
                reader.Read(buffer, 0, 1);

                if (buffer[0] == lineBreak1)
                {
                    break;
                }
                else
                {
                    currentChar = buffer[0];
                }

                builder.Append(currentChar);
            }
        }

        if (reader.EndOfStream == true)
        {
            EndOfStream = true;
        }

        return builder.ToString();
    }

    public void Close()
    {
        if (streamCreated == true)
        {
            reader.Close();
            file.Close();
        }
    }
}

This code works fine and does what it is supposed to do, but it is ~3 times slower than the original StreamReader.ReadLine() method. As we work with large row counts, the difference is not only measurable but also shows up in real-world performance. (For 700'000 rows it takes ~5 seconds to read all lines, extract a chunk and write it to a new file; with my method it takes ~15 seconds on my system.)

I tried different approaches with bigger buffers, but so far I wasn't able to increase performance.

What I would be interested in: any suggestions on how I could improve the performance of this code to get closer to the original performance of StreamReader.ReadLine()?

Update: the reworked version below now takes ~6 seconds for 700'000 rows (compared to ~5 seconds with the default StreamReader.ReadLine()) to do the same work as the code above.

Thanks to Jim Mischel for pointing me in the right direction!

public class ReadFile
    {
        private FileStream file;
        private StreamReader reader;

        private string fileLocation;
        private Encoding fileEncoding;
        private char lineBreak1;
        private char lineBreak2;
        private bool useSeccondLineBreak;

        const int BufferSize = 8192;
        int bufferedCount;
        char[] rest = new char[BufferSize];
        int position = 0;

        char lastChar;
        bool useLastChar;

        private bool streamCreated = false;

        private bool endOfStream;

        public bool EndOfStream
        {
            get { return endOfStream; }
            set { endOfStream = value; }
        }

        public ReadFile(string FileLocation, Encoding FileEncoding, char LineBreak1, char LineBreak2, bool UseSeccondLineBreak)
        {
            fileLocation = FileLocation;
            fileEncoding = FileEncoding;
            lineBreak1 = LineBreak1;
            lineBreak2 = LineBreak2;
            useSeccondLineBreak = UseSeccondLineBreak;
        }
 
        private int readInBuffer()
        {
            return reader.Read(rest, 0, BufferSize);
        }

        public string ReadLine()
        {
            StringBuilder builder = new StringBuilder();
            bool lineFound = false;

            if (streamCreated == false)
            {
                file = new FileStream(fileLocation, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192);

                reader = new StreamReader(file, fileEncoding);

                streamCreated = true;

                bufferedCount = readInBuffer();
            }
            
            while (lineFound == false && EndOfStream != true)
            {
                if (position < bufferedCount)
                {
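                    // Only scan the characters actually read; the tail of a
                    // partially filled buffer still holds stale data.
                    for (int i = position; i < bufferedCount; i++)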
                    {
                        if (useLastChar == true)
                        {
                            // The previous buffer ended with lineBreak1.
                            useLastChar = false;

                            if (rest[i] == lineBreak2)
                            {
                                position = i + 1;
                                lineFound = true;
                                break;
                            }
                            else
                            {
                                builder.Append(lastChar);
                            }
                        }

                        if (rest[i] == lineBreak1)
                        {
                            if (useSeccondLineBreak == true)
                            {
                                if (i + 1 < bufferedCount)
                                {
                                    if (rest[i + 1] == lineBreak2)
                                    {
                                        position = i + 2;
                                        lineFound = true;
                                        break;
                                    }
                                    else
                                    {
                                        builder.Append(rest[i]);
                                    }
                                }
                                else
                                {
                                    useLastChar = true;
                                    lastChar = rest[i];
                                }
                            }
                            else
                            {
                                position = i + 1;
                                lineFound = true;
                                break;
                            }
                        }
                        else
                        {
                            builder.Append(rest[i]);
                        }

                        position = i + 1;
                    }
                    
                }
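                else
                {
                    bufferedCount = readInBuffer();
                    position = 0;

                    if (bufferedCount == 0)
                    {
                        // Nothing left to read: flush a pending lineBreak1 and
                        // stop, otherwise this loop would never terminate.
                        if (useLastChar == true)
                        {
                            builder.Append(lastChar);
                            useLastChar = false;
                        }

                        EndOfStream = true;
                    }
                }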
            }

            if (reader.EndOfStream == true && position == bufferedCount)
            {
                EndOfStream = true;
            }

            return builder.ToString();
        }


        public void Close()
        {
            if (streamCreated == true)
            {
                reader.Close();
                file.Close();
            }
        }
    }

Recommended answer

The way to speed this up is to read more than one character at a time. For example, create a 4 kilobyte buffer, read data into that buffer, and then go through it character by character. If you copy character by character into a StringBuilder, it's pretty easy.

The code below shows how to parse out lines in a loop. You'd have to split this up so that it can maintain state between calls, but it should give you the idea.

const int BufferSize = 4096;
const string newline = "\r\n";

using (var strm = new StreamReader(....))
{
    int newlineIndex = 0;
    var buffer = new char[BufferSize];
    StringBuilder sb = new StringBuilder();
    int charsInBuffer = 0;
    int bufferIndex = 0;

    while (!(strm.EndOfStream && bufferIndex >= charsInBuffer))
    {
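        // Refill once every buffered character has been consumed
        // (this also covers the very first pass, when both values are 0).
        if (bufferIndex >= charsInBuffer)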
        {
            charsInBuffer = strm.Read(buffer, 0, buffer.Length);
            if (charsInBuffer == 0)
            {
                // nothing read. Must be at end of stream.
                break;
            }
            bufferIndex = 0;
        }
        if (buffer[bufferIndex] == newline[newlineIndex])
        {
            ++newlineIndex;
            if (newlineIndex == newline.Length)
            {
                // found a line
                Console.WriteLine(sb.ToString());
                newlineIndex = 0;
                sb = new StringBuilder();
            }
        }
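        else
        {
            if (newlineIndex > 0)
            {
                // copy matched newline characters
                sb.Append(newline.Substring(0, newlineIndex));
                newlineIndex = 0;
            }
            if (buffer[bufferIndex] == newline[0])
            {
                // the mismatching character starts a new newline match
                // (e.g. the second '\r' in "\r\r\n")
                newlineIndex = 1;
            }
            else
            {
                sb.Append(buffer[bufferIndex]);
            }
        }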
        ++bufferIndex;
    }
    // Might be a line left, without a newline
    if (newlineIndex > 0)
    {
        sb.Append(newline.Substring(0, newlineIndex));
    }
    if (sb.Length > 0)
    {
        Console.WriteLine(sb.ToString());
    }
}
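One possible way to keep state between calls is to let the compiler do it with an iterator. The sketch below is not the answer's code, only a variation on the same loop: ReadRecords is a hypothetical name, the code assumes C# 9 or later (top-level statements) and a non-empty newline string, and the restart logic after a failed partial match is only complete for one- and two-character sequences, which is what the question asks for.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical iterator: yields one record at a time, splitting only on the
// exact newline sequence. Its position survives between MoveNext() calls.
static IEnumerable<string> ReadRecords(TextReader reader, string newline, int bufferSize = 4096)
{
    var buffer = new char[bufferSize];
    var sb = new StringBuilder();
    int matched = 0; // number of newline characters matched so far
    int charsInBuffer;

    while ((charsInBuffer = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < charsInBuffer; i++)
        {
            char c = buffer[i];
            if (c == newline[matched])
            {
                matched++;
                if (matched == newline.Length)
                {
                    yield return sb.ToString();
                    sb.Clear();
                    matched = 0;
                }
                continue;
            }
            if (matched > 0)
            {
                // False start: the partial match was ordinary data.
                sb.Append(newline, 0, matched);
                matched = 0;
                if (c == newline[0])
                {
                    // The current character may begin a new match.
                    matched = 1;
                    continue;
                }
            }
            sb.Append(c);
        }
    }

    // A last record without a trailing newline.
    if (matched > 0)
    {
        sb.Append(newline, 0, matched);
    }
    if (sb.Length > 0)
    {
        yield return sb.ToString();
    }
}

string data =
    "Jane;Doe;co James Doe\nTestStreet;5;TestCity;TestCountry\r\n" +
    "John;Hancock;Teststreet;1;4586;TestCity;TestCounty\r\n";

foreach (string record in ReadRecords(new StringReader(data), "\r\n"))
{
    // Make the embedded line feed visible in the console output.
    Console.WriteLine(record.Replace("\n", "<LF>"));
}
```

Because the partial match is tracked in matched and not by peeking at the next buffer position, a "\r\n" that straddles two reads needs no special handling.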

You could optimize this a bit by keeping track of the starting position, so that when you find a line you create a string from buffer[start] to buffer[current] without going through a StringBuilder. Instead you call the String(char[], int32, int32) constructor. That's a little tricky to handle when a line crosses a buffer boundary. You would probably want to treat crossing the buffer boundary as a special case and use a StringBuilder for temporary storage in that case.

I wouldn't bother with that optimization, though, until after I got this first version working.
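For completeness, here is a rough sketch of that start-index idea. To keep it short it is restricted to a single-character line break, so it does not cover the two-character case from the question. ReadRecordsFromBuffer is a hypothetical name and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical iterator for a single-character line break. Records that lie
// completely inside the buffer are created with new string(char[], int, int);
// the StringBuilder is only touched when a record spans two reads.
static IEnumerable<string> ReadRecordsFromBuffer(TextReader reader, char lineBreak, int bufferSize = 4096)
{
    var buffer = new char[bufferSize];
    var carry = new StringBuilder(); // empty unless a record is unfinished
    int charsInBuffer;

    while ((charsInBuffer = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        int start = 0;
        for (int i = 0; i < charsInBuffer; i++)
        {
            if (buffer[i] != lineBreak)
            {
                continue;
            }

            if (carry.Length == 0)
            {
                // Common case: the whole record sits in the buffer.
                yield return new string(buffer, start, i - start);
            }
            else
            {
                // Special case: finish a record that began in an earlier read.
                carry.Append(buffer, start, i - start);
                yield return carry.ToString();
                carry.Clear();
            }
            start = i + 1;
        }

        if (start < charsInBuffer)
        {
            // Unfinished record: park what we have until the next read.
            carry.Append(buffer, start, charsInBuffer - start);
        }
    }

    // A last record without a trailing line break.
    if (carry.Length > 0)
    {
        yield return carry.ToString();
    }
}

foreach (string record in ReadRecordsFromBuffer(new StringReader("a;b\nc;d\ne;f"), '\n'))
{
    Console.WriteLine(record);
}
```

Whether this is actually faster than the StringBuilder version would have to be measured on real files; no timings are claimed here.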
