Need to pick up line terminators with StreamReader.ReadLine()

Problem description

I wrote a C# program to read an Excel .xls/.xlsx file and output it as CSV and as Unicode text. I wrote a separate program to remove blank records. It reads each line with StreamReader.ReadLine(), walks through the string character by character, and skips writing the line to the output if it consists entirely of commas (for the CSV) or entirely of tabs (for the Unicode text).

The trouble starts when the Excel file contains embedded newlines (\x0A) inside cells. I changed my XLS to CSV converter to find these newlines (since it goes cell by cell) and write them as \x0A, while normal line endings are written with StreamWriter.WriteLine().

The problem occurs in the separate program to remove blank records. When I read in with StreamReader.ReadLine(), by definition it only returns the string with the line, not the terminator. Since the embedded newlines show up as two separate lines, I can't tell which is a full record and which is an embedded newline for when I write them to the final file.

I'm not even sure I can read in the \x0A because everything on the input registers as '\n'. I could go character by character, but this destroys my logic to remove blank lines.
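
To make the symptom concrete, here is a minimal sketch that is not part of the original post. `ReadLineDemo` and its two methods are hypothetical names, and a `StringReader` stands in for the file on the assumption that it splits lines the same way `StreamReader` does: `ReadLine()` treats "\n", "\r" and "\r\n" alike as terminators. Reading with `Read()` instead keeps the two characters apart, and in C# `'\n'` and `'\x0A'` are the same character (code 10).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ReadLineDemo
{
    // Reads the text line by line, the way the blank-record remover does today.
    public static List<string> ReadAllLines(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line); // the terminator is already stripped here
            }
        }
        return lines;
    }

    // Reads the text character by character, so '\r' (13) and '\n' (10)
    // remain distinguishable.
    public static List<int> ReadAllCharCodes(string text)
    {
        var codes = new List<int>();
        using (var reader = new StringReader(text))
        {
            int c;
            while ((c = reader.Read()) >= 0)
            {
                codes.Add(c);
            }
        }
        return codes;
    }
}
```

Calling `ReadLineDemo.ReadAllLines("a,x\ny,b\r\n,,\r\n")` yields three strings (`a,x`, `y,b` and `,,`), even though the input holds only two records, which is the ambiguity described above.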

Solution

I would recommend that you change your architecture to work more like a parser in a compiler.

You want to create a lexer that returns a sequence of tokens, and then a parser that reads the sequence of tokens and does stuff with them.

In your case the tokens would be:

  1. Column data
  2. Comma
  3. End of Line

You would treat a '\n' ('\x0a') by itself as an embedded newline, and therefore include it as part of a column data token. A '\r\n' would constitute an End of Line token.

This has the advantages of:

  1. Doing only one pass over the data
  2. Storing at most one line's worth of data
  3. Reusing as much memory as possible (for the string builder and the list)
  4. Being easy to change should your requirements change

Here's a sample of what the Lexer would look like:

Disclaimer: I haven't even compiled, let alone tested, this code, so you'll need to clean it up and make sure it works.

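// Namespaces needed by the lexer and parser below.
using System.Collections.Generic;
using System.IO;
using System.Text;

enum TokenType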
{
    ColumnData,
    Comma,
    LineTerminator
}

class Token
{
    public TokenType Type { get; private set;}
    public string Data { get; private set;}

    public Token(TokenType type)
    {
        Type = type;
    }

    public Token(TokenType type, string data)
    {
        Type = type;
        Data = data;
    }
}

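// Splits the input into tokens. A lone '\n' falls through to the default case,
// so it stays inside the column data; '\r' or "\r\n" ends the record.
private IEnumerable<Token> GetTokens(TextReader s)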
{
   var builder = new StringBuilder();

   while (s.Peek() >= 0)
   {
       var c = (char)s.Read();
       switch (c)
       {
           case ',':
           {
               if (builder.Length > 0)
               {
                   yield return new Token(TokenType.ColumnData, ExtractText(builder));
               }
               yield return new Token(TokenType.Comma);
               break;
           }
           case '\r':
           {
                var next = s.Peek();
                if (next == '\n')
                {
                    s.Read();
                }

                if (builder.Length > 0)
                {
                    yield return new Token(TokenType.ColumnData, ExtractText(builder));
                }
                yield return new Token(TokenType.LineTerminator);
                break;
           }
           default:
               builder.Append(c);
               break;
       }

   }

   if (builder.Length > 0)
   {
       yield return new Token(TokenType.ColumnData, ExtractText(builder));
   }
}

private string ExtractText(StringBuilder b)
{
    var ret = b.ToString();
    b.Remove(0, b.Length);
    return ret;
}
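
The question also mentions a tab-delimited Unicode text output. As a hedged sketch, not part of the answer's code, the same lexer idea could take the delimiter as a parameter. `Kind`, `DelimitedLexer` and `Tokenize` are hypothetical names, and tokens are returned as plain key/value pairs to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

enum Kind { ColumnData, Delimiter, LineTerminator }

static class DelimitedLexer
{
    // Yields (kind, text) pairs; text is empty for delimiters and terminators.
    public static IEnumerable<KeyValuePair<Kind, string>> Tokenize(TextReader reader, char delimiter)
    {
        var builder = new StringBuilder();
        int read;
        while ((read = reader.Read()) >= 0)
        {
            var c = (char)read;
            if (c == delimiter || c == '\r')
            {
                // Flush any pending cell text before the structural token.
                if (builder.Length > 0)
                {
                    yield return new KeyValuePair<Kind, string>(Kind.ColumnData, builder.ToString());
                    builder.Clear();
                }
                if (c == '\r')
                {
                    // Swallow the '\n' of a "\r\n" pair; a bare '\r' also ends the record.
                    if (reader.Peek() == '\n') reader.Read();
                    yield return new KeyValuePair<Kind, string>(Kind.LineTerminator, string.Empty);
                }
                else
                {
                    yield return new KeyValuePair<Kind, string>(Kind.Delimiter, string.Empty);
                }
            }
            else
            {
                // Everything else, including a lone '\n', belongs to the current cell.
                builder.Append(c);
            }
        }
        // Input that ends without a terminator may still hold a final cell.
        if (builder.Length > 0)
        {
            yield return new KeyValuePair<Kind, string>(Kind.ColumnData, builder.ToString());
        }
    }
}
```

For the input `"a\tx\ny\r\n\t\r\n"` with `'\t'` as the delimiter, this produces six tokens: column data `a`, a delimiter, column data `x\ny` (with the embedded newline intact), a line terminator, a delimiter, and a final line terminator.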

Your "parser" code would then look like this:

public void ConvertXLS(TextReader s)
{
    var columnData = new List<string>();
    bool lastWasColumnData = false;
    bool seenAnyData = false;

    foreach (var token in GetTokens(s))
    {
        switch (token.Type)
        {
            case TokenType.ColumnData:
            {
                 seenAnyData = true;
                 if (lastWasColumnData)
                 {
                     //TODO: do some error reporting
                 }
                 else
                 {
                     lastWasColumnData = true;
                     columnData.Add(token.Data);
                 }
                 break;
            }
            case TokenType.Comma:
            {
                if (!lastWasColumnData)
                {
                    columnData.Add(null);
                }
                lastWasColumnData = false;
                break;
            }
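            case TokenType.LineTerminator:
            {
                // Emit the record only if at least one column held data;
                // records made up solely of commas are dropped here.
                if (seenAnyData)
                {
                    OutputLine(columnData);
                }
                seenAnyData = false;
                lastWasColumnData = false;
                columnData.Clear();
                break;
            }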
        }
    }

    if (seenAnyData)
    {
        OutputLine(columnData);
    }
}
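
`OutputLine` is not defined in the answer. One possible shape for it is sketched below, under the assumption that records should always end in "\r\n" so that the lexer can tell them apart from embedded '\n' characters. `RecordWriter` and the extra `writer` and `delimiter` parameters are hypothetical additions, so the call sites above would need to pass them. The terminator is written explicitly because `TextWriter.WriteLine()` uses `Environment.NewLine` by default, which is "\n" on non-Windows platforms.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class RecordWriter
{
    // Writes one record: columns joined by the delimiter, then an explicit "\r\n".
    public static void OutputLine(TextWriter writer, IList<string> columns, char delimiter)
    {
        for (int i = 0; i < columns.Count; i++)
        {
            if (i > 0)
            {
                writer.Write(delimiter);
            }
            // The parser stores empty cells as null; write them as empty text.
            writer.Write(columns[i] ?? string.Empty);
        }
        // Explicit terminator, independent of the platform's default newline.
        writer.Write("\r\n");
    }
}
```

Writing the columns `"a"`, `null` and `"x\ny"` with a comma delimiter produces `a,,x\ny\r\n`: the embedded newline survives as a bare '\n', and only the record boundary carries the '\r'.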
