Dealing with fields containing unescaped double quotes with TextFieldParser


Question

I am trying to import a CSV file using TextFieldParser. A particular CSV file is causing me problems due to its nonstandard formatting. The CSV in question has its fields enclosed in double quotes. The problem appears when there is an additional set of unescaped double quotes within a particular field.

Here is an oversimplified test case that highlights the problem. The actual CSV files I am dealing with are not all formatted the same and have dozens of fields, any of which may contain these possibly tricky formatting issues.

// Requires: using System; using System.IO; using Microsoft.VisualBasic.FileIO;
TextReader reader = new StringReader("\"Row\",\"Test String\"\n" +
    "\"1\",\"This is a test string.  It is parsed correctly.\"\n" +
    "\"2\",\"This is a test string with a comma,  which is parsed correctly\"\n" +
    "\"3\",\"This is a test string with double \"\"double quotes\"\". It is parsed correctly\"\n" +
    "\"4\",\"This is a test string with 'single quotes'. It is parsed correctly\"\n" +
    "5,This is a test string with fields that aren't enclosed in double quotes.  It is parsed correctly.\n" +
    "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"");

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
    }
}

Is there any way to properly parse a CSV with this type of formatting using TextFieldParser?

Recommended answer

I agree with Hans Passant's advice that it is not your responsibility to parse malformed data. However, in keeping with the Robustness Principle, someone faced with this situation may want to handle specific kinds of malformed data. The code below works on the data set given in the question. It catches the parser error raised by the malformed line, checks the first character to decide whether the line is wrapped in double quotes, and then strips the outer quotes and splits the fields manually.

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };

    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
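        catch (MalformedLineException)
        {
            // ErrorLine holds the raw text of the line that failed to parse.
            if (parser.ErrorLine.StartsWith("\""))
            {
                // Drop the outermost quotes, then split on the "," sequence between fields.
                var line = parser.ErrorLine.Substring(1, parser.ErrorLine.Length - 2);
                fields = line.Split(new string[] { "\",\"" }, StringSplitOptions.None);
            }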
            else
            {
                throw;
            }
        }
        Console.WriteLine("This line was parsed as:\n{0},{1}", fields[0], fields[1]);
    }
}

I'm sure it is possible to concoct a pathological example where this fails (e.g. commas adjacent to double-quotes within a field value) but any such examples would probably be unparseable in the strictest sense, whereas the problem line given in the question is decipherable despite being malformed.
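
To make that limitation concrete, here is a minimal sketch that isolates the manual split from the catch block and runs it on two lines. The class name QuotedLineFallback, the method SplitQuotedLine, and the "row 7" input are hypothetical and exist only for illustration; the sketch also assumes the parser rejects such a line so that the fallback is actually reached.

```csharp
using System;

public static class QuotedLineFallback
{
    // Hypothetical helper: the same strip-and-split logic as the catch block,
    // applied to a line that starts and ends with a double quote.
    public static string[] SplitQuotedLine(string line)
    {
        // Remove the outermost pair of double quotes.
        string inner = line.Substring(1, line.Length - 2);
        // Split on the quote-comma-quote sequence that separates wrapped fields.
        return inner.Split(new[] { "\",\"" }, StringSplitOptions.None);
    }

    public static void Main()
    {
        // The problem line from the question: recovered as two fields.
        string ok = "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"";
        Console.WriteLine(SplitQuotedLine(ok).Length);   // prints 2

        // Made-up pathological line: a comma sits between two unescaped quotes
        // inside the second field, so the split produces an extra field.
        string bad = "\"7\",\"He said \"no\",\"yes\" twice\"";
        Console.WriteLine(SplitQuotedLine(bad).Length);  // prints 3
    }
}
```

The second line shows why the fallback is only a heuristic: once a field value itself contains the quote-comma-quote sequence, nothing in the text distinguishes it from a real field boundary.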
