Parsing CSV File enclosed with quotes in C#

Question

I've seen lots of samples for parsing CSV files, but this one is kind of an annoying file...

So how do you parse this kind of CSV?

"1",1/2/2010,"The sample ("adasdad") asdada","I was pooping in the door "Stinky", so I'll be damn","AK"

Recommended answer

The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.
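
For the conventional case, a minimal sketch of TextFieldParser usage might look like the following. The class and method names (ConventionalCsv, ReadRows) are made up for illustration; the parser members used are the ones exposed by Microsoft.VisualBasic.FileIO.TextFieldParser. It assumes well-formed input where quotes only wrap whole fields, so it is not expected to cope with the line in the question.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class ConventionalCsv
{
    // Reads every row of comma-delimited text with TextFieldParser.
    // ConventionalCsv and ReadRows are illustrative names, not library members.
    public static List<string[]> ReadRows(string csvText)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(csvText)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // quoted fields may contain commas

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

Passing a StringReader keeps the sketch independent of the file system; a file path can be given to the TextFieldParser constructor instead.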

The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char-by-char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.

So I rewrote it for this case as a string extension. I think this is close.

Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.

The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.
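
As a rough illustration of the first half of that idea only, the sketch below finds the most common field count produced by a naive split and flags the rows that disagree with it. The names (ColumnCountHeuristic, UsualColumnCount, SuspectLines) are hypothetical, and the landmark check on fixed-length fields such as the state code is deliberately left out.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ColumnCountHeuristic
{
    // Returns the most common field count produced by a naive split on the delimiter.
    // Ties go to the smaller count. Throws on empty input because First() finds no group.
    public static int UsualColumnCount(IEnumerable<string> lines, char delimiter = ',')
    {
        return lines
            .Select(line => line.Split(delimiter).Length)
            .GroupBy(count => count)
            .OrderByDescending(group => group.Count())
            .ThenBy(group => group.Key)
            .First()
            .Key;
    }

    // Returns the zero-based indexes of lines whose naive field count differs from the usual one.
    public static List<int> SuspectLines(IList<string> lines, char delimiter = ',')
    {
        int usual = UsualColumnCount(lines, delimiter);
        var suspects = new List<int>();
        for (int i = 0; i < lines.Count; i++)
        {
            if (lines[i].Split(delimiter).Length != usual)
            {
                suspects.Add(i);
            }
        }
        return suspects;
    }
}
```

Rows flagged this way are only candidates for closer inspection; deciding how to merge their extra pieces back into columns is the part the answer calls crazy logic.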

But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!

Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.

namespace YourFavoriteNamespace 
{
    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class Extensions
    {
        public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"', 
            bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
        {
            Queue<string> qReturn = new Queue<string>();
            StringBuilder stringBuilder = new StringBuilder();

            bool bInEscapeVal = false;

            for (int i = 0; i < valToSplit.Length; i++)
            {
                if (!bInEscapeVal)
                {
                    // Escape values must come immediately after a split.
                    // abc,"b,ca",cab has an escaped comma.
                    // abc,b"ca,c"ab does not.
                    if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || (i == 0 || (i != 0 && splittingChar == valToSplit[i - 1]))))
                    {
                        bInEscapeVal = true;    // not capturing escapeChar as part of value; easy enough to change if need be.
                    }
                    else if (splittingChar == valToSplit[i])
                    {
                        qReturn.Enqueue(stringBuilder.ToString());
                        stringBuilder = new StringBuilder();
                    }
                    else
                    {
                        stringBuilder.Append(valToSplit[i]);
                    }
                }
                else
                {
                    // Can't use switch b/c we're comparing to a variable, I believe.
                    if (escapeChar == valToSplit[i])
                    {
                        // Repeated escape always reduces to one escape char in this logic.
                        // So if you wanted "I'm ""double quote"" crazy!" to come out with 
                        // the double double quotes, you're toast.
                        if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
                        {
                            i++;
                            stringBuilder.Append(escapeChar);
                        }
                        else if (!strictEscapeToSplitEvaluation)
                        {
                            bInEscapeVal = false;
                        }
                        // *** STINKY CONDITION ***  
                        // Kinda defense, since only `", ` really makes sense.
                        else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
                            valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
                        {
                            i = i+2;
                            stringBuilder.Append("\", ");
                        }
                        // *** EO STINKY CONDITION ***  
                        else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
                        {
                            bInEscapeVal = false;
                        }
                        else
                        {
                            stringBuilder.Append(escapeChar);
                        }
                    }
                    else
                    {
                        stringBuilder.Append(valToSplit[i]);
                    }
                }
            }

            // NOTE: The `captureEndingNull` flag is not tested.
            // Catch null final entry?  "abc,cab,bca," could be four entries, with the last an empty string.
            // The Length > 0 guard avoids indexing into an empty input string.
            if ((captureEndingNull && valToSplit.Length > 0 && splittingChar == valToSplit[valToSplit.Length - 1]) || (stringBuilder.Length > 0))
            {
                qReturn.Enqueue(stringBuilder.ToString());
            }

            return qReturn;
        }
    }
}
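
For contrast with the extension above, here is a minimal sketch of what a conventional quote-aware splitter could look like. It assumes the usual CSV convention (a field may be wrapped in quotes, and a literal quote inside a quoted field is written as two quotes) and makes no attempt to handle the unescaped inner quotes from the question. SimpleCsv and SplitLine are illustrative names.

```csharp
using System.Collections.Generic;
using System.Text;

public static class SimpleCsv
{
    // Splits one line using the conventional rules: commas inside quotes are data,
    // and a doubled quote inside a quoted run stands for one literal quote.
    // Lenient: a quote seen outside a quoted run always opens one.
    public static List<string> SplitLine(string line, char delimiter = ',', char quote = '"')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != quote)
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == quote)
                {
                    current.Append(quote); // doubled quote becomes one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;      // closing quote, not part of the value
                }
            }
            else if (c == quote)
            {
                inQuotes = true;           // opening quote, not part of the value
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());    // last field, possibly empty
        return fields;
    }
}
```

Unlike SplitSeeingQuotes, this sketch always returns a trailing empty field when the line ends with the delimiter, and it has no notion of strict escape placement.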

Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)

[Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.

If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.

Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.
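
The strict rule described in the last few paragraphs can be sketched on its own, stripped of the doubled-quote handling and the STINKY CONDITION. This is a simplified illustration rather than a replacement for SplitSeeingQuotes; StrictQuoteSplitter and Split are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StrictQuoteSplitter
{
    // A quote opens an escaped value only at the start of a field, and closes it only
    // when the next character is the delimiter or the end of the line.
    // Every other quote is kept as ordinary text.
    public static List<string> Split(string line, char delimiter = ',', char quote = '"')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            bool atFieldStart = i == 0 || line[i - 1] == delimiter;
            bool atFieldEnd = i + 1 == line.Length || line[i + 1] == delimiter;

            if (!inQuotes && c == quote && atFieldStart)
            {
                inQuotes = true;                 // opening quote, dropped from the value
            }
            else if (inQuotes && c == quote && atFieldEnd)
            {
                inQuotes = false;                // closing quote, dropped from the value
            }
            else if (!inQuotes && c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);               // includes stray quotes and quoted commas
            }
        }

        fields.Add(current.ToString());
        return fields;
    }
}
```

With this rule, abc,bc"a,ca"b splits into abc, bc"a, and ca"b, and the "The sample ("adasdad") asdada" column survives as one value. Because the STINKY CONDITION is left out, a column like "I was pooping in the door "Stinky", so I'll be damn" is still cut in two at the quote-comma pair.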

Enjoy. ;^)
