CSV Parsing Options with .NET

Problem description

I'm looking at my options for parsing delimited files (e.g. CSV, tab-separated, etc.) on the Microsoft stack in general, and .NET specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.

So my options appear to be:

  1. Regex.Split
  2. TextFieldParser (http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx)
  3. OLEDB CSV Parser (http://www.switchonthecode.com/tutorials/csharp-tutorial-using-the-built-in-oledb-csv-parser)

I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):

101, Bob, "Keeps his house ""clean"".
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."

The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third column of each row must preserve the newlines! Said differently, the parser must recognize when a line is "continuing" onto the next physical row because of an "unclosed" text qualifier.

The second criterion is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse:

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

A proper parsing of string "first" would be:

  • This
  • Is,A,Record
  • That "Cannot", they say,
  • _
  • _
  • be
  • rightly
  • parsed
  • at all

The '_' simply means that a blank was captured - I don't want a literal underbar to appear.

One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.

Now to dive into the technology options.

Regex

First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:

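// Split on commas that are followed by an even number of double quotes.
var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
Regex.Split(first, regex).Dump(); // Dump() is LINQPad's output helper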

The results, applied to string "first," are quite wonderful:

  • "This"
  • "Is,A,Record"
  • "That ""Cannot"", they say,"
  • ""
  • _
  • "be"
  • rightly
  • "parsed"
  • at all

It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!
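The two adjustments mentioned above (swapping in other delimiter and qualifier characters, and cleaning up the qualifiers afterwards) could be sketched roughly as follows. This is only an illustration, not a vetted parser: the class and method names (`DelimitedLine`, `BuildSplitter`, `Unquote`, `Parse`) are hypothetical, it assumes single-character delimiters and qualifiers, and it assumes that whitespace outside a qualified field can be discarded.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class DelimitedLine
{
    // Encode a character as \uXXXX so it is safe both inside and outside
    // a regex character class, whatever the character is.
    private static string Hex(char c) => "\\u" + ((int)c).ToString("X4");

    // Same idea as the regex in the question: split on a delimiter that is
    // followed by an even number of qualifier characters.
    public static Regex BuildSplitter(char delimiter, char qualifier)
    {
        string d = Hex(delimiter);
        string q = Hex(qualifier);
        string pattern = $"{d}(?=(?:[^{q}]*{q}[^{q}]*{q})*(?![^{q}]*{q}))";
        return new Regex(pattern);
    }

    // Post-process step: strip the surrounding qualifiers and collapse
    // doubled qualifiers. Unqualified fields are returned untouched.
    public static string Unquote(string field, char qualifier)
    {
        string q = qualifier.ToString();
        string trimmed = field.Trim();
        if (trimmed.Length >= 2
            && trimmed[0] == qualifier
            && trimmed[trimmed.Length - 1] == qualifier)
        {
            return trimmed.Substring(1, trimmed.Length - 2).Replace(q + q, q);
        }
        return field;
    }

    // Split one complete logical row and clean up each field.
    public static string[] Parse(string logicalRow, char delimiter, char qualifier)
    {
        return BuildSplitter(delimiter, qualifier)
            .Split(logicalRow)
            .Select(f => Unquote(f, qualifier))
            .ToArray();
    }
}
```

With this sketch, `DelimitedLine.Parse(first, ',', '"')` and `DelimitedLine.Parse(second, '|', '~')` would cover the two sample strings. Building a new Regex on every call is wasteful; a real implementation would probably cache one splitter per file.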

But the real problem pertains to the multi-line criterion. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.

So this becomes a "chicken and the egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort out the multiple lines (I didn't check whether the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious.
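One possible way around the chicken-and-egg problem, sketched here under assumptions rather than offered as a tested solution, is a very small state machine that only tracks whether a qualified field is currently open. A physical line containing an odd number of qualifier characters toggles that state; doubled (escaped) qualifiers add two to the count and so leave it unchanged. The names `LogicalRowReader` and `ReadLogicalRows` are hypothetical, and the sketch assumes that qualifiers inside a field are always escaped by doubling. Note that `ReadLine` discards the original line terminator, so embedded newlines come back as "\n".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper; names are illustrative only.
public static class LogicalRowReader
{
    // Streams logical rows from the reader, buffering one row at a time,
    // so a very large file never has to be held in memory as one string.
    public static IEnumerable<string> ReadLogicalRows(TextReader reader, char qualifier)
    {
        var pending = new StringBuilder();
        bool insideField = false; // true while a qualified field is still open
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            // Continuing an open field: restore the newline ReadLine removed.
            if (insideField)
            {
                pending.Append('\n');
            }
            pending.Append(line);

            // An odd number of qualifiers on this line opens or closes a field.
            if (line.Count(c => c == qualifier) % 2 == 1)
            {
                insideField = !insideField;
            }

            if (!insideField)
            {
                yield return pending.ToString();
                pending.Clear();
            }
        }

        // The file ended inside an open field: return what was read so the
        // caller can decide how to treat the malformed row.
        if (pending.Length > 0)
        {
            yield return pending.ToString();
        }
    }
}
```

Each row returned this way is a complete logical row, so it can be handed to whatever field splitter is in use (for example the Regex.Split call above) without reading ahead. For the five-line sample file, this should produce two rows, with the line breaks inside the third column preserved.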

On to the next option.

TextFieldParser

Three lines of code will make the problem with this option apparent:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;

The Delimiters configuration looks good. However, "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, yet the only text qualifier on offer is the quotation mark. Remember, I need configurability over the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over.

OLEDB

A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good.

Help

So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary text qualifier. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).

What say you?

Recommended answer

Did you try searching for an already-existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.
