.NET 的 CSV 解析选项 [英] CSV Parsing Options with .NET

查看:17
本文介绍了.NET 的 CSV 解析选项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在查看基于 MS 堆栈的分隔文件(例如 CSV、制表符分隔等)解析选项,特别是 .net.我唯一排除的技术是 SSIS,因为我已经知道它不能满足我的需求.

I'm looking at my delimited-file (e.g. CSV, tab seperated, etc.) parsing options based on MS stack in general, and .net specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.

所以我的选择似乎是:

  1. Regex.Split
  2. TextFieldParser
  3. OLEDB CSV 解析器

我必须满足两个条件.首先,给定以下文件,其中包含两个逻辑数据行(总共五个物理行):

I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):

101,鲍勃,保持他的房子干净".
需要洗衣服."
102,艾米,太棒了.
驱动.
勤奋."

解析结果必须产生两个逻辑行",每行由三个字符串(或列)组成.第三行/列字符串必须保留换行符!换句话说,由于未关闭"文本限定符,解析器必须识别行何时继续"到下一个物理行.

The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third row/column string must preserve the newlines! Said differently, the parser must recognize when lines are "continuing" onto the next physical row, due to the "unclosed" text qualifier.

第二个标准是每个文件的分隔符和文本限定符必须是可配置的.这里有两个字符串,取自不同的文件,我必须能够解析:

The second criteria is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse:

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

字符串first"的正确解析是:

A proper parsing of string "first" would be:

  • 这个
  • 是,A,记录
  • 他们说,不能"
  • _
  • _
  • 成为
  • 正确
  • 解析
  • 根本

'_' 仅表示捕获了一个空白 - 我不希望出现文字下划线.

The '_' simply means that a blank was captured - I don't want a literal underbar to appear.

可以对要解析的平面文件做出一个重要假设:每个文件的列数是固定的.

One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.

现在深入了解技术选项.

Now for a dive into the technical options.

正则表达式

首先,许多响应者评论正则表达式不是实现目标的最佳方式".但是,我确实找到了一个 评论员CSV 正则表达式:

First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:

var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
var Regex.Split(first, regex).Dump();

应用于字符串first"的结果非常棒:

The results, applied to string "first," are quite wonderful:

  • 这个"
  • 是,A,记录"
  • 那个不能",他们说,
  • ""
  • _
  • 成为"
  • 正确
  • 解析"
  • 根本

如果引号被清理干净就好了,但我可以轻松地将其作为后期处理步骤处理.否则,此方法可用于解析示例字符串first"和second",前提是相应地针对波浪号和管道符号修改了正则表达式.太棒了!

It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!

但真正的问题与多行标准有关.在将正则表达式应用于字符串之前,我必须从文件中读取完整的逻辑行".不幸的是,除非我有正则表达式/状态机,否则我不知道要读取多少物理行才能完成逻辑行.

But the real problem pertains to the multi-line criteria. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.

所以这变成了一个先有鸡还是先有蛋"的问题.我最好的选择是将整个文件作为一个巨大的字符串读入内存,然后让正则表达式整理出多行(我没有检查上面的正则表达式是否可以处理).如果我有一个 10 gig 文件,这可能有点不稳定.

So this becomes a "chicken and the egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort-out the multiple lines (I didn't check if the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious.

进入下一个选项.

TextFieldParser

三行代码将使此选项的问题显而易见:

Three lines of code will make the problem with this option apparent:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;

分隔符配置看起来不错.然而,HasFieldsEnclosedInQuotes"是游戏结束".我很惊讶分隔符可以任意配置,但相比之下,除了引号之外,我没有其他限定符选项.请记住,我需要文本限定符的可配置性.再说一遍,除非有人知道 TextFieldParser 配置技巧,否则游戏就结束了.

The Delimiters configuration looks good. However, the "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, but in contrast I have no other qualifier option other than quotations. Remember, I need configurability over the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over.

OLEDB

一位同事告诉我这个选项有两个主要缺点.首先,它对于大型(例如 10 gig)文件的性能很差.其次,我被告知,它猜测输入数据的数据类型,而不是让您指定.不好.

A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses data types of input data rather than letting you specify. Not good.

帮助

所以我想知道我错了的事实(如果有的话),以及我错过的其他选项.也许有人知道一种陪审团绑定 TextFieldParser 以使用任意分隔符的方法.也许 OLEDB 已经解决了上述问题(或者可能从未解决过这些问题?).

So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary delimiter. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).

你说什么?

推荐答案

您是否尝试搜索已经存在的 .NET CSV 解析器?这个声称处理多行记录的速度明显快于 OLEDB.

Did you try searching for an already-existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.

这篇关于.NET 的 CSV 解析选项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆