Parse a TSV file
Question
I need to parse a file in TSV (tab-separated values) format. I use a regex to split the file into lines, but I cannot find a satisfactory one to parse each line. So far I've come up with this:
(?<g>("[^"]+")+|[^\t]+)
But it does not work if an item in the line has more than 2 consecutive double quotes.
Here's how the file is formatted: elements are separated by tabs. If an item contains a tab, it is enclosed in double quotes. If an item contains a double quote, the quote is doubled. But sometimes an element contains 4 consecutive double quotes, and the above regex splits that element into 2 different ones.
Example:
item1ok "item""2""oK"
is correctly parsed into 2 elements: item1ok and item"2"oK (after trimming the unnecessary quotes), but:
item1oK "item""""2oK"
is parsed into 3 elements: item1oK, item and "2oK (again after trimming).
Does anyone have an idea how to make the regex fit this case? Or is there another, simpler way to parse TSV? (I'm doing this in C#.)
Recommended answer
You could use the TextFieldParser class. It technically lives in a VB assembly, but you can use it from C# as well by referencing the Microsoft.VisualBasic assembly and importing the Microsoft.VisualBasic.FileIO namespace.
The example at the link above even shows using it on a tab-separated file.
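For illustration, here is a minimal sketch of how TextFieldParser might be applied to the two sample lines from the question. The class name TsvExample and the helper ParseTsv are hypothetical names introduced here, not part of the original answer, and the sketch assumes a project where the Microsoft.VisualBasic assembly is available. It relies on the parser's handling of quote-enclosed fields rather than on a regex.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TsvExample
{
    // Hypothetical helper: parses TSV text into a list of rows,
    // each row being an array of field values.
    public static List<string[]> ParseTsv(string text)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");               // fields are separated by tabs
            parser.HasFieldsEnclosedInQuotes = true;  // fields may be wrapped in double quotes
            parser.TrimWhiteSpace = false;            // keep field contents unchanged

            while (!parser.EndOfData)
            {
                // ReadFields throws MalformedLineException if a line cannot be parsed.
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }

    public static void Main()
    {
        // The two sample lines from the question, with a real tab between the items.
        string sample =
            "item1ok\t\"item\"\"2\"\"oK\"\n" +
            "item1oK\t\"item\"\"\"\"2oK\"\n";

        foreach (string[] fields in ParseTsv(sample))
        {
            Console.WriteLine(string.Join(" | ", fields));
        }
    }
}
```

To read from disk instead, the TextFieldParser constructor that takes a file path can replace the StringReader. Wrapping the loop in a try/catch for MalformedLineException is worth considering if the input may contain unbalanced quotes.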