Parse a TSV file


Question

I need to parse a file in TSV format (tab-separated values). I use a regex to break the file into lines, but I cannot find a satisfactory one to parse each line. For now I've come up with this:

(?<g>("[^"]+")+|[^\t]+)

But it does not work if an item in the line contains more than 2 consecutive double quotes.

Here's how the file is formatted: elements are separated by tabs. If an item contains a tab, it is enclosed in double quotes. If an item contains a double quote, that quote is doubled. But sometimes an element contains 4 consecutive double quotes, and the above regex splits that element into 2 different ones.

Example (items are separated by a tab):

item1ok "item""2""oK"

is correctly parsed into 2 elements: item1ok and item"2"ok (after trimming the unnecessary quotes), but:

item1oK "item""""2oK"

is parsed into 3 elements: item1ok, item and "2ok (again after trimming).

Does anyone have an idea how to make the regex fit this case? Or is there another simple way to parse TSV? (I'm doing this in C#.)

Recommended answer

You could use the TextFieldParser. It ships in a Visual Basic assembly, but you can use it from C# as well by referencing the Microsoft.VisualBasic assembly and importing the Microsoft.VisualBasic.FileIO namespace.
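
As a rough illustration of that suggestion, here is a minimal sketch of reading tab-separated rows with TextFieldParser from C#. The class name TsvReader and the method ReadRows are hypothetical names chosen for this example; the parser members used (TextFieldType, SetDelimiters, HasFieldsEnclosedInQuotes, TrimWhiteSpace, EndOfData, ReadFields) belong to Microsoft.VisualBasic.FileIO.TextFieldParser. It assumes the project can reference the Microsoft.VisualBasic assembly.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO; // namespace inside the Microsoft.VisualBasic assembly

public static class TsvReader
{
    // Hypothetical helper: reads every row of tab-separated text into string arrays.
    public static List<string[]> ReadRows(TextReader reader)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");
            // Let the parser treat a double-quoted item as a single field.
            parser.HasFieldsEnclosedInQuotes = true;
            // Keep leading and trailing spaces inside fields as they are.
            parser.TrimWhiteSpace = false;

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

It could be called with a file, for instance `TsvReader.ReadRows(new StreamReader("data.tsv"))`, where `data.tsv` is a placeholder path. Note that TextFieldParser skips blank lines, which may or may not be what you want.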

The example in its documentation (linked from the original answer) even shows it being used on a tab-separated file.
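
If referencing that assembly is not an option, the quoting rules described in the question can also be handled by a small hand-written splitter instead of a regex. The following is only a sketch under the rules stated in the question (tab as separator, optional surrounding double quotes, a doubled quote standing for one literal quote); TsvLine and Split are hypothetical names, and it does not handle quoted fields that span several lines.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TsvLine
{
    // Hypothetical helper: splits one TSV line using the rules from the question.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c); // tabs are ordinary characters inside quotes
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote means one literal quote
                    i++;                 // skip the second quote of the pair
                }
                else
                {
                    inQuotes = false;    // single quote closes the quoted item
                }
            }
            else if (c == '"' && current.Length == 0)
            {
                inQuotes = true;         // opening quote at the start of an item
            }
            else if (c == '\t')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last item on the line
        return fields;
    }
}
```

Because a doubled quote is consumed as a pair, a run of 4 consecutive quotes inside a quoted item becomes 2 literal quotes and no longer splits the item.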
