Match CSV line with semicolons and quotation marks inside a quoted string
Question
I'm trying to parse a single line of a CSV file. Currently this is done with an online regex tester, but in the end it has to be implemented in C#. (Added in response to some questions in the comments.)
I have read a lot of other articles here on SO to figure it out by myself, but I'm stuck.
The test line for my regex looks like this (UPDATE: quotes inside quoted strings are now escaped):
;;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo""test"
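- ; is the delimiter
- " is the sign for quoted columns
- The line may contain empty columns (a semicolon followed directly by another semicolon)
- A quoted string may contain quotation marks, as in "Test mit "" Zeichen im Spaltenwert"
- The column delimiter may also appear inside a quoted string, as in "test123;weiterer Text"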
Problem:
What I have come up with so far, after some googling and with my limited understanding of regular expressions, is this expression:
(?<=^|;)(\".\"|[^;]*)|[^;]+
which gives the following result:
[0] =>
[1] =>
[2] => "test123
[3] => weiterer Text"
[4] =>
[5] => "Test mit " Zeichen im Spaltenwert"
[6] => nächste Spalte mit " Begrenzungszeichen
[7] => "4711"
[8] => irgendwas 123,4
[9] => 1222
[10] => "foo"test"
Tested with https://www.myregextester.com/
The problem I have now is with elements 2 and 3. This text
"test123;weiterer Text"
has to be one column, but it gets split at the semicolon inside the quoted string, although I thought I had told the expression to match everything inside quotation marks.
Any help here is highly appreciated. Thanks in advance.
Recommended answer
Assuming a proper CSV that uses doubled quotes for escaping ("") and that is read line by line, you can use
"(?:[^"]+|"")*"|[^;]+|(?<=;|^)(?=;|$)
Basically, there are three different ways to match a column:
- "(?:[^"]+|"")*"
  an opening and a closing quote with non-quotes or doubled quotes in between
- [^;]+
  a series of non-semicolons
- (?<=;|^)(?=;|$)
  an empty field between two semicolons, or between a semicolon and the start/end of the line
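The three alternatives can be tried out on the test line from the question. The following is only a minimal sketch in Python, not the C# implementation the question asks for. The names FIELD_PATTERN, TEST_LINE and split_csv_line are made up for illustration. One adaptation is needed: Python's re module only accepts fixed-width lookbehind, so (?<=;|^) is written as the equivalent (?:(?<=;)|^). According to the answer, the original pattern should work unchanged in .NET.

```python
import re

# Hypothetical names; the pattern is the one from the answer, except that
# (?<=;|^) is rewritten as (?:(?<=;)|^) because Python's re module
# requires fixed-width lookbehind.
FIELD_PATTERN = re.compile(
    r'"(?:[^"]+|"")*"'        # quoted field: non-quotes or doubled quotes inside
    r'|[^;]+'                 # unquoted field: a run of non-semicolons
    r'|(?:(?<=;)|^)(?=;|$)'   # empty field between delimiters or at start/end
)

TEST_LINE = (
    ';;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";'
    'nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;'
    '1222;"foo""test"'
)


def split_csv_line(line):
    """Return the raw column texts of one line (quotes are kept as-is)."""
    return [match.group(0) for match in FIELD_PATTERN.finditer(line)]


if __name__ == "__main__":
    for index, field in enumerate(split_csv_line(TEST_LINE)):
        print(index, repr(field))
```

On the test line this yields ten columns, and "test123;weiterer Text" stays in one piece because the quoted alternative is tried before the non-semicolon alternative.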
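Note:
- If you want to use it in a multiline context, you have to add \n to the negated character classes
- It does not handle leading or trailing whitespace attached to quoted fields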
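If padding around quoted fields has to be tolerated, one possible extension is to allow optional spaces or tabs on both sides of the quoted alternative and to capture the quoted content in a group. This is a hedged sketch of that idea in Python, not part of the original answer. PADDED_FIELD and parse_line are hypothetical names, and treating only spaces and tabs as padding is an assumption. As before, the lookbehind is written as (?:(?<=;)|^) because Python requires fixed-width lookbehind.

```python
import re

# Assumption: only spaces and tabs around a quoted field count as padding.
PADDED_FIELD = re.compile(
    r'[ \t]*"((?:[^"]+|"")*)"[ \t]*'  # quoted field, content captured in group 1
    r'|[^;]+'                         # unquoted field, taken verbatim
    r'|(?:(?<=;)|^)(?=;|$)'           # empty field
)


def parse_line(line):
    """Return column values with outer quotes removed and "" unescaped."""
    values = []
    for match in PADDED_FIELD.finditer(line):
        quoted = match.group(1)
        if quoted is not None:
            # Quoted field: drop the padding and outer quotes, unescape "".
            values.append(quoted.replace('""', '"'))
        else:
            # Unquoted or empty field: keep the matched text unchanged.
            values.append(match.group(0))
    return values


if __name__ == "__main__":
    print(parse_line('a; "b;c" ;;"say ""hi""";'))
```

Unquoted fields are deliberately left untouched here, including any surrounding spaces; whether those should be trimmed as well depends on the data.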
See https://regex101.com/r/twKZVN/1
(While regex101 tests a PCRE pattern, all features used are also available in a .NET pattern.)