Match CSV line with semicolons and quotation marks inside a quoted string


Problem description

I'm trying to parse a single line of a CSV file. Currently this is done with an online regex tester, but in the end it has to be implemented in C#. (Added in response to some questions in the comments.)

I have read a lot of other articles here on SO to figure it out by myself, but I'm stuck.

The test line for my regular expression looks like this (UPDATE: quotes inside quoted strings are now escaped):

;;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo""test"

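  • ; is the delimiter
  • " is the sign for quoted columns

Problems:

  • The line may contain empty columns (a semicolon followed directly by another semicolon, with no text in between)
  • Quoted strings may contain quotation marks, as in "Test mit "" Zeichen im Spaltenwert"
  • The column delimiter may also appear inside a quoted string, as in "test123;weiterer Text"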

What I have come up with so far, after a lot of googling and with my limited understanding of regular expressions, is this expression:

      (?<=^|;)(\".\"|[^;]*)|[^;]+

This gives the following result:

              [0] => 
              [1] => 
              [2] => "test123
              [3] => weiterer Text"
              [4] => 
              [5] => "Test mit " Zeichen im Spaltenwert"
              [6] => nächste Spalte mit " Begrenzungszeichen
              [7] => "4711"
              [8] => irgendwas 123,4
              [9] => 1222
              [10] => "foo"test"
      

Tested with https://www.myregextester.com/

The problem I have now is with elements 2 and 3. This text

      "test123;weiterer Text"
      

has to be one column, but it gets split at the semicolon inside the quoted string, although I thought I had told the expression to match everything inside quotation marks.

Any help here is highly appreciated. Thanks in advance.

Recommended answer

Assuming a proper CSV that uses doubled quotes for escaping ("") and that is read line by line, you can use

      "(?:[^"]+|"")*"|[^;]+|(?<=;|^)(?=;|$)
      

The pattern has three alternatives, one for each way a column can appear:

  • "(?:[^"]+|"")*" an opening and a closing quote with non-quotes or doubled quotes in between
  • [^;]+ a series of non-semicolons
  • (?<=;|^)(?=;|$) an empty field between two semicolons, or between a semicolon and the start/end of the line
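Since the question targets C#, the three alternatives above can be tried out with a minimal sketch like the following. It assumes one line is processed at a time, as the answer states. The names `CsvLine`, `SplitFields` and `Unquote` are hypothetical helpers introduced only for illustration, and the unquoting step is an assumed post-processing convenience rather than part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
static class CsvLine
{
    // The pattern from the answer as a C# verbatim string
    // (inside @"..." every literal " has to be doubled).
    private static readonly Regex FieldPattern = new Regex(
        @"""(?:[^""]+|"""")*""|[^;]+|(?<=;|^)(?=;|$)");

    // Returns the raw column texts of one line; quoted columns keep their quotes.
    public static List<string> SplitFields(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
        {
            fields.Add(m.Value);
        }
        return fields;
    }

    // Assumed post-processing: strip the surrounding quotes of a quoted
    // column and turn each doubled quote back into a single one.
    public static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}

static class Program
{
    static void Main()
    {
        // Test line from the question (variant with escaped quotes).
        string line = @";;""test123;weiterer Text"";;""Test mit """" Zeichen im Spaltenwert"";nächste Spalte mit """" Begrenzungszeichen;""4711"";irgendwas 123,4;1222;""foo""""test""";

        List<string> fields = CsvLine.SplitFields(line);
        for (int i = 0; i < fields.Count; i++)
        {
            Console.WriteLine($"[{i}] => {CsvLine.Unquote(fields[i])}");
        }
    }
}
```

With the question's test line this should yield ten columns, with "test123;weiterer Text" kept together as a single column. The separating semicolons themselves are never matched; only the column contents and the empty positions between delimiters are.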

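Notes:

  • If you want to use it in a multi-line context, you have to add \n to the negated character classes
  • It does not handle leading or trailing whitespace adjacent to quoted fields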
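The first note can be illustrated with a short C# sketch. It assumes that the whole text is matched at once with `RegexOptions.Multiline`, that lines end in `\n` only (not `\r\n`), and that quoted columns do not span lines. The class name `MultiLineCsv` and the sample text are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder for the adapted pattern, not part of the original answer.
static class MultiLineCsv
{
    // The answer's pattern with \n added to both negated character classes.
    // RegexOptions.Multiline lets ^ and $ match at every line boundary.
    public static readonly Regex FieldPattern = new Regex(
        @"""(?:[^""\n]+|"""")*""|[^;\n]+|(?<=;|^)(?=;|$)",
        RegexOptions.Multiline);
}

static class Program
{
    static void Main()
    {
        // Two lines: a;"b;c" and ;d
        string text = "a;\"b;c\"\n;d";

        foreach (Match m in MultiLineCsv.FieldPattern.Matches(text))
        {
            // m.Index can be used to work out which line a column belongs to.
            Console.WriteLine($"index {m.Index}: '{m.Value}'");
        }
    }
}
```

The matches come back as one flat sequence, so row boundaries have to be recovered separately, for example from `m.Index`. Splitting the input into lines first and applying the unmodified pattern to each line, as the answer assumes, avoids that extra step.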

See https://regex101.com/r/twKZVN/1

(While regex101 tests a PCRE pattern, all features used are also available in a .NET pattern.)
