Regex to find missing double quote in csv

Problem description

We are processing csv files that contain lines with unclosed double-quoted entries. These blow up the csv parser, so I am trying to put together a regex that identifies these lines so we can delete them from the files before trying to process them.

In the following example, the csv parser gets to line 2 and includes everything up to the first double quote in line 3 in the token. It then tries to close out the token and blows up, because there are non-whitespace characters after the "closing" double quote and before the next comma.


Example Line 1,some data,"good line",processes fine,happy
Example Line 2,some data,"bad line,processes poorly,unhappy
Example Line 3,some data,"good line",dies before here,unhappy

I am trying to do something like:

.*,"[^(",)]*[\r\n]

The idea is to find a single line that contains ," with no ", following it anywhere before the line ends.

The negation of the sequence is not working, though. How is something like this done?
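One way to see why the attempt fails, and how a sequence can be negated, is the sketch below. Inside a character class, `(`, `"`, `,` and `)` are four separate literals, so `[^(",)]` rejects each character individually rather than the two-character sequence `",`. A negative lookahead can express "not followed by this sequence" instead. This is an illustrative alternative, not the approach of the answer further down; the name `unclosed` is hypothetical, and the pattern assumes a closing quote is always a `"` immediately followed by a comma or the end of the line.

```python
import re

# Inside [...], the characters ( " , ) are independent literals, not a group.
char_class = re.compile(r'[^(",)]*')
print(char_class.fullmatch('a(b') is None)  # True: a lone "(" is rejected

# Hypothetical pattern: an opening quote at the start of a field, then any
# characters that do not begin a closing quote (a " followed by , or end of
# line), all the way to the end of the line.
unclosed = re.compile(r'(?:^|,)"(?:(?!"(?:,|$)).)*$')

lines = [
    'Example Line 1,some data,"good line",processes fine,happy',
    'Example Line 2,some data,"bad line,processes poorly,unhappy',
    'Example Line 3,some data,"good line",dies before here,unhappy',
]
flagged = [line for line in lines if unclosed.search(line)]
print(flagged)  # only Example Line 2 is flagged
```

The lookahead `(?!"(?:,|$))` is checked before every consumed character, so the scan stops as soon as a real closing quote appears, and the final `$` then fails for properly closed fields.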

Note:

Since people keep suggesting essentially checking for an even number of double quotes, it's worth noting that a single double-quoted csv entry could contain a standalone double quote (e.g. ...,"Measurement: 1' 2"",...).
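A small sketch of why parity counting is not enough under this note: a well-formed field that contains one standalone quote has an odd number of quotes, exactly like a truly unclosed line. The surrounding fields `id` and `unit` and the helper name `has_odd_quotes` are made up for illustration.

```python
# A valid line per the note above: the quoted field holds one standalone quote.
good_field_line = 'id,"Measurement: 1\' 2"",unit'
# The unclosed line from the question.
bad_line = 'Example Line 2,some data,"bad line,processes poorly,unhappy'

def has_odd_quotes(line):
    """Hypothetical parity check: True when the line has an odd number of quotes."""
    return line.count('"') % 2 == 1

print(has_odd_quotes(good_field_line))  # True, although the line is fine
print(has_odd_quotes(bad_line))         # True, and the line really is broken
```

Both lines look identical to a parity check, so it cannot tell them apart.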

Recommended answer

With your current requirements (including your concern about "Measurement: 1' 2""), this will select the bad lines:

^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)




  1. The ^ anchors the match at the start of the string.
  2. The .*(?:^|,) consumes any characters up to the start of the string or a comma, that is, up to the beginning of a field.
  3. [^",]*" matches any characters that are neither a " nor a comma, followed by a ": the quote that is never closed.
  4. Then, once or more times, [^",]*(?:"[^",]*")? matches characters that are neither a " nor a comma and, optionally, a balanced set of quotes: "[^",]*"
  5. Finally, we match either the end of the string, or a comma and anything that follows.
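The steps above can be tried out with a short Python sketch that applies the pattern to the three sample lines from the question. The pattern is copied verbatim from the answer; the helper name `drop_bad_lines` is hypothetical, and the sketch assumes each record sits on one line with no trailing newline issues.

```python
import re

# The answer's pattern, unchanged. It is meant to match lines with an unclosed quote.
BAD_LINE = re.compile(r'^.*(?:^|,)[^",]*"(?:[^",]*(?:"[^",]*")?)+(?:$|,.*)')

def drop_bad_lines(lines):
    """Hypothetical helper: keep only the lines the pattern does not flag."""
    return [line for line in lines if not BAD_LINE.search(line)]

sample = [
    'Example Line 1,some data,"good line",processes fine,happy',
    'Example Line 2,some data,"bad line,processes poorly,unhappy',
    'Example Line 3,some data,"good line",dies before here,unhappy',
]

for line in drop_bad_lines(sample):
    print(line)  # prints Example Line 1 and Example Line 3
```

Two things are worth checking against real data before relying on this. The group in step 4 nests one quantifier inside another, so very long fields could make a failed match slow. Also, in this sketch a line such as a,"Measurement: 1' 2"",b is flagged as well, because an opening quote followed by a balanced pair fits steps 3 and 4.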

A note about escaped double quotes

Your input may contain double-quoted strings that include an escaped double quote, like this: "abc\"de". If so, we need to replace our expression for double-quoted strings, (?:"[^",]*"), with something more solid: (?:"(?:\\"|[^"])*")

Hence the whole regex would become:

^.*(?:^|,)[^",]*"(?:[^",]*(?:"(?:\\"|[^"])*")?)+(?:$|,.*)
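To illustrate only the sub-expression that changed, the sketch below compares the two quoted-string patterns on the "abc\"de" example in isolation. The variable names are hypothetical, and the test uses `fullmatch` so that the whole field has to be covered.

```python
import re

# The original sub-expression for a balanced quoted string.
simple_quoted = re.compile(r'"[^",]*"')
# The replacement, which also accepts backslash-escaped quotes inside the string.
escaped_quoted = re.compile(r'"(?:\\"|[^"])*"')

field = r'"abc\"de"'  # the characters: "abc\"de"

print(simple_quoted.fullmatch(field) is None)       # True: stops at the escaped quote
print(escaped_quoted.fullmatch(field) is not None)  # True: \" is consumed as a unit
```

The alternation tries `\\"` before `[^"]`, so a backslash followed by a quote is consumed as one escaped character rather than ending the string.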
