Python Regex Engine - "look-behind requires fixed-width pattern" Error

Problem description

I am trying to handle un-matched double quotes within a string in the CSV format.

To be precise,

"It "does "not "make "sense", Well, "Does "it"

should be corrected as

"It" "does" "not" "make" "sense", Well, "Does" "it"

So basically what I am trying to do is to

replace all the ' " '

  1. Not preceded by a beginning of line or a comma (and)
  2. Not followed by a comma or an end of line

with ' " " '

For that I use the below regex

(?<!^|,)"(?!,|$)

The problem is that while Ruby regex engines ( http://www.rubular.com/ ) are able to parse this regex, Python regex engines (https://pythex.org/ , http://www.pyregex.com/) throw the following error

Invalid regular expression: look-behind requires fixed-width pattern

And with python 2.7.3 it throws

sre_constants.error: look-behind requires fixed-width pattern

Can anyone tell me what vexes python here?
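
For reference, the error can be reproduced in a few lines. The sketch below assumes Python 3, where the exception is `re.error` rather than `sre_constants.error`. It also shows that splitting the alternation into two separate lookbehinds, each with a single fixed width, lets the pattern compile. The names `original` and `split_lookbehind` are only for illustration, and this is a compile-time demonstration, not the fix that was eventually adopted below.

```python
import re

# The pattern from the question: the lookbehind alternates between
# '^' (matches zero characters) and ',' (matches one character).
original = r'(?<!^|,)"(?!,|$)'

try:
    re.compile(original)
    compiled_ok = True
    message = ""
except re.error as exc:
    # Python's re module rejects lookbehinds whose alternatives differ in width.
    compiled_ok = False
    message = str(exc)

print(compiled_ok)
print(message)

# Two consecutive lookbehinds, each of a single fixed width, are accepted.
split_lookbehind = re.compile(r'(?<!^)(?<!,)"(?!,|$)')
```

Both negative lookbehinds must hold at the same position, so together they state the same "not preceded by start of line or comma" condition.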

==================================================================================

EDIT :

Following Tim's response, I got the below output for a multi line string

>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '

At the end of each line, next to 'it' two double-quotes were added.

So I made a very small change to the regex to handle a new-line.

re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)

But this gives the output

>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '

The last 'it' alone has two double-quotes.

But I wonder why the '$' end-of-line anchor does not recognize that the line has ended.

==================================================================================

The final answer is

re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
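
A minimal sketch of why the extra `[ \t]*` matters, assuming the cause is the trailing space before the closing `"""` in the test string: with that space, the last quote is followed by a space rather than by the end of the line, so `$` does not match directly after it. The two-line string `text` below is a shortened, hypothetical stand-in for the original input.

```python
import re

# First line ends with a trailing space after the closing quote.
text = '"Does "it" \n"Does "it"'

# Without allowing trailing blanks, the quote before the trailing space
# is not "followed by end of line", so it is doubled as well.
naive = re.sub(r'\b\s*"(?!,|$)', '" "', text, flags=re.MULTILINE)

# Allowing optional spaces/tabs before the end of line leaves it alone.
fixed = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', text, flags=re.MULTILINE)

print(repr(naive))
print(repr(fixed))
```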

Solution

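Python lookbehind assertions need to be fixed width, and the alternatives in (?<!^|,) have different widths (^ matches zero characters, the comma matches one). But you can try this: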

>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'

Explanation:

\b      # Start the match at the end of a "word"
\s*     # Match optional whitespace
"       # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
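
The commented breakdown above can also be kept inside the pattern itself by compiling with `re.VERBOSE`, which ignores whitespace and `#` comments in the pattern. This is just a sketch of the same regex; the name `UNMATCHED_QUOTE` is made up for the example.

```python
import re

UNMATCHED_QUOTE = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or end of string
''', re.VERBOSE)

s = '"It "does "not "make "sense", Well, "Does "it"'
result = UNMATCHED_QUOTE.sub('" "', s)
print(result)
```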
