Python Regex Engine - "look-behind requires fixed-width pattern" Error
Problem description
I am trying to handle un-matched double quotes within a string in the CSV format.
To be precise,
"It "does "not "make "sense", Well, "Does "it"
should be corrected as
"It" "does" "not" "make" "sense", Well, "Does" "it"
So basically what I am trying to do is to
replace all the ' " '
- Not preceded by a beginning of line or a comma (and)
- Not followed by a comma or an end of line
with ' " " '
For that I use the below regex
(?<!^|,)"(?!,|$)
The problem is that while Ruby regex engines ( http://www.rubular.com/ ) are able to parse the regex, Python regex engines (https://pythex.org/ , http://www.pyregex.com/) throw the following error
Invalid regular expression: look-behind requires fixed-width pattern
And with python 2.7.3 it throws
sre_constants.error: look-behind requires fixed-width pattern
Can anyone tell me what vexes Python here?
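To make the error concrete, here is a minimal sketch that reproduces it with Python's `re` module. The alternatives inside `(?<!^|,)` have different widths (`^` matches zero characters, `,` matches one), which is what the engine rejects. The split form with two separate look-behinds is an assumed workaround, not something from the question, and the helper name `try_compile` and the sample string are made up for illustration. Note that it only gets past the compile error: whitespace before a quote is still left in place, so it does not by itself produce the corrected string shown above.

```python
import re

# The pattern from the question: the look-behind alternatives
# have widths 0 (^) and 1 (,).
VARIABLE_WIDTH = r'(?<!^|,)"(?!,|$)'
# Assumed workaround: one look-behind per alternative, so each has a single width.
SPLIT_LOOKBEHINDS = r'(?<!^)(?<!,)"(?!,|$)'


def try_compile(pattern):
    """Return the compiled pattern, or None if the re module rejects it."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print("rejected:", exc)
        return None


rejected = try_compile(VARIABLE_WIDTH)     # prints the fixed-width complaint
accepted = try_compile(SPLIT_LOOKBEHINDS)  # compiles without error

# Start offsets of the quotes this pattern selects in a small made-up sample:
# only the quote before "y" qualifies, the one before the comma does not.
positions = [m.start() for m in accepted.finditer('x "y",z')]
print(positions)  # [2]
```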
==================================================================================
EDIT :
Following Tim's response, I got the below output for a multi line string
>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '
At the end of each line, next to 'it' two double-quotes were added.
So I made a very small change to the regex to handle a new-line.
re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
But this gives the output
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '
Now only the last 'it' still has two double quotes.
But I wonder why the '$' end-of-line anchor does not recognize that the line has ended.
==================================================================================
The final answer is
re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
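A likely reason the plain `$` version still touched the last line is that the test string ends with a space between the final `"it"` and the closing `"""`, so the last quote is followed by a space rather than by the end of the line. The sketch below compares the two patterns on a shortened, made-up two-line sample with the same trailing space; the variable names are illustrative only.

```python
import re

# Shortened stand-in for the multi-line string, with a trailing space on the last line.
text = '"Does "it"\n"Does "it" '

plain_dollar = re.sub(r'\b\s*"(?!,|$)', '" "', text, flags=re.MULTILINE)
allow_trailing_blanks = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', text, flags=re.MULTILINE)

# With re.MULTILINE, $ matches right before each newline, so the first line is fine,
# but the last quote is followed by a space and therefore still gets replaced.
print(repr(plain_dollar))           # '"Does" "it"\n"Does" "it" " '
# Allowing spaces or tabs before $ leaves the closing quote of the last line alone.
print(repr(allow_trailing_blanks))  # '"Does" "it"\n"Does" "it" '
```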
Python lookbehind assertions need to be fixed width, but you can try this:
>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'
Explanation:
\b # Start the match at the end of a "word"
\s* # Match optional whitespace
" # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
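The explanation above can be carried into the code itself with `re.VERBOSE`, which lets the pattern keep its per-token comments. This is only a sketch of the same regex from the answer; the names `QUOTE_FIX` and `close_quotes` are hypothetical.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so each part is commented.
QUOTE_FIX = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or the end of the string
''', re.VERBOSE)


def close_quotes(line):
    """Insert the missing closing quote before each opening quote that follows a word."""
    return QUOTE_FIX.sub('" "', line)


s = '"It "does "not "make "sense", Well, "Does "it"'
fixed = close_quotes(s)
print(fixed)  # "It" "does" "not" "make" "sense", Well, "Does" "it"
```

For multi-line input, the flags would need to include `re.MULTILINE` as well (for example `re.VERBOSE | re.MULTILINE`), as in the edit to the question.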