Backslash escape sequences and word boundaries in Python regex

Problem description

I am currently using re.sub(re.escape("andrew)"), "SUB", stringVar)

Intended behavior:

stringVar = " andrew) "
re.sub(re.escape("andrew)"), "SUB", stringVar) # Returns " SUB "

Unintended behavior:

stringVar = "zzzandrew)zzz"
re.sub(re.escape("andrew)"), "SUB", stringVar) # Returns "zzzSUBzzz"

So I tried to use word boundaries to fix the "zzzandrew)zzz" case, but my fix breaks the base case.

stringVar = " andrew) "
re.sub(r'\b%s\b' % re.escape("andrew)"), "SUB", stringVar) # No match; returns the original stringVar unchanged

According to https://docs.python.org/2.0/ref/strings.html, raw strings use different rules for backslash escape sequences. So what should I do besides re.escape?

Solution

From the Python re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
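As a quick check of the quoted passage, here is a minimal sketch that runs the documentation's own examples. It also touches on the raw-string point from the question. The variable names (`samples`, `results`, `no_raw`) are only illustrative.

```python
import re

# In a normal string literal "\b" is one character (backspace, 0x08).
# In a raw string r"\b" is a backslash plus "b", which re reads as a word boundary.
assert len("\b") == 1 and len(r"\b") == 2

pattern = re.compile(r"\bfoo\b")
samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]

# True where the pattern is found somewhere in the sample string.
results = {s: bool(pattern.search(s)) for s in samples}
for sample, found in results.items():
    print(repr(sample), found)

# Without the r prefix the pattern contains literal backspace characters,
# so it does not find the plain word "foo".
no_raw = re.search("\bfoo\b", "foo")
print(no_raw)
```

The first four samples match and the last two do not, as the docs describe. Since the pattern in the question already uses an r'...' literal, raw strings are not what breaks the base case.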

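In your case, the word boundary falls between andrew and ), because ) is the first non-alphanumeric, non-underscore character. The trailing \b in your pattern therefore cannot match: it sits between ) and the following space, which are both \W characters, so there is no boundary at that position. The example below illustrates what happens if you include or exclude ')' from the escape.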

>>> import re
>>> stringVar = " andrew) "
>>> re.sub(r'\b%s\b' % re.escape("andrew)"), "SUB", stringVar)
' andrew) '
>>> re.sub(r'\b%s\b' % re.escape("andrew"), "SUB", stringVar)
' SUB) '
>>> stringVar = "zzzandrew)zzz"
>>> re.sub(r'\b%s\b' % re.escape("andrew"), "SUB", stringVar)
'zzzandrew)zzz'
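To make the boundary positions concrete, the following sketch lists every index in " andrew) " where \b succeeds. The names `text`, `boundaries` and `with_paren` are illustrative and not part of the original answer.

```python
import re

text = " andrew) "

# \b is a zero-width assertion, so each match is empty;
# m.start() is the index at which the boundary sits.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)

# Index 1 is between " " and "a"; index 7 is between "w" and ")".
# There is no boundary at index 8 (between ")" and " "),
# which is where the trailing \b of the original pattern would need one.
with_paren = re.sub(r"\b%s\b" % re.escape("andrew)"), "SUB", text)
print(repr(with_paren))
```

The only boundaries are at indices 1 and 7, so a pattern that requires \b immediately after ")" has nowhere to match and the string comes back unchanged.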

If you have to keep the ')' as part of the escaped string, you can replace the trailing \b with a positive lookahead assertion, as below. It matches only if 'andrew)' is followed by a whitespace character (\s) or, more loosely, by any non-alphanumeric, non-underscore character (\W):

>>> stringVar = " andrew) "
>>> re.sub(r'\b%s(?=\s)' % re.escape("andrew)"), "SUB", stringVar)
' SUB '
>>> stringVar = "zzzandrew)zzz"
>>> re.sub(r'\b%s(?=\s)' % re.escape("andrew)"), "SUB", stringVar)
'zzzandrew)zzz'
>>> stringVar = " andrew) "
>>> re.sub(r'\b%s(?=\W)' % re.escape("andrew)"), "SUB", stringVar)
' SUB '
>>> stringVar = "zzzandrew)zzz"
>>> re.sub(r'\b%s(?=\W)' % re.escape("andrew)"), "SUB", stringVar)
'zzzandrew)zzz'
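One caveat: a lookahead for \s or \W needs an actual character to look at, so neither pattern above matches when 'andrew)' is the very last thing in the string. A possible alternative, not part of the original answer, is to use negative lookarounds that only require the absence of a word character on each side. The helper name `replace_token` is hypothetical.

```python
import re

def replace_token(token, replacement, text):
    """Replace token only where no word character touches it on either side."""
    # (?<!\w) : not preceded by a word character (also true at start of string)
    # (?!\w)  : not followed by a word character (also true at end of string)
    pattern = r"(?<!\w)%s(?!\w)" % re.escape(token)
    # Passing a function as repl keeps the replacement text literal,
    # so backslashes in it are not interpreted by re.sub.
    return re.sub(pattern, lambda m: replacement, text)

base_case = replace_token("andrew)", "SUB", " andrew) ")
embedded = replace_token("andrew)", "SUB", "zzzandrew)zzz")
at_end = replace_token("andrew)", "SUB", "andrew)")

# For comparison, the positive-lookahead pattern on the end-of-string input.
lookahead_at_end = re.sub(r"\b%s(?=\s)" % re.escape("andrew)"), "SUB", "andrew)")

print(repr(base_case), repr(embedded), repr(at_end), repr(lookahead_at_end))
```

With this sketch the base case and the end-of-string case are both replaced, the embedded case is left alone, and the (?=\s) version leaves the end-of-string input untouched. Whether that stricter or looser behavior is wanted depends on the data being processed.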
