Python Regex escape operator \ in substitutions & raw strings

Question

I don't understand the logic behind the escape operator \ in Python regex when it is combined with the r' prefix of raw strings. Some help is appreciated.

Code:

import re
text=' esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)

The theory says: the backslash character ('\') either indicates special forms or allows special characters to be used without invoking their special meaning.

And as far as the link provided at the end of this question explains, r' denotes a raw string, i.e. symbols have no special meaning and the string stays exactly as written.

So in the above regex I would expect text2 and text3 to be different, since the substitution text in text2 is '\.', i.e. a period, whereas (in principle) the substitution text in text3 is r'\.', which is a raw string, i.e. the string should appear as it is, backslash and period. But they give the same result:

The result is:

text0=  esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation
text1=  esto.es  10. er- 12.23 with [  and.Other ] here is more; puntuation
text2=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
text3=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'

It looks to me as if neither r' nor the backslash works the same way in the substitution part. On the other hand, my intuition tells me I am missing something here.

EDIT 1: Following @Wiktor Stribiżew's comment. He pointed out (following his link) that:

import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results

gives:

ab
a6b

This confuses me even more.

Note: I read this Stack Overflow question about raw strings, which is very complete. Nevertheless, it does not cover substitutions.

Recommended answer

First of all,

replacement patterns ≠ regular expression patterns

We use a regex pattern to search for matches, and we use a replacement pattern to replace the matches found with the regex.

NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.

Replacement pattern syntax in Python

The re.sub docs are confusing because they mention string escape sequences that can be used in replacement patterns (like \n, \r), regex escape sequences (\6), and those that can be used as both regex and string escape sequences (\&).

I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and the term string escape sequence to denote a \ followed by a character or sequence of characters that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, you can only escape the quote character (which is why you can't end a raw string literal with a single backslash, as in r"\"), and even then the backslash is still part of the string.
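
The distinction between the two kinds of escape sequences can be checked directly on string literals, before any regex is involved. The following is a minimal sketch; the variable names are only illustrative.

```python
# A regular literal processes string escapes; a raw literal does not.
newline = '\n'        # one character: a line feed
raw_newline = r'\n'   # two characters: a backslash and the letter n
doubled = '\\n'       # also two characters: '\\' is one escaped backslash

print(len(newline), len(raw_newline), len(doubled))  # 1 2 2
print(doubled == raw_newline)                        # True

# In a raw literal a backslash still stops a quote from ending the literal,
# but the backslash itself stays in the string.
raw_quote = r'\''
print(len(raw_quote))                                # 2
```

So r'\X' and '\\X' are two spellings of the same two-character string, and that string is what the regex engine receives.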

So, in a replacement pattern, you may use backreferences:

re.sub(r'\D(\d)\D', r'\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b')  # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1

You can see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it will be parsed as a string escape sequence, a character with octal value 001. If you forget to use the r prefix with the unambiguous backreference, there is no problem, because \g is not a valid string escape sequence, so the \ escape character remains in the string. Read on in the docs I linked to:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.
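
As a quick check of the octal-escape point, the sketch below compares '\1' with r'\1' and then repeats the 'a\6b' versus r'a\6b' experiment from the question's edit. The results are shown with repr so that the control character is visible; the variable names are only illustrative.

```python
import re

# '\1' is a string escape: a single character with octal code 1.
print('\1' == '\x01')                          # True
# r'\1' and '\\1' are both the two characters backslash + 1.
print(r'\1' == '\\1')                          # True

as_char = re.sub(r'\D(\d)\D', '\1', 'a1b')     # replacement is the control char
as_group = re.sub(r'\D(\d)\D', r'\1', 'a1b')   # replacement is group 1
print(repr(as_char), repr(as_group))           # '\x01' '1'

# The same thing happens with 'a\6b' versus r'a\6b'.
plain = re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456')
raw = re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456')
print(repr(plain), repr(raw))                  # 'a\x06b' 'a6b'
```

This is why the first print in the question's edit appears to show only ab: the character between a and b is an unprintable control character, not a backreference.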

So, when you pass '\.' as a replacement string, you actually send the two-character combination \. as the replacement string, and that is why you get \. in the result.
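
The text2 and text3 cases from the question can be reproduced on a shortened input. In this sketch the literal '\.' is built with eval under suppressed warnings, only because recent Python versions emit a warning for unrecognized string escapes such as '\.'; that detour is a choice made for this example and is not part of the original answer.

```python
import re
import warnings

# Evaluate the literal '\.' at run time with warnings silenced.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    plain = eval(r"'\.'")

raw = r'\.'
print(plain == raw, len(plain))   # True 2

text = ' esto  .es  10  . er'
with_plain = re.sub(r'\s+\.', plain, text)
with_raw = re.sub(r'\s+\.', raw, text)
with_period = re.sub(r'\s+\.', '.', text)

print(with_plain == with_raw)     # True
print(repr(with_raw))             # ' esto\\.es  10\\. er'
print(repr(with_period))          # ' esto.es  10. er'
```

Both spellings hand the same two characters to re.sub, so the outputs cannot differ. To insert a plain period, pass '.' as the replacement.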

\ is a special character in Python replacement pattern

If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in the text2 and text3 cases.

That happens because \\, two literal backslashes, denotes a single backslash in the replacement pattern. If you have no Group 2 in your regex pattern but pass r'\2' as the replacement, intending to replace with the literal two-character combination \ and 2, you will get an error.
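
Both behaviours can be observed on a small input. This is a minimal sketch with illustrative variable names; the exact wording of the error message may differ between Python versions, so it is printed rather than compared.

```python
import re

text = 'a .b'

# r'\\.' is three characters: two backslashes and a period. The template
# turns the doubled backslash into one literal backslash.
doubled = re.sub(r'\s+\.', r'\\.', text)
print(doubled)                       # a\.b

# A numbered reference to a group the pattern does not define is rejected.
error_raised = False
try:
    re.sub(r'\s+\.', r'\2', text)
except re.error as exc:
    error_raised = True
    print('error:', exc)

# Doubling the backslash makes \2 literal text instead of a group reference.
literal_two = re.sub(r'\s+\.', r'\\2', text)
print(literal_two)                   # a\2b
```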

Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:

re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
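
As an alternative to doubling the backslashes, re.sub also accepts a function as the replacement, and the string that function returns is inserted without any escape processing. The sketch below compares the two approaches; sub_literal and user_replacement are hypothetical names introduced for this example, not part of the original answer.

```python
import re

def sub_literal(pattern, replacement, text):
    """Hypothetical helper: substitute `replacement` verbatim."""
    # A callable's return value is inserted as-is, with no template processing.
    return re.sub(pattern, lambda match: replacement, text)

user_replacement = r'C:\new\1'   # backslashes are meant literally here
escaped = user_replacement.replace('\\', '\\\\')

by_escaping = re.sub(r'X', escaped, 'aXb')
by_callable = sub_literal(r'X', user_replacement, 'aXb')
print(by_escaping == by_callable == r'aC:\new\1b')   # True
```

Without either precaution, \n in that replacement would be turned into a newline and \1 would be treated as a group reference.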
