What exactly is a "raw string regex" and how can you use it?

Question

From the Python documentation on regex, regarding the '\' character:

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

What is this raw string notation? If you use a raw string format, does that mean "*" is taken as a literal character rather than a zero-or-more indicator? That obviously can't be right, or else regex would completely lose its power. But then if it's a raw string, how does it recognize newline characters if "\n" is literally a backslash and an "n"?

I don't follow.

Edit for bounty:

I'm trying to understand how a raw string regex matches newlines, tabs, and character sets, e.g. \w for words or \d for digits or all whatnot, if raw string patterns don't recognize backslashes as anything more than ordinary characters. I could really use some good examples.

Recommended answer

Zarkonnen's response does answer your question, but not directly. Let me try to be more direct, and see if I can grab the bounty from Zarkonnen.

You will perhaps find this easier to understand if you stop using the terms "raw string regex" and "raw string patterns". These terms conflate two separate concepts: the representation of a particular string in Python source code, and the regular expression that string represents.

In fact, it's helpful to think of these as two different programming languages, each with their own syntax. The Python language has source code that, among other things, builds strings with certain contents, and calls the regular expression system. The regular expression system has source code that resides in string objects, and matches strings. Both languages use backslash as an escape character.

First, understand that a string is a sequence of characters (i.e. bytes or Unicode code points; the distinction doesn't much matter here). There are many ways to represent a string in Python source code. A raw string is simply one of these representations. If two representations result in the same sequence of characters, they produce equivalent behaviour.

Imagine a 2-character string, consisting of the backslash character followed by the n character. If you know that the character value for backslash is 92, and for n is 110, then this expression generates our string:

s = chr(92)+chr(110)
print len(s), s

2 \n

The conventional Python string notation "\n" does not generate this string. Instead it generates a one-character string with a newline character. The Python docs 2.4.1. String literals say, "The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character."

s = "\n"
print len(s), s

1 
 

(Note that the newline isn't visible in this example, but if you look carefully, you'll see a blank line after the "1".)

To get our two-character string, we have to use another backslash character to escape the special meaning of the original backslash character:

s = "\\n"
print len(s), s

2 \n

What if you want to represent strings that have many backslash characters in them? Python docs 2.4.1. String literals continue, "String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences." Here is our two-character string, using raw string representation:

s = r"\n"
print len(s), s

2 \n
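
As a small illustration of why this matters once backslashes pile up, the sketch below writes the same string twice, once conventionally and once raw. The string itself is a made-up example, not taken from the Python docs, and the code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
# Hypothetical example string: digits, a literal dot, then more digits.
conventional = "\\d+\\.\\d+"   # every backslash has to be doubled
raw = r"\d+\.\d+"              # raw literal: backslashes are kept as typed

same = (conventional == raw)
length = len(raw)

print(same)    # True
print(length)  # 8
```

Both literals build the identical 8-character string; the raw form is just easier to read and to type correctly.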

So we have three different string representations, all giving the same string, or sequence of characters:

print chr(92)+chr(110) == "\\n" == r"\n"
True

Now, let's turn to regular expressions. The Python docs, 7.2. re - Regular expression operations, say, "Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals..."

If you want a Python regular expression object which matches a newline character, then you need a 2-character string, consisting of the backslash character followed by the n character. The following lines of code all set prog to a regular expression object which recognises a newline character:

import re

prog = re.compile(chr(92)+chr(110))
prog = re.compile("\\n")
prog = re.compile(r"\n")
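
A quick check, offered as a sketch rather than as part of the original answer: each of the three spellings should compile to a pattern that matches a single newline. The names patterns and results are illustrative only.

```python
import re

# Three spellings of the same 2-character pattern string: backslash, n.
patterns = [chr(92) + chr(110), "\\n", r"\n"]

# Each compiled pattern is tried against a string holding one newline.
results = [re.compile(p).match("\n") is not None for p in patterns]

print(results)  # [True, True, True]
```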

So why is it that "Usually patterns will be expressed in Python code using this raw string notation."? Because regular expressions are frequently static strings, which are conveniently represented as string literals. And from the different string literal notations available, raw strings are a convenient choice, when the regular expression includes a backslash character.
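
One case where the choice of literal visibly changes the result is \b. Python's conventional string syntax reads it as a backspace character, while the regular expression language reads backslash-b as a word boundary. The sketch below uses invented sample text to show the difference.

```python
import re

text = "a word here"  # sample text, chosen only for illustration

# Conventional literal: "\b" becomes a single backspace character (value 8),
# so this pattern looks for backspace + "word" + backspace.
conventional = re.search("\bword\b", text)

# Raw literal: the regex system receives backslash-b and treats it
# as a word boundary.
raw = re.search(r"\bword\b", text)

print(conventional is None)  # True
print(raw.group(0))          # word
```

The conventional literal would need "\\bword\\b" to produce the same pattern as the raw one.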

Questions

Q: What about the expression re.compile(r"\s\tWord")? A: It's easier to understand by separating the string from the regular expression compilation, and understanding them separately.

s = r"\s\tWord"
prog = re.compile(s)

The string s contains eight characters: a backslash, an s, a backslash, a t, and then four characters Word.

Q: What happens to the tab and space characters? A: At the Python language level, string s doesn't have tab or space characters. It starts with four characters: backslash, s, backslash, t. The regular expression system, meanwhile, treats that string as source code in the regular expression language, where it means "match a string consisting of a whitespace character, a tab character, and the four characters Word".
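
Both levels can be checked side by side. This is a minimal sketch; the subject string is an invented example of text the pattern should accept.

```python
import re

s = r"\s\tWord"
prog = re.compile(s)

# Python level: eight characters, none of them a tab or a space.
length = len(s)
has_tab_or_space = ("\t" in s) or (" " in s)
print(length)            # 8
print(has_tab_or_space)  # False

# Regex level: a whitespace character, then a tab, then "Word".
subject = " \tWord"      # space, tab, W, o, r, d
matched = prog.match(subject) is not None
print(matched)           # True
```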

Q: How do you match those if that's being treated as backslash-s and backslash-t? A: Maybe the question is clearer if the words 'you' and 'that' are made more specific: how does the regular expression system match the expressions backslash-s and backslash-t? As 'any whitespace character' and as 'tab character'.
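
The same reasoning extends to the other backslash forms in the regular expression language, such as \d for digits and \w for word characters. A minimal sketch, with sample input invented for illustration:

```python
import re

# The raw literals hand backslash-d and backslash-w to the regex system
# unchanged; the regex system then gives them their special meaning.
digits = re.findall(r"\d+", "room 101, floor 3")
words = re.findall(r"\w+", "raw strings, explained")

print(digits)  # ['101', '3']
print(words)   # ['raw', 'strings', 'explained']
```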

Q: Or what if you have the 3-character string backslash-n-newline? A: In the Python language, the 3-character string backslash-n-newline can be represented as the conventional string "\\n\n", or as a raw plus conventional string r"\n" "\n", or in other ways. Used as a pattern, that 3-character string makes the regular expression system match any two consecutive newline characters.
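
This last case can also be checked directly. The sketch below assumes the default compile flags (in particular, not re.VERBOSE, under which whitespace in a pattern is ignored); the subject strings are invented.

```python
import re

# Two spellings of the 3-character string: backslash, n, newline.
p1 = "\\n\n"
p2 = r"\n" "\n"   # adjacent literals are concatenated by Python

same = (p1 == p2)
length = len(p1)
print(same)    # True
print(length)  # 3

# As a pattern: the regex escape \n, followed by a literal newline character.
prog = re.compile(p1)
two_newlines = prog.search("a\n\nb") is not None
one_newline = prog.search("a\nb") is not None
print(two_newlines)  # True
print(one_newline)   # False
```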

N.B. All examples and document references are to Python 2.7.

Update: Incorporated clarifications from answers of @Vladislav Zorov and @m.buettner, and from follow-up question of @Aerovistae.
