How Python and the regex module handle backslashes


Question

My current understanding of the Python 3.4 regex library, based on the language reference, does not seem to match the results of my experiments with the module.


My current understanding

The regular expression engine can be thought of as a separate entity with its own programming language that it understands (regex). It happens to be embedded in Python, as it is in a variety of other languages. As such, Python must pass the (regex) pattern/code to this independent interpreter, if you will.

For clarity, the following text uses the notion of logical length: how many characters the given string logically contains. For example, the special character carriage return \r has len=1 since it is a single character. However, the 2 distinct characters (a backslash followed by an r) \r have len=2.

1) Let's say we want to match a carriage return \r len=1 in some text.

2) We need to feed the pattern \r len=2 (2 distinct characters) to the regular expression engine.

3) The regular expression engine receives \r len=2 and interprets the pattern as: match the special character carriage return \r len=1.

4) It goes ahead and does the magic

The problem is that the backslash character \ itself is treated specially by the Python interpreter: in string literals it is the character used to escape other things (like quotes).

So when we are coding in Python and need to send the pattern \r len=2 to the internal regular expression interpreter, we must type pattern = '\\r' or alternatively pattern = r'\r' to express \r len=2.
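The "logical length" idea above can be checked directly with len(). The following is a minimal sketch; the variable names are made up for illustration and are not part of the original question.

```python
# Three spellings in Python source, and what each string actually holds.
cr = '\r'        # one character: carriage return (CR)
escaped = '\\r'  # two characters: a backslash, then the letter r
raw = r'\r'      # raw string: the same two characters as `escaped`

print(len(cr), len(escaped), len(raw))
print(escaped == raw)   # both spellings build the identical string
print(list(escaped))    # shows the two separate characters
```

Only the second and third spellings deliver the two-character regex escape to the engine; the first delivers the CR character itself.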


And everything is well... until I try a couple of experiments involving re.escape.


Summary of questions

1) Please confirm/modify my current understanding of the regex engine

2) Why do these supposedly non-textbook patterns match?

3) What on earth is going on with \\\r from re.escape, and the whole "we have the same string lengths, but we compare unequal, but we also all work the same when matching a carriage return in the previous re.search test"?

Solution

You need to understand that each time you write a pattern, it is first interpreted as a Python string literal, and the resulting string is then read and interpreted a second time by the regex engine. Let's describe what happens:

>>> import re
>>> s = '\r'

s contains the character CR.

>>> re.match('\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

Here the string '\r' is a string that contains CR, so a literal CR is given to the regex engine.

>>> re.match('\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

The string is now a literal backslash followed by a literal r. The regex engine receives these two characters, and since \r is a regex escape sequence that also means the CR character, you obtain a match here too.

>>> re.match('\\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

The string contains a literal backslash followed by a literal CR. The regex engine receives \ and CR, but since a backslash followed by CR isn't a known regex escape sequence, the backslash is ignored: the escaped CR simply stands for itself, and you obtain a match.
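The three cases can be put side by side, together with what re.escape produces for a CR. This is a minimal sketch, not the question's original experiment (which is not shown here); the dictionary labels and variable names are hypothetical.

```python
import re

s = '\r'  # subject text: a single carriage return (CR)

# Hypothetical labels mapped to the three pattern strings discussed above.
patterns = {
    'literal CR': '\r',        # 1 character: CR
    'regex escape': '\\r',     # 2 characters: backslash, r
    'backslash + CR': '\\\r',  # 2 characters: backslash, CR
}

# Record (pattern length, whether it matches s) for each pattern.
results = {
    label: (len(pattern), re.match(pattern, s) is not None)
    for label, pattern in patterns.items()
}
for label, (length, matched) in results.items():
    print(label, length, matched)

# re.escape puts a backslash in front of the CR character itself.
escaped = re.escape('\r')
print(escaped == '\\\r')  # compared with backslash + CR
print(escaped == '\\r')   # compared with backslash + r (same length)
```

The last two comparisons are the "same length but unequal" situation from the question: both strings have two characters, but the second character is CR in one and the letter r in the other, while the regex engine treats both as "match a CR".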

Note that for the regex engine, a literal backslash is the escape sequence \\ (so in Python source the pattern string is written r'\\' or '\\\\').
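As a short illustration of that last note, the sketch below searches for a literal backslash. The sample text and variable names are assumptions made for the example.

```python
import re

text = 'a\\b'  # three characters: a, backslash, b

# Both spellings produce the same two-character pattern: backslash, backslash.
raw_pattern = r'\\'
plain_pattern = '\\\\'
same_pattern = raw_pattern == plain_pattern

match = re.search(raw_pattern, text)
print(same_pattern, match.span(), len(match.group()))

# A pattern holding a single backslash is an incomplete escape for the engine.
try:
    re.compile('\\')
    lone_backslash_ok = True
except re.error:
    lone_backslash_ok = False
print(lone_backslash_ok)
```

The raw-string form is usually easier to read, because the pattern you type is exactly the pattern the regex engine receives.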
