Raw string and regular expression in Python


Problem description


I'm confused about how raw strings behave in the following code:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

print (r'(\d+)/(\d+)/(\d+)') #output: (\d+)/(\d+)/(\d+)

As I understand raw strings: without r, the backslash \ is treated as an escape character; with r, the backslash is treated literally as itself.

However, what I cannot understand in the above code is this: in the regular expression passed to re.sub, even though there is an r prefix, the \d inside is treated as a digit [0-9] instead of a backslash \ plus the letter d.

In the second print call, all characters are printed literally, as a raw string.

What is the difference?

Additional edit:

I tried the following four variations, with and without r:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)
text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

print (text2_re)
print (text2_re1)
print (text2_re2)
print (text2_re3)

And get the following output:

Could you explain these four cases specifically?

Solution

You're getting confused by the difference between a string and a string literal.

A string literal is what you put between " or ' in your source code; the Python interpreter parses it and puts the resulting string into memory. If you mark the literal as a raw string literal (using r'...'), the interpreter does not process backslash escapes before putting the string into memory. Once parsed, though, both kinds of literal are stored in exactly the same way.

This means that in memory there is no such thing as a raw string. Both the following strings are stored identically in memory with no concept of whether they were raw or not.

r'a regex digit: \d'  # a regex digit: \d
'a regex digit: \\d'  # a regex digit: \d

Both of these strings contain \d, and nothing records that one of them came from a raw string literal. So when you pass such a string to the re module, it sees \d and interprets it as a digit, because the re module has no way of knowing how the string was written in the source code.
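To make this concrete, here is a minimal sketch that compares the two literals from above. The variable names are only illustrative.

```python
raw = r'a regex digit: \d'
escaped = 'a regex digit: \\d'

# Both literals produce strings with identical contents.
same = raw == escaped

# The last two characters are a single backslash and the letter d.
tail = list(raw[-2:])

print(same)
print(tail)
print(len(raw), len(escaped))
```

Once the literals have been parsed, nothing distinguishes the two strings, which is why re treats them identically.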

In your specific example, to make the regular expression match a literal backslash followed by a literal d, you would use \\d like so:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\\d+)/(\\d+)/(\\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.

Alternatively, without using raw strings:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# In a normal literal, '\\d' is the two-character string \d, so this matches digits.
text_re = re.sub('(\\d+)/(\\d+)/(\\d+)', '\\3-\\1-\\2', text)
print (text_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
# '\\\\d' is the three-character string \\d, which matches a literal backslash followed by d.
text2_re = re.sub('(\\\\d+)/(\\\\d+)/(\\\\d+)', '\\3-\\1-\\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.
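As a small check of the escaping levels, the sketch below shows that the raw and non-raw spellings of the "literal backslash plus d" pattern are the same string, and that this pattern only matches text that really contains a backslash. The sample text and variable names are made up for illustration.

```python
import re

raw_pattern = r'\\d+'      # regex: a literal backslash, then one or more 'd'
plain_pattern = '\\\\d+'   # the same pattern written without the r prefix

patterns_equal = raw_pattern == plain_pattern

# Hypothetical sample text containing an actual backslash.
sample = r'code: \ddd'
matched = re.search(raw_pattern, sample).group(0)

# A date contains digits but no backslash, so this pattern finds nothing.
no_match = re.search(raw_pattern, '11/27/2012')

print(patterns_equal)
print(matched)
print(no_match)
```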

I hope that helps somewhat.

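Edit: I didn't want to complicate things, but because \d is not a valid string escape sequence, Python leaves it unchanged, so '\d' == r'\d' is true. Since \\ is a valid escape sequence, it gets changed to a single \, so you get '\d' == '\\d' == r'\d'. Note that recent Python versions emit a warning for unrecognized escapes such as '\d' in a non-raw literal, so the raw form is preferable. Strings get confusing sometimes.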
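The equality can be checked without putting an unrecognized escape into your own source file. In this sketch the text of the non-raw literal is held in a raw string and evaluated with warnings silenced. Using eval here is purely for demonstration, and the sketch assumes a Python version where an unrecognized escape is still only a warning rather than an error.

```python
import warnings

# Source text of the non-raw literal '\d', kept inside a raw string so that
# this file itself contains no unrecognized escape sequence.
literal_source = r"'\d'"

with warnings.catch_warnings():
    # Newer Python versions warn about unrecognized escapes such as \d.
    warnings.simplefilter("ignore")
    non_raw = eval(literal_source)

all_equal = non_raw == '\\d' == r'\d'

print(all_equal)
print(len(non_raw))
```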

Edit2: To answer your edit, let's look at each line specifically:

text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Hopefully this behaves as you expect now.

text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

Again, because \d is not a valid string escape, it doesn't get changed (see my first edit), so re.sub receives the same two strings, (\d+)/(\d+)/(\d+) and \3-\1-\2. In other words, r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)'. If you understand my first edit, you should see why these two cases behave the same.

text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This case is a bit different because \1, \2 and \3 are all valid string escape sequences (octal escapes). Each is replaced with the character whose code point, written in octal, is given by the number. That sounds complex, but it basically boils down to:

\1  # stands for the ascii start-of-heading character
\2  # stands for the ascii start-of-text character
\3  # stands for the ascii end-of-text character

This means that re.sub receives the same first string as in the first two examples ((\d+)/(\d+)/(\d+)), but the second string is actually <end-of-text>-<start-of-heading>-<start-of-text>. So re.sub replaces each match with exactly that second string. Since none of the three characters (\1, \2 or \3) is printable, what you see when the result is printed is typically a stock placeholder character for each of them.
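Using repr makes the difference between the raw and non-raw replacement strings visible. This is a minimal sketch built on the question's own pattern and text. The names raw_repl and plain_repl are introduced here only for illustration.

```python
import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern = r'(\d+)/(\d+)/(\d+)'

raw_repl = r'\3-\1-\2'    # 8 characters: backslashes are kept for re to interpret
plain_repl = '\3-\1-\2'   # 5 characters: three control characters and two hyphens

with_groups = re.sub(pattern, raw_repl, text2)
with_controls = re.sub(pattern, plain_repl, text2)

print(len(raw_repl), len(plain_repl))
print(repr(plain_repl))
print(repr(with_groups))
print(repr(with_controls))
```

With the raw replacement, re sees the backslashes and substitutes the captured groups. With the non-raw one, the backslashes were already consumed when the literal was parsed, so re just inserts the control characters.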

text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This behaves like the third example because r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)', as explained in the second example.
