Regular expressions and Unicode in Python: difference between sub and findall
Problem description
I am having difficulty figuring out a bug in my Python (2.7) script. sub and findall behave differently when recognizing special characters.
Here is the code:
>>> re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
u'Castaeda'
>>> re.findall(ur"[^-' ().,\w]+", u'Castañeda', re.UNICODE)
[]
When I use findall, it correctly sees ñ as an alphabetic character, but when I use sub it removes the ñ, treating it as a non-alphabetic character.
I've been able to get the correct functionality using findall with string.replace, but this seems like a bad solution. Also, I want to use re.split, and I'm having the same problems as with re.sub.
Thanks in advance for the help.
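The call signature of re.sub is:
re.sub(pattern, repl, string, count=0)
So
re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
is setting count to re.UNICODE, which has the value 32, so no flag is applied at all. re.findall(pattern, string, flags=0) takes flags as its third argument, which is why the same positional call works there.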
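The effect of a flag landing in count can be reproduced with a minimal sketch. It uses re.IGNORECASE (value 2) rather than re.UNICODE only because its effect is the same on Python 2.7 and Python 3, where str patterns are already Unicode-aware. The sample string is made up for illustration. The flags keyword used in the last call is accepted by re.sub from Python 2.7 onward.

```python
import re

# Flags are integer-valued, so they are accepted wherever an int is expected.
print(int(re.UNICODE))     # 32
print(int(re.IGNORECASE))  # 2

text = "aAaA"  # made-up sample string

# What a positional flag in the fourth slot amounts to: count=2, no flags.
# Only the lowercase 'a' characters match, and at most 2 are replaced.
swallowed = re.sub(r"a", "-", text, count=int(re.IGNORECASE))
print(swallowed)   # -A-A

# Passing the flag by keyword applies it as a flag: all four letters match.
by_keyword = re.sub(r"a", "-", text, flags=re.IGNORECASE)
print(by_keyword)  # ----
```

Writing flags=... explicitly avoids the mix-up regardless of how many optional positional parameters a function has.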
Try instead:
In [57]: re.sub(ur"(?u)[^-' ().,\w]+", '', u'Castañeda')
Out[57]: u'Casta\xf1eda'
Placing (?u) at the beginning of the regex is an alternate way to specify the re.UNICODE flag in the regex itself. You can also set the other flags (?iLmsux) this way. (For more info, see the documentation for the re module and search for "(?iLmsux)".)
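As a sketch of the two equivalent spellings, the following compares the inline (?u) form with the flags keyword (available in re.sub from Python 2.7). The name is written with an escape (\xf1 for ñ) so the snippet does not depend on source-file encoding. On Python 3 the flag is redundant for str patterns, so both calls simply leave the name intact.

```python
import re

name = u"Casta\xf1eda"       # u'Castañeda', written with an escape for ñ
pattern = r"[^-' ().,\w]+"   # same character class as in the question

# Inline flag: (?u) at the very start of the pattern.
inline = re.sub(r"(?u)" + pattern, u"", name)

# Keyword flag: leaves count at its default of 0 (replace all matches).
keyword = re.sub(pattern, u"", name, flags=re.UNICODE)

# With the flag applied, ñ counts as a word character and nothing is removed.
print(inline == name)   # True
print(keyword == name)  # True
```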
Similarly, the call signature of re.split is:
re.split(pattern, string, maxsplit=0)
The solution is the same.
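For re.split, a flag passed positionally in the third slot would be read as maxsplit. A minimal sketch, assuming a made-up input that uses ';' and '|' as separators, passes the flag by keyword instead (accepted by re.split from Python 2.7 onward). The inline (?u) form would work equally well.

```python
import re

pattern = r"[^-' ().,\w]+"            # same character class as in the question
names = u"Casta\xf1eda;O'Neil|Smith"  # made-up sample input; \xf1 is ñ

# flags=... keeps maxsplit at its default of 0 (no limit on splits).
parts = re.split(pattern, names, flags=re.UNICODE)
print(parts)
```

Here ';' and '|' fall outside the allowed set and act as separators, while ñ and the apostrophe are kept inside the names.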