Regular expressions and Unicode in Python: difference between sub and findall

Question

I am having difficulty tracking down a bug in my Python (2.7) script. I am getting different results from sub and findall when recognizing special characters.

Here is the code:

>>> re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)
u'Castaeda'
>>> re.findall(ur"[^-' ().,\w]+", u'Castañeda', re.UNICODE)
[]

When I use findall, it correctly sees ñ as an alphabetic character, but when I use sub it replaces this--treating it as a non-alphabetic character.

I've been able to get the correct functionality using findall with string.replace, but this seems like a bad solution. Also, I want to use re.split, and I'm having the same problems as with re.sub.

Thanks in advance for the help.

Solution

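In Python 2.7, the call signature of re.sub is:

re.sub(pattern, repl, string, count=0, flags=0)

So

re.sub(ur"[^-' ().,\w]+", '' , u'Castañeda', re.UNICODE)

passes re.UNICODE as the fourth positional argument, which is count, not flags. re.UNICODE has the value 32, so the call allows up to 32 replacements with no flags set; \w then does not match ñ, and the character is removed. By contrast, re.findall(pattern, string, flags=0) takes flags as its third argument, which is why the same style of call works there.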
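The effect of the misplaced argument can be reproduced in a short sketch. It is written for Python 3 rather than the 2.7 of the question, so the ur"" prefix is dropped, and re.ASCII is used as a stand-in (an assumption of this sketch) for Python 2's default behaviour, where \w does not cover ñ unless re.UNICODE is set.

```python
import re

# Python 3 sketch; the original question targets Python 2.7.
pattern = r"[^-' ().,\w]+"
name = "Casta\u00f1eda"  # "Castañeda"

# re.UNICODE is an integer flag; this is the number that ended up in count.
unicode_value = int(re.UNICODE)

# Roughly what the original call did: count=32 and no Unicode matching.
# Assumption: re.ASCII mimics Python 2's default treatment of \w.
mistaken = re.sub(pattern, "", name, count=unicode_value, flags=re.ASCII)

# The flag passed by keyword: \w matches ñ, so nothing is stripped.
intended = re.sub(pattern, "", name, flags=re.UNICODE)

print(unicode_value)  # 32
print(mistaken)       # Castaeda
print(intended)       # Castañeda
```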

Try instead:

In [57]: re.sub(ur"(?u)[^-' ().,\w]+", '', u'Castañeda')
Out[57]: u'Casta\xf1eda'

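Placing (?u) at the beginning of the regex is an alternative way to specify the re.UNICODE flag within the pattern itself. The other flags can be set the same way, using the letters in (?iLmsux). (For more information, see the documentation for the re module and search for "(?iLmsux)".) In Python 2.7, re.sub also accepts a flags keyword argument, so passing flags=re.UNICODE by keyword works as well.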
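As a rough illustration, the inline form and the keyword form can be compared side by side. This is again a Python 3 sketch, where (?u) is accepted on str patterns but redundant because Unicode matching is already the default; the variable names are only for illustration.

```python
import re

name = "Casta\u00f1eda"  # "Castañeda"

# Flag embedded at the start of the pattern itself.
inline = re.sub(r"(?u)[^-' ().,\w]+", "", name)

# Same flag passed by keyword, so it cannot be mistaken for count.
keyword = re.sub(r"[^-' ().,\w]+", "", name, flags=re.UNICODE)

# Several inline flags can be combined, here IGNORECASE and UNICODE.
combined = re.findall(r"(?iu)casta\w+", name)

print(inline == keyword == name)  # True
print(combined)                   # ['Castañeda']
```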

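Similarly, the call signature of re.split in Python 2.7 is:

re.split(pattern, string, maxsplit=0, flags=0)

Here a flag passed as the third positional argument is taken as maxsplit. The solution is the same: put (?u) in the pattern, or pass flags=re.UNICODE by keyword.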
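A small sketch of the re.split case, under the same assumptions as before: Python 3 syntax, with re.ASCII standing in for Python 2's default \w. The input string is a made-up example, not taken from the question.

```python
import re

pattern = r"[^-' ().,\w]+"
text = "Casta\u00f1eda/L\u00f3pez"  # hypothetical input: "Castañeda/López"

# Flag by keyword: only "/" counts as a separator.
good = re.split(pattern, text, flags=re.UNICODE)

# Roughly what a misplaced re.UNICODE does: maxsplit=32, no Unicode \w.
# Assumption: re.ASCII mimics Python 2's default treatment of \w.
bad = re.split(pattern, text, maxsplit=int(re.UNICODE), flags=re.ASCII)

print(good)  # ['Castañeda', 'López']
print(bad)   # ['Casta', 'eda', 'L', 'pez']
```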
