Match Unicode in ply's regexes


Problem description

I'm matching identifiers, but I've run into a problem: my identifiers are allowed to contain Unicode characters, so the old approach is no longer enough:

t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"

In my markup language parser I match Unicode characters by allowing every character except the ones I use explicitly, because that markup language has only two or three characters that need to be escaped that way.

How do I match all Unicode characters with Python regexes and ply? And is this a good idea at all?

I want to let people use names like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck, I want people to be able to write programs in their own language if that's practical! Unicode is supported in a wide variety of places nowadays anyway, and it should spread.

POSIX character classes don't seem to be recognised by Python regexes:

>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print item.match('e')
None

To explain my needs better: I need a regex that matches all printable Unicode characters but no ASCII characters at all.

r"\w" does part of what I want, but it does not match « », and I also need a regex that does not match digits.

Recommended answer

The re module supports the \w syntax, which the documentation describes as follows:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

The following example therefore shows how to match Unicode identifiers:

>>> import re
>>> m = re.compile(r'(?u)[^\W0-9]\w*')
>>> m.match('a')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('9')
>>> m.match('ab')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('a9')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('unicöde')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('ödipus')
<_sre.SRE_Match object at 0xb7d75410>
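The transcript above comes from a Python 2 session. The following is a minimal sketch of the same checks for Python 3, assuming only the standard re module. The helper name is_identifier is illustrative and not part of the original answer. In Python 3, str patterns are Unicode-aware by default, so the (?u) flag is redundant there but harmless.

```python
import re

# Same pattern as in the transcript, written as a raw string.
# [^\W0-9] means "a word character that is not an ASCII digit".
IDENTIFIER = re.compile(r'(?u)[^\W0-9]\w*')


def is_identifier(text):
    """Return True when the whole string is one identifier (illustrative helper)."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ['a', '9', 'a9', 'unicöde', 'ödipus', 'π', '«foo»']:
    print(repr(candidate), is_identifier(candidate))
```

Using fullmatch rather than match checks the entire string, so a name such as «foo» is rejected instead of being silently skipped or truncated. Note that [^\W0-9] excludes only the ASCII digits as a first character. If decimal digits from other scripts should be excluded as well, [^\W\d] may be a closer fit.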

So the expression you are looking for is: (?u)[^\W0-9]\w*
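The question also asks for names containing symbols such as « » and °, which \w does not cover because they are not classified as alphanumeric. One possible approach, offered here as an assumption rather than as part of the original answer, is to accept every non-ASCII character in addition to the usual ASCII identifier characters. The names NON_ASCII and identifier_re below are hypothetical. t_IDENTIFIER follows the naming used in the question. Since ply compiles token patterns in verbose mode, the sketch compiles the pattern with re.VERBOSE to check that it survives that mode.

```python
import re

# Hypothetical building block: any character outside the ASCII range.
NON_ASCII = r'[^\x00-\x7F]'

# First character: ASCII letter, underscore, or any non-ASCII character.
# Later characters: the same set plus ASCII digits.
t_IDENTIFIER = (
    r'(?:[A-Za-z_]|' + NON_ASCII + r')'
    r'(?:[A-Za-z0-9_]|' + NON_ASCII + r')*'
)

# Compile in verbose mode to mimic how a ply lexer would treat the pattern.
identifier_re = re.compile(t_IDENTIFIER, re.VERBOSE)

for candidate in ['Ω', '»', '«', '°', 'foo²', 'väli', 'π', '9abc', 'a b']:
    print(repr(candidate), identifier_re.fullmatch(candidate) is not None)
```

This variant is deliberately permissive: it also accepts non-ASCII whitespace and punctuation (for example a no-break space) inside a name, so a real grammar would probably want to carve out the characters it uses as operators or separators.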
