What is the proper regular expression to match all utf-8/unicode lowercase letter forms
Problem description

I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.

I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.

[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?

FYI I'm using Python, but I suspect that this problem is cross-language.

Python's builtin "islower()" method seems to do the right checking:
```python
# Python 2: collect every character in the Basic Multilingual Plane
# that islower() reports as lowercase.
lower = ''
for c in xrange(0, 2**16):
    if unichr(c).islower():
        lower += unichr(c)
print lower
```

Answer

Python's re module does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library which does support them.

Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.

Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other).

For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
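If adding a library is not an option, one possible workaround is to build the character class yourself from the standard unicodedata module, which exposes each character's general category. The sketch below assumes Python 3 (the question's code is Python 2), and the names lowercase_letter_class and LOWERCASE_LETTER are hypothetical, chosen only for illustration. It approximates \p{Ll} for whatever Unicode version your interpreter ships with; it is not a drop-in replacement for real property support.

```python
import re
import sys
import unicodedata


def lowercase_letter_class(limit=sys.maxunicode + 1):
    """Return a regex character class listing every code point below
    `limit` whose Unicode general category is Ll (Letter, lowercase)."""
    chars = [
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp)) == "Ll"
    ]
    # Letters are never regex metacharacters, so no escaping is needed.
    return "[" + "".join(chars) + "]"


# Hypothetical name: a compiled stand-in for \p{Ll}.
LOWERCASE_LETTER = re.compile(lowercase_letter_class())

# List the lowercase letters found in a mixed-case, accented string.
print(LOWERCASE_LETTER.findall("Stra\u00dfe \u00c9t\u00e9 ABC abc"))
```

Building the class once at import time keeps matching fast, at the cost of a fairly long pattern. Passing a smaller limit, such as 0x100 for the Latin-1 range that covers most EFIGS text, keeps the pattern short.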