Python:如何检查 unicode 字符串是否包含大小写字符? [英] Python: How to check if a unicode string contains a cased character?

查看:44
本文介绍了Python:如何检查 unicode 字符串是否包含大小写字符?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在做一个过滤器,在其中检查 unicode(utf-8 编码)字符串是否不包含大写字符(在所有语言中).如果字符串根本不包含任何大小写字符,我也可以.

例如:你好!"不会通过过滤器,而是!"应该通过过滤器,因为!"不是大小写字符.

我计划使用 islower() 方法,但在上面的示例中,!".islower() 将返回 False.

根据 Python 文档,如果 unicode 字符串的大小写字符全部为小写且字符串至少包含一个大小写字符,python unicode 方法 islower() 返回 True,否则返回 False."

因为当字符串不包含任何大小写字符时,该方法也会返回 False,即.!",我想检查字符串是否包含任何大小写字符.

这样的事情......

string = unicode("!@#$%^", 'utf-8')#首先检查是否包含大小写字符如果不是 contains_cased(string):返回真返回 string.islower():

对 contains_cased() 函数有什么建议吗?

或者可能是不同的实现方法?

谢谢!

解决方案

这里 是 Unicode 字符类别的完整独家新闻.

信件类别包括:

Ll -- 小写卢——大写Lt -- 标题lm -- 修饰符罗——其他

注意Ll <->islower();Lu 类似;<代码>(Lu 或 Lt)<->istitle()

您可能希望阅读有关大小写的复杂讨论,其中包括对 Lm 字母的一些讨论.

盲目地将所有字母"视为大小写显然是错误的.Lo 类别包括 BMP 中的 45301 个代码点(使用 Python 2.6 计算).其中很大一部分是韩文音节、CJK 表意文字和其他东亚字符——很难理解如何将它们视为大小写".

您可能希望根据您期望的大小写字符"的(未指定)行为考虑另一种定义.这是一个简单的第一次尝试:

<预><代码>>>>cased = lambda c: c.upper() != c 或 c.lower() != c>>>sum(cased(unichr(i)) for i in xrange(65536))1970年>>>

有趣的是,有 1216 x Ll 和 937 x Lu,总共 2153 ... 进一步研究 Ll 和 Lu 真正含义的范围.

I'm doing a filter wherein I check if a unicode (utf-8 encoding) string contains no uppercase characters (in all languages). It's fine with me if the string doesn't contain any cased character at all.

For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.

I planned to use the islower() method, but in the example above, "!".islower() will return False.

According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."

Since the method also returns False when the string doesn't contain any cased character, ie. "!", I want to do check if the string contains any cased character at all.

Something like this....

string = unicode("!@#$%^", 'utf-8')

#check first if it contains cased characters
if not contains_cased(string):
     return True

return string.islower():

Any suggestions for a contains_cased() function?

Or probably a different implementation approach?

Thanks!

解决方案

Here is the full scoop on Unicode character categories.

Letter categories include:

Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other

Note that Ll <-> islower(); similarly for Lu; (Lu or Lt) <-> istitle()

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters -- very hard to understand how they might be considered "cased".

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>

Interestingly there are 1216 x Ll and 937 x Lu, a total of 2153 ... scope for further investigation of what Ll and Lu really mean.

这篇关于Python:如何检查 unicode 字符串是否包含大小写字符?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆