How can I get all whitespaces in UTF-8 in Python?

Question

Possible Duplicate:
In Python, how to list all characters matched by POSIX extended regex `[:space:]`?

How can I get a list of all whitespaces in UTF-8 in Python? Including non-breaking space etc. I'm using python 2.7.

Recommended answer

unicodedata.category will tell you the category code for any given character; the characters you want have code Zs. There doesn't appear to be any way to extract a list of the characters within a category except by iterating over all of them:

>>> import sys, unicodedata
>>> for c in xrange(sys.maxunicode+1):
...     u = unichr(c)
...     if unicodedata.category(u) == 'Zs':
...         sys.stdout.write("U+{:04X} {}\n".format(c, unicodedata.name(u)))
... 
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

(Note: if you do this test using Python 3.4 or later, MONGOLIAN VOWEL SEPARATOR will not appear in the list. Python 2.7 shipped with data from Unicode 5.2; this character was reclassified as general category Cf ("formatting control") in Unicode 6.3, which is the version that Python 3.4 used for its data. See https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/ and https://www.unicode.org/L2/L2013/13004-vowel-sep-change.pdf for more detail than you probably require.)
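For readers on Python 3, where `xrange` and `unichr` no longer exist, the same scan can be sketched as below. This is only an illustration: the helper name `chars_in_category` is made up for this example, and the exact list it returns depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def chars_in_category(*categories):
    """Return every character whose general category is in `categories`.

    Hypothetical helper for illustration; it simply scans all code points,
    as the Python 2.7 loop above does.
    """
    wanted = set(categories)
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in wanted
    ]


if __name__ == "__main__":
    for ch in chars_in_category("Zs"):
        # The default argument avoids a ValueError for unnamed characters.
        print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<name missing>")))
```

Passing several categories, for example `chars_in_category("Zs", "Zl", "Zp")`, collects them in a single pass.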

You may also want to include categories Zl and Zp, which add

U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR

And you almost certainly do want to include all of the ASCII control characters that are normally considered whitespace -- for historical reasons (I presume), these are in category Cc.

U+0009 CHARACTER TABULATION  ('\t')
U+000A LINE FEED (LF)        ('\n')
U+000B LINE TABULATION       ('\v')
U+000C FORM FEED (FF)        ('\f')
U+000D CARRIAGE RETURN (CR)  ('\r')
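Putting the three pieces together (Zs, the optional Zl and Zp, and the five ASCII controls), a whitespace set could be assembled roughly as follows. This is a Python 3 sketch, not part of the original answer; the names `build_whitespace_set`, `WHITESPACE`, and `remove_whitespace` are invented for the example.

```python
import sys
import unicodedata

# The five ASCII controls listed above. They are in category Cc, so a
# category scan for Z* will not find them; they are added by hand.
ASCII_WHITESPACE_CONTROLS = "\t\n\v\f\r"


def build_whitespace_set():
    """Collect Zs, Zl and Zp characters plus the five ASCII controls."""
    separators = {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in ("Zs", "Zl", "Zp")
    }
    return frozenset(separators | set(ASCII_WHITESPACE_CONTROLS))


WHITESPACE = build_whitespace_set()


def remove_whitespace(text):
    """Drop every character of `text` that is in WHITESPACE."""
    return "".join(ch for ch in text if ch not in WHITESPACE)


if __name__ == "__main__":
    print(repr(remove_whitespace("a\u00a0b\tc\u2028d")))
```

Building the set once and reusing it avoids rescanning the whole code point range on every call.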

The other 60-odd Cc characters should not be considered whitespace, even if their official name makes it sound like they are whitespace. For instance, U+0085 NEXT LINE is almost never encountered in the wild with its official meaning; it's far more likely to be the result of an erroneous conversion from Windows-1252 to UTF-8 of U+2026 HORIZONTAL ELLIPSIS.
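The U+0085 mix-up is easy to reproduce. In the Python 3 sketch below, the sample bytes are made up for illustration: byte 0x85 is the ellipsis in Windows-1252, but decoding the same bytes as Latin-1 maps it to the control character U+0085, which then survives re-encoding to UTF-8.

```python
import unicodedata

# Hypothetical sample: text saved as Windows-1252, ending in an ellipsis.
raw = b"Wait\x85"

correct = raw.decode("cp1252")   # 0x85 is read as U+2026
wrong = raw.decode("latin-1")    # 0x85 is read as U+0085

print(unicodedata.name(correct[-1]))      # name of the correctly decoded character
print("U+{:04X}".format(ord(wrong[-1])))  # code point of the mis-decoded character
print(wrong.encode("utf-8"))              # the stray control character in UTF-8
```

A stray U+0085 in real data is therefore usually a sign of an encoding mistake upstream rather than an intentional line break.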

A closely-related question is "what does \s match in a Python regular expression?" Again the best available way to answer this question is to iterate over all characters:

>>> import re, sys, unicodedata
>>> s = re.compile(ur"^\s$", re.UNICODE)
>>> for c in xrange(sys.maxunicode+1):
...   u = unichr(c)
...   if s.match(u):
...      sys.stdout.write("U+{:04X} {}\n".format(
...        c, unicodedata.name(u, "<name missing>")))
...
U+0009 <name missing>
U+000A <name missing>
U+000B <name missing>
U+000C <name missing>
U+000D <name missing>
U+001C <name missing>
U+001D <name missing>
U+001E <name missing>
U+001F <name missing>
U+0020 SPACE
U+0085 <name missing>
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

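(unicodedata.name has no names for the control characters because the Unicode Character Database assigns them no formal name, only the label <control>; the names shown for them earlier are aliases. Again, if you do this test using Python 3.4 or later, MONGOLIAN VOWEL SEPARATOR will not appear in the list.)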

This is all of the Z* characters, all of the Cc characters that are generally agreed to be whitespace, and five extra characters that are not generally agreed to be whitespace, U+001C, U+001D, U+001E, U+001F, and U+0085. Inclusion of the last group is a bug, but a largely harmless one, since using those characters for anything is also a bug.
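The comparison described above can be checked mechanically. The following is a Python 3 sketch, not the original author's code: it collects what `\s` matches, builds the category-based set by hand, and prints whatever `\s` accepts beyond it. The variable names are illustrative, and the result depends on the interpreter's Unicode data.

```python
import re
import sys
import unicodedata

# In Python 3, str patterns are Unicode-aware by default, so the
# re.UNICODE flag used in the 2.7 session is not needed.
WS_RE = re.compile(r"\s")

# Everything the regex engine treats as whitespace.
regex_ws = {
    chr(cp) for cp in range(sys.maxunicode + 1) if WS_RE.fullmatch(chr(cp))
}

# The hand-built set: all Z* categories plus the five ASCII controls.
by_category = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) in ("Zs", "Zl", "Zp")
} | set("\t\n\v\f\r")

# Characters matched by \s that the category-based set leaves out.
extras = sorted(regex_ws - by_category)
for ch in extras:
    print("U+{:04X}".format(ord(ch)))
```

If the extra characters are unwanted, matching against an explicit character set built from `by_category` is one way to avoid them.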
