Python: splitting string by all space characters


Question

To split a string on whitespace in Python, one usually calls the string's split method with no arguments:

>>> 'a\tb c\nd'.split()
['a', 'b', 'c', 'd']

But yesterday I ran across a string that also used ZERO WIDTH SPACE between words. Having turned my new knowledge into a short black-magic performance (for the JavaScript folks), I would like to ask how best to split on all whitespace characters, since split alone is not enough:

>>> u'a\u200bc d'.split()
[u'a\u200bc', u'd']

UPD1

It seems the solution suggested by sth generally works, but depends on some OS settings or Python compilation options. It would be nice to know the reason for sure (and whether the setting can be switched on in Windows).

UPD2 cptphil found a great link that makes everything clear:

So I contacted the Unicode Technical Committee about the issue and promptly received a response. They pointed out that the ZWSP was once considered white space, but that this was changed in Unicode 4.0.1

A quotation from the Unicode site:

Changing U+200B Zero Width Space from Zs to Cf (2003.10.27)

There have been persistent problems with usage of the U+200B Zero Width Space (ZWSP). The function of this character is to allow a line break at positions where it normally would not be allowed, and is thus functionally a format character with a general category of Cf. This behavior is well documented in the Unicode Standard, and the character is not considered a Whitespace character in the Unicode Character Database. However, for historical reasons the general category is still Zs (Space Separator), which causes the character to be misused. ZWSP is also the only Zs character that is not Whitespace. The general category can cause misinterpretation of rule D13 Base character as allowing ZWSP as a base for combining marks.

The proposal is to change the general category of U+200B from Zs to Cf.

Resolution: Closed. The general category of U+200B will be changed from Zs to Cf in Unicode version 4.0.1.

The change was later reflected in Python: u'\u200B'.isspace() returns True in Python 2.5.4 and 2.6.5, but False in Python 2.7.1.
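
To check what a particular interpreter believes about these characters, a small sketch along the following lines should do. It assumes Python 3 (the snippets in this post are Python 2), and `describe` is only an illustrative helper name, not something taken from the linked discussion.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, isspace() result) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())


# ZERO WIDTH SPACE is a format character (Cf); HAIR SPACE is a space separator (Zs).
print(describe("\u200b"))  # ('ZERO WIDTH SPACE', 'Cf', False)
print(describe("\u200a"))  # ('HAIR SPACE', 'Zs', True)

# Version of the Unicode database this interpreter was built with.
print(unicodedata.unidata_version)
```

Any interpreter whose Unicode database is at version 4.0.1 or later should report Cf for U+200B, which matches the isspace() results quoted above.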

For other space characters, the regular split is enough:

>>> u'a\u200Ac'.split()
[u'a', u'c']
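
To see which characters count as "other space characters" on a given build, one option is to list every code point for which isspace() is true. This is a rough sketch that assumes Python 3, and also assumes that split() with no arguments uses the same whitespace test as isspace(), which is how CPython behaves as far as I know. `whitespace_code_points` is a made-up helper name.

```python
import sys


def whitespace_code_points():
    """Return every code point whose character str.isspace() reports as whitespace."""
    return [cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()]


spaces = whitespace_code_points()
print(["U+%04X" % cp for cp in spaces])

# HAIR SPACE (U+200A) is in the list, ZERO WIDTH SPACE (U+200B) is not.
print(0x200A in spaces, 0x200B in spaces)  # True False
```

The exact list depends on the Unicode version the interpreter ships with, so it is worth running this on the target system instead of hard-coding the result.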

And if that is not enough for you, add characters one by one as Gabi Purcaru suggests below.

Answer

Edit

It turns out that \u200b is not technically defined as whitespace, so Python does not recognize it as matching \s even with the Unicode flag on. It must therefore be treated as a non-whitespace character and added to the pattern explicitly.

http://en.wikipedia.org/wiki/Whitespace_character#Unicode

http://bugs.python.org/issue13391

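import re

# Python 2 syntax: the ur"" prefix is not valid in Python 3.
# \u200b is listed explicitly because \s does not match it.
re.split(ur"[\u200b\s]+", "some string", flags=re.UNICODE)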
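
For anyone on Python 3, here is a minimal sketch of the same idea. It assumes that \s already matches Unicode whitespace for str patterns (so re.UNICODE is not needed), and the names `EXTRA_SEPARATORS` and `split_all_spaces` are hypothetical, chosen only for this example. Further characters can be appended to `EXTRA_SEPARATORS` one by one.

```python
import re

# Characters to treat as separators in addition to whatever \s matches.
# Only ZERO WIDTH SPACE comes from this thread; extend the string as needed.
EXTRA_SEPARATORS = "\u200b"

# re.escape keeps the character class valid even if "-" or "]" is added later.
_SPLIT_RE = re.compile("[" + re.escape(EXTRA_SEPARATORS) + r"\s]+")


def split_all_spaces(text):
    """Split on Unicode whitespace plus EXTRA_SEPARATORS, dropping empty fields."""
    return [part for part in _SPLIT_RE.split(text) if part]


print(split_all_spaces("a\u200bc d"))  # ['a', 'c', 'd']
```

Empty fields are filtered out so that leading or trailing separators behave like str.split() with no arguments, where re.split on its own would return empty strings at the ends.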
