Why doesn't Python's `re.split()` split on zero-length matches?

Problem description

One particular quirk of the (otherwise quite powerful) re module in Python is that re.split() will never split a string on a zero-length match. For example, if I try to split a string along word boundaries, I get:

>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!")
['Split', 'along', 'words,', 'preserve', 'punctuation!']

instead of

['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']

Why does it have this limitation? Is it by design? Do other regex flavors behave like this?

Solution

It was a deliberate design decision, and it could have gone either way. Tim Peters made this post to explain:

For example, if you split "abc" by the pattern x*, what do you expect? The pattern matches (with length 0) at 4 places, but I bet most people would be surprised to get

['', 'a', 'b', 'c', '']

back instead of (as they do get)

['abc']
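
To see the four zero-length matches mentioned in the quote, one can list them with re.finditer. This is only an illustrative sketch; the variable names are made up for the example.

```python
import re

# x* can match the empty string, so it matches at every position in "abc",
# including the position just past the last character.
matches = [(m.start(), m.end()) for m in re.finditer(r"x*", "abc")]

# Each (start, end) pair with start == end is a zero-length match.
zero_length_positions = [start for start, end in matches if start == end]
print(zero_length_positions)  # prints [0, 1, 2, 3]
```

Every match here is empty, which is why splitting on all of them would cut the string between every pair of characters.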

Some others disagree with him though. Guido van Rossum doesn't want it changed due to backwards compatibility issues. He did say:

I'm okay with adding a flag to enable this behavior though.
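
For what it's worth, as far as I know the behavior was eventually changed without a flag: starting with Python 3.7, re.split() does split on empty matches. The sketch below assumes Python 3.7 or later. Because an empty match right next to a whitespace match can leave empty strings in the result, it filters them out. The names s, parts and tokens are just for illustration.

```python
import re

s = "Split along words, preserve punctuation!"

# Assumption: Python 3.7+, where re.split() also splits on zero-length matches.
parts = re.split(r"\s+|\b", s)

# Drop the empty strings that appear when a match sits at the very start
# of the string or directly after another match.
tokens = [p for p in parts if p]
print(tokens)
```

On older versions the same call ignores the empty matches, so the result there looks like the first output in the question.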

Edit:

There is a workaround posted by Jan Burgy:

>>> s = "Split along words, preserve punctuation!"
>>> re.sub(r"\s+|\b", '\f', s).split('\f')
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']

Here '\f' can be replaced by any character that does not occur in the string.
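
If the goal is just to get words and punctuation as separate tokens, a different technique avoids zero-length matches altogether: describe the tokens with re.findall instead of describing the separators. This is not the workaround above but an alternative sketch; the pattern and the helper name tokenize are my own assumptions about what should count as a token.

```python
import re

def tokenize(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    # \w+       : a run of word characters
    # [^\w\s]+  : a run of characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

print(tokenize("Split along words, preserve punctuation!"))
# prints ['Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
```

Unlike the substitution trick, this needs no placeholder character and produces no leading empty string.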
