Why doesn't Python's `re.split()` split on zero-length matches?
One particular quirk of the (otherwise quite powerful) `re` module in Python is that `re.split()` will never split a string on a zero-length match. For example, if I want to split a string along word boundaries, I get:
>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!")
['Split', 'along', 'words,', 'preserve', 'punctuation!']
instead of
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
Why does it have this limitation? Is it by design? Do other regex flavors behave like this?
It was a deliberate design decision, and it could have gone either way. Tim Peters wrote this post to explain it:
For example, if you split "abc" by the pattern x*, what do you expect? The pattern matches (with length 0) at 4 places, but I bet most people would be surprised to get
['', 'a', 'b', 'c', '']
back instead of (as they do get)
['abc']
Some people disagree with him, though. Guido van Rossum doesn't want the behavior changed because of backwards compatibility. He did say:
I'm okay with adding a flag to enable this behavior though.
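To make Tim Peters's `x*` example concrete, here is a minimal sketch that lists where the zero-length matches fall and what a split at every one of them would return. The helper `split_on_every_match` is a hypothetical name made up for this illustration; it is not part of the `re` module and only shows the behavior being debated.

```python
import re

def split_on_every_match(pattern, text):
    """Split text at every match of pattern, zero-length matches included.

    Hypothetical helper for illustration only; not part of the re module.
    """
    pieces = []
    last_end = 0
    for match in re.finditer(pattern, text):
        # Keep the text between the previous match and this one.
        pieces.append(text[last_end:match.start()])
        last_end = match.end()
    # Keep whatever follows the final match.
    pieces.append(text[last_end:])
    return pieces

# x* matches the empty string before each character and at the end of "abc".
spans = [m.span() for m in re.finditer(r"x*", "abc")]
pieces = split_on_every_match(r"x*", "abc")

print(spans)   # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(pieces)  # ['', 'a', 'b', 'c', '']
```

The four empty matches are exactly the "4 places" from the quote, and splitting at each of them gives the list he expected most people to find surprising.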
Edit:
There is a workaround posted by Jan Burgy:
>>> s = "Split along words, preserve punctuation!"
>>> re.sub(r"\s+|\b", '\f', s).split('\f')
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
Here '\f' can be replaced by any character that does not occur in the string.
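One caveat: Python 3.7 and later changed `re.split()` so that it does split on empty matches, so the outputs shown above come from earlier versions and may differ on a current interpreter. If all you want is words and punctuation as separate tokens, a sketch that avoids zero-length matches altogether is to match the tokens with `re.findall()` rather than split on the gaps. The pattern below is my own assumption about what counts as a token, not something from the original answer, and it does not produce the leading empty string.

```python
import re

s = "Split along words, preserve punctuation!"

# \w+ matches a run of word characters;
# [^\w\s] matches a single character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", s)

print(tokens)
# ['Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
```

Since every match here is at least one character long, the result should not depend on how a given Python version treats empty matches.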