Why doesn't Python's `re.split()` split on zero-length matches?

Views: 42 | Published: 2021/7/6 19:11:50 | Tags: python regex

Problem description

One particular quirk of the (otherwise quite powerful) re module in Python is that re.split() will never split a string on a zero-length match. For example, if I want to split a string along word boundaries, I get:

>>> import re
>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!")
['Split', 'along', 'words,', 'preserve', 'punctuation!']

instead of

['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']

Why does it have this limitation? Is it by design? Do other regex flavors behave like this?

Solution

This was a deliberate design decision, and it could have gone either way. Tim Peters explained the reasoning in a post:

For example, if you split "abc" by the pattern x*, what do you expect? The pattern matches (with length 0) at 4 places, but I bet most people would be surprised to get

['', 'a', 'b', 'c', '']

back instead of (as they do get)

['abc']
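The four zero-length matches mentioned in the quote can be listed with re.finditer(). This is only a small illustrative sketch; the variable name `spans` is not from the original post:

```python
import re

# x* can match the empty string, so on "abc" it matches once at every
# position: before 'a', before 'b', before 'c', and at the end of the string.
spans = [m.span() for m in re.finditer(r"x*", "abc")]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Each span has equal start and end offsets, which is what "zero-length" means here.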

Others disagree with him, though. Guido van Rossum did not want the behavior changed because of backward-compatibility concerns, but he did say:

I'm okay with adding a flag to enable this behavior though.
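For readers on a current interpreter: the behavior did eventually change. As of Python 3.7, re.split() splits on patterns that can match an empty string, provided the empty match is not adjacent to a previous empty match. A quick check, assuming Python 3.7 or newer (the sample string follows the example in the re documentation; the version guard and the name `parts` are illustrative):

```python
import re
import sys

parts = None
# Assumption: Python 3.7+. Older versions do not split on empty matches.
if sys.version_info >= (3, 7):
    # \b matches only the empty string at word boundaries.
    parts = re.split(r"\b", "Words, words, words.")
    print(parts)  # ['', 'Words', ', ', 'words', ', ', 'words', '.']
```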

Edit:

There is a workaround posted by Jan Burgy:

>>> s = "Split along words, preserve punctuation!"
>>> re.sub(r"\s+|\b", '\f', s).split('\f')
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']

Here '\f' can be replaced by any character that does not occur in the string.
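Note that on Python 3.7 and later, re.sub() also replaces empty matches that are adjacent to a previous non-empty match, so the workaround above may produce extra empty strings there. As an alternative sketch that does not rely on zero-length matches at all, the tokens can be matched directly with re.findall(). The pattern below is an assumption about what counts as a token (a run of word characters, or a single punctuation character) and is not part of the original answer:

```python
import re

s = "Split along words, preserve punctuation!"

# \w+ grabs a run of word characters; [^\w\s] grabs one punctuation character.
# Whitespace matches neither alternative, so it is simply skipped.
tokens = re.findall(r"\w+|[^\w\s]", s)
print(tokens)
# ['Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
```

Unlike the split-based result, this list has no leading empty string, because nothing is matched before the first word.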
