Python regex: splitting on pattern match that is an empty string
Problem description
With the re module, it seems that I am unable to split on pattern matches that are empty strings:
>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']
In other words, even if a match is found, if it's the empty string, even re.split cannot split the string.
The docs for re.split seem to support my results.
A "workaround" was easy enough to find for this particular case:
>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']
But this is an error-prone way of doing it because then I have to beware of strings that already contain the substring that I'm splitting on:
>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']
Is there any better way to split on an empty pattern match with the re module? Additionally, why does re.split not allow me to do this in the first place? I know it's possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript's built-in String.prototype.split().
Unfortunately, split requires a non-zero-width match. This hasn't been fixed yet because quite a lot of incorrect code depends on the current behaviour, for example by using [something]* as the regex. Such patterns now generate a FutureWarning, and patterns that can never split anything raise a ValueError from Python 3.5 onwards:
>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.
The idea is that after a certain period of warnings, the behaviour can be changed so that your regular expression would work again.
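That change has since landed: from Python 3.7 onwards, re.split splits on empty matches. A minimal check, assuming the interpreter is 3.7 or later (on 3.5 and 3.6 the same call raises the ValueError shown above):

```python
import re

# Assumes Python 3.7+, where re.split accepts patterns that match the empty string.
pattern = r'(?<!foo)(?=bar)'
parts = re.split(pattern, 'foobarbarbazbar')
print(parts)
```

The pattern matches the empty position before each "bar" that is not preceded by "foo", so the string is cut at those positions and no characters are consumed.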
If you can't use the third-party regex module, you can write your own split function using re.finditer():
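import re

def megasplit(pattern, string):
    # Record the (start, end) span of every match, including empty ones.
    splits = list((m.start(), m.end()) for m in re.finditer(pattern, string))
    # Each piece starts where the previous match ended...
    starts = [0] + [i[1] for i in splits]
    # ...and ends where the next match begins.
    ends = [i[0] for i in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit(r'o', 'foobarbarbazbar'))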
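The same idea can be extended with a limit on the number of splits, similar to the maxsplit argument of re.split. The sketch below is only an illustration: the name megasplit_max is hypothetical, and unlike re.split it does not return the text of capturing groups.

```python
import re
from itertools import islice

def megasplit_max(pattern, string, maxsplit=0):
    """Split at every match of pattern; stop after maxsplit cuts (0 = no limit)."""
    matches = re.finditer(pattern, string)
    if maxsplit > 0:
        # Only look at the first maxsplit matches.
        matches = islice(matches, maxsplit)
    pieces = []
    pos = 0  # index where the current piece begins
    for m in matches:
        pieces.append(string[pos:m.start()])
        pos = m.end()
    # Whatever follows the last match is the final piece.
    pieces.append(string[pos:])
    return pieces

print(megasplit_max(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit_max(r'(?<!foo)(?=bar)', 'foobarbarbazbar', maxsplit=1))
```

Walking the matches once and tracking the current position avoids building the separate starts and ends lists, which makes it easy to stop early.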
If you are sure that the matches are zero-width only, you can use the starts of the splits for easier code:
import re

def zerowidthsplit(pattern, string):
    # Zero-width matches start and end at the same index, so start() suffices.
    splits = list(m.start() for m in re.finditer(pattern, string))
    starts = [0] + splits
    ends = splits + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
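Since this shortcut silently keeps the matched text when a match is not empty, it may be worth checking the zero-width assumption explicitly. The following is a sketch only; strict_zerowidthsplit is a hypothetical name, and raising ValueError is one possible choice of behaviour.

```python
import re

def strict_zerowidthsplit(pattern, string):
    """Split at zero-width matches; refuse patterns that consume characters."""
    cuts = []
    for m in re.finditer(pattern, string):
        if m.start() != m.end():
            # A non-empty match breaks the assumption this helper relies on.
            raise ValueError('pattern matched non-empty text: %r' % m.group())
        cuts.append(m.start())
    starts = [0] + cuts
    ends = cuts + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(strict_zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
```

With a consuming pattern such as r'o', this version raises instead of returning pieces that still contain the matched characters.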