Python regex: splitting on pattern match that is an empty string


Question

With the re module, it seems that I am unable to split on pattern matches that are empty strings:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']

In other words, even when a match is found, re.split does not split the string if that match is the empty string.

The docs for re.split seem to support my results.

A "workaround" was easy enough to find for this particular case:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']

But this is an error-prone way of doing it because then I have to beware of strings that already contain the substring that I'm splitting on:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']

Is there any better way to split on an empty pattern match with the re module? Additionally, why does re.split not allow me to do this in the first place? I know it's possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript's built-in String.prototype.split().

Answer

Unfortunately, re.split requires a non-zero-width match. This has not been fixed yet because quite a lot of incorrect code depends on the current behaviour, for example by using [something]* as the regex. From Python 3.5 onwards, patterns that can match an empty string generate a FutureWarning, and patterns that can only match an empty string (and so could never split anything) raise a ValueError:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.

The idea is that, after a period of warnings, the behaviour can be changed so that your regular expression works again. That change arrived in Python 3.7, where re.split() does split on empty matches.
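As a quick check of the post-warning behaviour, here is a minimal sketch that assumes Python 3.7 or later, where re.split() splits on empty matches. The helper name split_on_empty is made up for this illustration and is not part of the re module.

```python
import re
import sys

# Zero-width pattern from the question: the position before "bar",
# unless "foo" comes immediately before that position.
PATTERN = r'(?<!foo)(?=bar)'

def split_on_empty(text):
    """Split text at the zero-width matches of PATTERN with plain re.split."""
    if sys.version_info < (3, 7):
        # Assumption: older interpreters ignore or reject empty matches here.
        raise RuntimeError("re.split needs Python 3.7+ to split on empty matches")
    return re.split(PATTERN, text)

print(split_on_empty('foobarbarbazbar'))  # ['foobar', 'barbaz', 'bar']
print(split_on_empty('foobarbarquxbar'))  # ['foobar', 'barqux', 'bar']
```

The second call uses the string that broke the 'qux' placeholder workaround in the question. No placeholder is inserted, so the existing 'qux' stays inside its piece.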


If you can't use the third-party regex module from PyPI, you can write your own split function using re.finditer():

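import re

def megasplit(pattern, string):
    # (start, end) span of every match, including zero-width ones
    splits = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    # each piece starts where the previous match ended...
    starts = [0] + [i[1] for i in splits]
    # ...and ends where the next match begins
    ends = [i[0] for i in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))  # ['foobar', 'barbaz', 'bar']
print(megasplit(r'o', 'foobarbarbazbar'))                # ['f', '', 'barbarbazbar']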
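For patterns that never match the empty string, megasplit should agree with re.split. The sketch below checks that on a few sample inputs. The samples list is made up for illustration, and the comparison only holds for patterns without capturing groups.

```python
import re

def megasplit(pattern, string):
    # Same approach as above: record each match span, then slice between spans.
    spans = [(m.start(), m.end()) for m in re.finditer(pattern, string)]
    starts = [0] + [end for _, end in spans]
    ends = [start for start, _ in spans] + [len(string)]
    return [string[s:e] for s, e in zip(starts, ends)]

# Hypothetical sample inputs: every pattern here consumes at least one character.
samples = [
    (r'o', 'foobarbarbazbar'),
    (r',\s*', 'a, b,c,  d'),
    (r'\d+', 'x1y22z'),
]

for pattern, text in samples:
    # Both functions should return the same pieces for these patterns.
    assert megasplit(pattern, text) == re.split(pattern, text)
    print(pattern, megasplit(pattern, text))
```

Two differences from re.split are worth keeping in mind: this megasplit has no maxsplit argument, and it does not include the text of capturing groups in the result.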

If you are sure that every match is zero-width, each match's start equals its end, so you only need the start positions and the code gets simpler:

import re

def zerowidthsplit(pattern, string):
    # positions of the zero-width matches, i.e. where to cut
    splits = [m.start() for m in re.finditer(pattern, string)]
    starts = [0] + splits
    ends = splits + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))  # ['foobar', 'barbaz', 'bar']
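As a usage illustration, the same function can split on other zero-width patterns, such as the boundary between a lowercase and an uppercase letter. The camel-case input below is a made-up example, not something from the question.

```python
import re

def zerowidthsplit(pattern, string):
    # Cut positions are the starts of the zero-width matches.
    cuts = [m.start() for m in re.finditer(pattern, string)]
    starts = [0] + cuts
    ends = cuts + [len(string)]
    return [string[s:e] for s, e in zip(starts, ends)]

# Lookbehind + lookahead: a position between a lowercase and an uppercase letter.
print(zerowidthsplit(r'(?<=[a-z])(?=[A-Z])', 'splitCamelCaseWords'))
# ['split', 'Camel', 'Case', 'Words']

# A match at position 0 yields a leading empty string.
print(zerowidthsplit(r'(?=bar)', 'barbar'))
# ['', 'bar', 'bar']
```

If the leading empty piece is unwanted, it can be filtered out afterwards. Whether to keep it depends on how closely the result should mirror a regular split.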
