Python regex: greedy pattern returning multiple empty matches

Problem description

This pattern is meant simply to grab everything in a string up until the first potential sentence boundary in the data:

[^\.?!\r\n]*

Output:

>>> import re
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print(matches)
['Australians go hard', '', '', '', '']

From the Python documentation:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern produces an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string; I just don't see how it would be doing that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.

Adding the ^ anchor has this effect:

>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print(matches)
['Australians go hard']

Since this eliminates the empty string matches, it would seem to indicate that said empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation with respect to the matches being returned in the order found (matches before the leading "A" should have been first) and, again, exactly four empty matches baffles me.

Solution

The * quantifier allows the pattern to capture a substring of length zero. In your original code version (without the ^ anchor in front), the additional matches are:

  • the zero-length string between the end of hard and the first !
  • the zero-length string between the first and second !
  • the zero-length string between the second and third !
  • the zero-length string between the third ! and the end of the text
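
One way to see those four positions directly is to ask for the span of each match instead of only its text. The following is a minimal sketch using finditer; the variable names (text, spans) are illustrative and not part of the original question:

```python
import re

pattern = re.compile(r"([^\.?!\r\n]*)")
text = "Australians go hard!!!"

# finditer yields match objects, so the (start, end) offsets of every
# match are available alongside the captured text.
spans = [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]

for start, end, value in spans:
    print(start, end, repr(value))

# Prints:
# 0 19 'Australians go hard'
# 19 19 ''
# 20 20 ''
# 21 21 ''
# 22 22 ''
```

Each empty match sits at an offset where the next character is "!" (or at the end of the text), so the character class cannot consume anything and the * quantifier settles for zero characters.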

You can slice/dice this further if you like here.

Adding that ^ anchor to the front now ensures that only a single substring can match the pattern, since the beginning of the input text occurs exactly once.
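
The effect of the anchor can be checked the same way. The sketch below also shows two alternatives that are not part of the original answer and are offered only as assumptions about what the asker may want: re.match, which only attempts a match at the start of the string, and a + quantifier, which cannot match the empty string.

```python
import re

text = "Australians go hard!!!"

# With ^, a match can only begin at offset 0, so the later
# zero-length candidates at offsets 19 to 22 are rejected.
anchored = re.compile(r"^([^\.?!\r\n]*)").findall(text)
print(anchored)  # ['Australians go hard']

# Alternative (assumption): match() tries the pattern at the start only.
first = re.match(r"[^\.?!\r\n]*", text).group(0)
print(first)  # Australians go hard

# Alternative (assumption): + needs at least one character,
# so empty matches never appear in the result.
non_empty = re.findall(r"[^\.?!\r\n]+", text)
print(non_empty)  # ['Australians go hard']
```

Note that if the re.MULTILINE flag were passed, ^ would also match just after each newline, so the anchored pattern could then match once per line rather than once per string.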
