Python regex -- extraneous matchings

Question

I want to split a string using -, +=, ==, =, +, and white-space as delimiters. I want to keep the delimiter unless it is white-space.

I've tried to achieve this with the following code:

def tokenize(s):
  import re
  # Raw string so the backslashes reach the regex engine unchanged
  pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
  return pattern.split(s)

print(tokenize("hello-+==== =+  there"))

I expected the output to be

['hello', '-', '+=', '==', '=', '=', '+', 'there']

but I got

['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']

Which is almost what I wanted, except that there are quite a few extraneous Nones and empty strings.

Why is it behaving this way, and how might I change it to get what I want?

Recommended answer

By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use). Without a capture group, your input would give:

['hello', '', '', '', '', '', '', '', 'there']

Note the empty strings in between - and +=, += and ==, etc.
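
To reproduce that list, here is a minimal sketch. It assumes the same alternatives as the question's pattern, wrapped in a non-capturing group, and the question's input string; the name pieces is only for illustration.

```python
import re

# Same delimiters as in the question, but inside a non-capturing group (?:...),
# so split() returns only the text found between matches.
pattern = re.compile(r"(?:-|\+=|==|=|\+)|\s+")

pieces = pattern.split("hello-+==== =+  there")
print(pieces)
# ['hello', '', '', '', '', '', '', '', 'there']
```

The eight delimiters sit directly next to one another, so there are seven zero-length gaps between them, and each gap shows up as an empty string.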

As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the bits that the capture group matches are interspersed with the pieces between the matches:

['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']

None corresponds to where the \s+ half of your pattern was matched; in those cases, the capture group captured nothing.
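
One way to see this, sketched here with re.finditer on the question's pattern and input (the name captures is only for illustration), is to list what group 1 holds for each match:

```python
import re

# The question's pattern: one capture group for the operators, plus \s+ outside it.
pattern = re.compile(r"(-|\+=|==|=|\+)|\s+")

# Collect the contents of group 1 for every match, in order.
captures = [m.group(1) for m in pattern.finditer("hello-+==== =+  there")]
print(captures)
# ['-', '+=', '==', '=', None, '=', '+', None]
```

The two None entries are the whitespace matches: the \s+ alternative matched, so group 1 did not take part in the match at all.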

From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard Nones and empty strings:

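def tokenize(s):
  import re
  # Raw string so the backslashes reach the regex engine unchanged
  pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
  # Drop the None and '' entries that split() leaves between adjacent matches
  return [ x for x in pattern.split(s) if x ]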
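
If filtering after the fact feels awkward, a different technique is to match the tokens themselves with re.findall instead of splitting on delimiters. This is only a sketch, not part of the original answer: the name tokenize_findall and the "anything else" character class are assumptions.

```python
import re

# Two-character operators come first so they win over their one-character
# prefixes; the last alternative is any run of characters that is neither
# whitespace nor one of the operator characters.
TOKEN = re.compile(r"\+=|==|-|=|\+|[^\s\-+=]+")

def tokenize_findall(s):
    # No capture groups, so findall() returns the full text of each match.
    return TOKEN.findall(s)

tokens = tokenize_findall("hello-+==== =+  there")
print(tokens)
# ['hello', '-', '+=', '==', '=', '=', '+', 'there']
```

Whitespace matches none of the alternatives, so findall simply skips over it and no empty strings or None values are produced.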

One last note: For what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options here.
