Python regex -- extraneous matchings
Question
I want to split a string using -, +=, ==, =, +, and white-space as delimiters. I want to keep the delimiter unless it is white-space.
I've tried to achieve this with the following code:
import re

def tokenize(s):
    # One capture group for the delimiters to keep; whitespace is left uncaptured.
    pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
    return pattern.split(s)

print(tokenize("hello-+==== =+ there"))
I expected the output to be
['hello', '-', '+=', '==', '=', '=', '+', 'there']
but I got
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
Which is almost what I wanted, except that there are quite a few extraneous Nones and empty strings.
Why is it behaving this way, and how might I change it to get what I want?
Recommended answer
By default, re.split returns a list of the pieces of the string that lie between the matches (as @Laurence Gonsalves notes, this is its main use). With a non-capturing group in place of your capture group, your input would give:
['hello', '', '', '', '', '', '', '', 'there']
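As a minimal sketch, the list above can be reproduced by running the question's pattern with its group made non-capturing. The variable names here are only illustrative, and the redundant backslashes from the original pattern are dropped.

```python
import re

s = "hello-+==== =+ there"

# (?:...) groups the alternatives without capturing them, so re.split
# returns only the text found between consecutive matches.
pieces = re.split(r"(?:-|\+=|==|=|\+)|\s+", s)
print(pieces)
```

The empty strings appear because the delimiters sit directly next to each other, so there is nothing between one match and the next.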
Note the empty strings in between - and +=, += and ==, etc.
As the docs explain, because you're using a capture group (i.e., because you're using (\-|\+\=|\=\=|\=|\+) instead of (?:\-|\+\=|\=\=|\=|\+)), the text that the capture group matches is interspersed with those pieces:
['hello', '-', '', '+=', '', '==', '', '=', '', None, '', '=', '', '+', '', None, 'there']
Each None corresponds to a place where the \s+ half of your pattern matched; in those cases, the capture group captured nothing.
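To see where the None values come from, one option is to inspect each match individually. The sketch below uses re.finditer on a short made-up input ("a = b" is an assumption for illustration, not from the question) and records the whole match next to what group 1 captured.

```python
import re

pattern = re.compile(r"(-|\+=|==|=|\+)|\s+")

# For every match, pair the full matched text with the content of group 1.
# When the \s+ alternative matches, group 1 did not participate and is None.
captured = [(m.group(0), m.group(1)) for m in pattern.finditer("a = b")]
print(captured)
```

re.split inserts exactly this group-1 value after each piece, which is why whitespace matches show up as None in its result.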
From looking at the docs for re.split, I don't see an easy way to have it discard empty strings in between matches, although a simple list comprehension (or filter, if you prefer) can easily discard the Nones and empty strings:
import re

def tokenize(s):
    pattern = re.compile(r"(\-|\+\=|\=\=|\=|\+)|\s+")
    # Both None and '' are falsy, so "if x" drops them and keeps real tokens.
    return [x for x in pattern.split(s) if x]
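A different approach, not the one used in this answer, is to match the tokens themselves with re.findall instead of splitting on delimiters. The sketch below assumes that a non-delimiter token is any run of characters other than whitespace, -, + and =. The name tokenize_findall is hypothetical.

```python
import re

# Delimiter alternatives come first (longest first), then a catch-all for
# runs of characters that are neither whitespace nor delimiter characters.
TOKEN = re.compile(r"-|\+=|==|=|\+|[^\s\-+=]+")

def tokenize_findall(s):
    # With no capture groups, findall returns the full text of each match;
    # whitespace matches nothing in the pattern and is simply skipped.
    return TOKEN.findall(s)

tokens = tokenize_findall("hello-+==== =+ there")
print(tokens)
```

For the input from the question this should yield the same tokens as the filtered split, without producing None or empty strings that then need removing.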
One last note: For what you've described so far, this will work fine, but depending on the direction your project goes, you may want to switch to a proper parsing library. The Python wiki has a good overview of some of the options.