Regex to get all text outside of brackets


Question

I'm trying to grab any text outside of brackets with a regex.

Example string:

Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]

I'm able to get the text inside the square brackets successfully with:

addrs = re.findall(r"\[(.*?)\]", example_str)
print(addrs)
['3996 COLLEGE AVENUE, SOMETOWN, MD 21003', '2560 OAK ST, GLENMEADE, WI 14098']

But I'm having trouble getting anything outside of the square brackets. I've tried something like the following:

names = re.findall(r"(.*?)\[.*\]+", example_str)

But this only finds the first name:

print(names)
['Josie Smith ']

So far I've only seen strings containing one or two name [address] combos, but I'm assuming there could be any number of them in a string.

Recommended answer

If there are no nested brackets, you can just do this:

re.findall(r'(.*?)\[.*?\]', example_str)
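As a quick check, here is a minimal sketch that runs this pattern against the example string from the question. The name example_str comes from the question; the whitespace cleanup at the end is an extra step added here for illustration only.

```python
import re

example_str = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
               "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]")

# Each match captures the text that precedes one bracketed section.
names = re.findall(r'(.*?)\[.*?\]', example_str)
print(names)  # ['Josie Smith ', 'Mugsy Dog Smith ']

# Optional cleanup (added for illustration): trim surrounding whitespace.
clean_names = [n.strip() for n in names]
print(clean_names)  # ['Josie Smith', 'Mugsy Dog Smith']
```

Note that any text after the last closing bracket is not captured by this pattern, because every match must end with a bracketed section.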


However, you don't even really need a regex here. Just split on brackets:

(s.split(']')[-1] for s in example_str.split('['))
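To see what the split approach yields, the sketch below materializes the generator as a list using the question's example string. The filtering of empty pieces is an assumption about what the asker wants, not part of the original answer.

```python
example_str = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
               "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]")

# Split on '[' first. Every chunk after the first starts with an address
# followed by ']', so the text after the last ']' is the part outside brackets.
outside = [s.split(']')[-1] for s in example_str.split('[')]
print(outside)  # ['Josie Smith ', 'Mugsy Dog Smith ', '']

# Assumed cleanup: drop empty pieces and trim whitespace.
names = [s.strip() for s in outside if s.strip()]
print(names)  # ['Josie Smith', 'Mugsy Dog Smith']
```

The trailing empty string appears because the example ends with a closing bracket, so nothing follows the final address.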


The only reason your attempt didn't work:

re.findall(r"(.*?)\[.*\]+", example_str)

… is that you were doing a greedy match within the brackets (.* rather than .*?), which means it was capturing everything from the first open bracket to the last close bracket, instead of capturing just the first pair of brackets.
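The difference between .* and .*? inside the brackets can be seen directly with re.search on the question's example string. This is only an illustrative sketch; the variable names greedy and lazy are introduced here.

```python
import re

example_str = ("Josie Smith [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]"
               "Mugsy Dog Smith [2560 OAK ST, GLENMEADE, WI 14098]")

# Greedy: .* runs to the LAST ']' in the string, swallowing the second name.
greedy = re.search(r"\[.*\]", example_str).group()
print(greedy)

# Non-greedy: .*? stops at the FIRST ']' it can, matching one pair of brackets.
lazy = re.search(r"\[.*?\]", example_str).group()
print(lazy)  # [3996 COLLEGE AVENUE, SOMETOWN, MD 21003]
```

Because the greedy version consumes both bracketed sections and everything between them in a single match, findall has nothing left to match afterwards and returns only the first name.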

Also, the + on the end seems wrong. If you had 'abc [def][ghi] jkl[mno]', would you want to get back ['abc ', '', ' jkl'], or ['abc ', ' jkl']? If the former, don't add the +. If the latter, do add it, but then you need to put the whole bracketed pattern in a non-capturing group: r'(.*?)(?:\[.*?\])+'.
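The two behaviours can be compared side by side. The sketch below uses the sample string from the paragraph above; the variable names are introduced here for illustration. It also shows why the group should be non-capturing: with a second capturing group, re.findall returns tuples instead of plain strings.

```python
import re

s = 'abc [def][ghi] jkl[mno]'

# No '+': each bracket pair ends its own match, so adjacent pairs
# produce an empty string between them.
without_plus = re.findall(r'(.*?)\[.*?\]', s)
print(without_plus)  # ['abc ', '', ' jkl']

# '+' on a non-capturing group: consecutive bracket pairs are consumed together.
with_plus = re.findall(r'(.*?)(?:\[.*?\])+', s)
print(with_plus)  # ['abc ', ' jkl']

# With a capturing group instead, findall yields one tuple per match.
with_capturing = re.findall(r'(.*?)(\[.*?\])+', s)
print(with_capturing)  # [('abc ', '[ghi]'), (' jkl', '[mno]')]
```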

If there might be additional text after the last bracket, the split method will work fine, or you could use re.split instead of re.findall… but if you want to adjust your original regex to work with that, you can.
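As a rough sketch of the re.split option, the example below uses a made-up string (not from the question) that has text after the last bracket. Both the sample string and the variable names are hypothetical.

```python
import re

# Hypothetical input with trailing text after the final bracket.
s = "Josie Smith [addr one] Mugsy Dog Smith [addr two] trailing note"

# re.split removes every bracketed section and returns what lies between them.
parts = re.split(r'\[.*?\]', s)
print(parts)  # ['Josie Smith ', ' Mugsy Dog Smith ', ' trailing note']

# The plain str.split approach gives the same pieces for this input.
plain = [chunk.split(']')[-1] for chunk in s.split('[')]
print(plain == parts)  # True
```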

In English, what you want is any (non-greedy) substring before either a bracket-enclosed substring or the end of the string, right?

So, you need an alternation between \[.*?\] and $. Of course you need to group that in order to write the alternation, and you don't want to capture the group. So:

re.findall(r"(.*?)(?:\[.*?\]|$)", example_str)
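A small sketch of this pattern on a hypothetical string with trailing text is shown below. One thing worth noting is that the $ alternative also produces a final empty match at the very end of the string, so filtering out empty results (an assumption about the desired output) is usually wanted.

```python
import re

# Hypothetical input: text remains after the last bracketed section.
s = 'abc [def] ghi'

matches = re.findall(r"(.*?)(?:\[.*?\]|$)", s)
print(matches)  # ['abc ', ' ghi', '']

# Assumed cleanup: drop empty matches and trim whitespace.
names = [m.strip() for m in matches if m.strip()]
print(names)  # ['abc', 'ghi']
```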
